Bft: Use MySQL to store Hierarchical Schemaless Data

Last year Bret Taylor wrote an interesting article about how FriendFeed uses MySQL to store schema-less data.

Our datastore stores schema-less bags of properties (e.g., JSON objects or Python dictionaries). The only required property of stored entities is id, a 16-byte UUID. The rest of the entity is opaque as far as the datastore is concerned. We can change the "schema" simply by storing new properties.

We index data in these entities by storing indexes in separate MySQL tables. If we want to index three properties in each entity, we will have three MySQL tables - one for each index. If we want to stop using an index, we stop writing to that table from our code and, optionally, drop the table from MySQL. If we want a new index, we make a new MySQL table for that index and run a process to asynchronously populate the index without disrupting our live service.

This seemed quite cool, so I've been working on a little weekend project that implements something similar. It's called Bft, as in "Big Fat Table". If you're interested, the code is in my public SVN repository:

http://svn.happy.cat/public/bft/Zend/lib/Bft

Features

  • Supports sharded databases through simple INI configuration.
  • Entities are stored as JSON encoded arrays of data along with a UUID.
  • Implemented as a Zend plugin using Zend_Db.
  • Supports ad-hoc indexes called 'collections'. Each collection can be related to a parent entity, so collections are hierarchical.
  • Management of shards by implementing your own ConnectionManager.

Bonus: ReSTful API

Given that indexes can be created on-the-fly, I made a simple ReST API to create, modify, retrieve and delete entities and collections. See documentation below.

Usage

Create a new instance of Bft

Alternatively, you can use your own connection manager to control how entities and collections map to database shards.

Create a new entity

Print the UUID of the contact

Create a collection, and add the entity

List the entities in a collection

List the IDs of entities in a collection

Check if a collection contains a particular entity
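
Taken together, a session looks something like this. It's a sketch: the method names below are my own shorthand for the steps above, not necessarily the real Bft API, so check the source for the actual calls.

  <?php
  // Create a new instance of Bft. The default connection manager reads
  // shard definitions from an INI file.
  $bft = new Bft($config);

  // Create a new entity: a schema-less bag of properties.
  $contact = $bft->saveEntity(array('name' => 'Robert', 'tel' => '555-1234'));

  // Print the UUID of the contact.
  echo $contact->getUuid() . "\n";

  // Create a collection, and add the entity.
  $bft->addToCollection('-contacts', $contact);

  // List the entities in a collection.
  $contacts = $bft->getEntities('-contacts');

  // List the IDs of entities in a collection.
  $ids = $bft->getEntityIds('-contacts');

  // Check if a collection contains a particular entity.
  $hasRobert = $bft->collectionContains('-contacts', $contact->getUuid());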

Using the API

Installation:

  1. Create a new host, and point the DocumentRoot at Api/public
  2. Create config file Api/config.ini (see Api/config.example.ini, and the sketch below)
  3. That's it.
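
What goes in config.ini depends on your shards; a minimal single-shard layout might look something like this (the key names are illustrative — copy Api/config.example.ini for the real ones):

  [shard0]
  adapter  = PDO_MYSQL
  host     = localhost
  dbname   = bft
  username = bft
  password = secret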

URLs contain collection names (which start with a dash, '-') or entity IDs (UUIDs). Here are some examples using Daniel Stenberg's curl utility, sketched after the list below.

  1. Create an entity within the collection '-contacts'
  2. Retrieve the contact
  3. Add a contact to the group '-contacts' AND the sub-group '-friends'
  4. List all contacts
  5. List only friends
  6. Update friend details
  7. Delete Robert from the friends list
  8. Note that Robert is still stored in the contacts list. To delete completely, you must delete him from the root list (contacts)
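
Sketches of those calls (the paths and payload format are my best guesses from the behaviour described above, so treat them as illustrative; the host is made up):

  # 1. Create an entity within the collection '-contacts'
  curl -X POST -d '{"name": "Robert"}' http://bft.example.com/-contacts

  # 2. Retrieve the contact, using the UUID returned by the previous call
  curl http://bft.example.com/3f2504e0-4f89-11d3-9a0c-0305e82c3301

  # 3. Add a contact to '-contacts' AND the sub-group '-friends'
  curl -X POST -d '{"name": "Robert"}' http://bft.example.com/-contacts/-friends

  # 4. List all contacts
  curl http://bft.example.com/-contacts

  # 5. List only friends
  curl http://bft.example.com/-contacts/-friends

  # 6. Update friend details
  curl -X PUT -d '{"name": "Robert Smith"}' \
      http://bft.example.com/-contacts/-friends/3f2504e0-4f89-11d3-9a0c-0305e82c3301

  # 7. Delete Robert from the friends list
  curl -X DELETE http://bft.example.com/-contacts/-friends/3f2504e0-4f89-11d3-9a0c-0305e82c3301

  # 8. Robert is still in '-contacts'; to delete completely, delete from the root list
  curl -X DELETE http://bft.example.com/-contacts/3f2504e0-4f89-11d3-9a0c-0305e82c3301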

Tests do exist but the coverage is low at the moment and the code is slightly raw. Please send in bugs and patches to adam at happy dot cat!


How to build a distributed social network

There has been a lot of recent interest in projects that aim to replace big, centralized social networks (like Facebook, Xanga and LinkedIn) with decentralized, open alternatives that give users greater control and choice. It's interesting to wonder whether they will succeed, and how it might be done.

Some notable efforts include Diaspora which is getting lots of interest following a highly successful funding drive on Kickstarter; Appleseed which is attracting lots of developer activity; GNU Social and many others which you can berate me in the comments for not mentioning :-)

A plethora of protocols already exist to serve the needs of an open social network, including XMPP (the basis for Google Wave), OAuth and OpenID for authentication, and OpenSocial; for data transfer we've got a huge choice: XFN, JSON, ReST, SOAP, RSS and Atom, to name a few.

Uptake: users, experimenters and developers

It seems that all the parts are sitting there, waiting to be assembled together into something that everyone will use. But it's not that simple.

A distributed social network (DSN) will face two main problems before it can begin to rival the big closed social networks: user takeup and install base. The first will happen when people see a genuine benefit of using the DSN over the walled garden they currently use. It won't be enough to simply mimic the features of Facebook, Myspace or LinkedIn. At the end of the day, ends will matter more than means, but there are ways that a DSN can take advantage of its architecture to do things that closed systems can't (more about that later).

As for privacy, a DSN will have to provide simple, reliable control over personal data. A system which bewilders users with lots of fine-grained data control options will not meet this need. People will naturally assume a level of privacy similar to email.

The install base will make or break a DSN. It would be wise to follow the example of Wordpress and make it extremely easy to set up, install and hack about with. Wordpress succeeded by leveraging the popularity of PHP and MySQL, which have enough developers to rival the People's Liberation Army. Today, scripting languages such as Python and Ruby are in a similar position to PHP 10 years ago, and VPSs are where shared hosting was 10 years ago: you can get your own virtual Linux server with a decent spec for under 20 USD - I have one with Linode [note: affiliate link]. A DSN should run on virtually any shared hosting environment and be easy to install and configure, even for people relatively new to Unix/Linux.

Data, storage & upgrades

When Facebook or LinkedIn roll out a new version, they can do so simultaneously for everyone. A DSN will not have this luxury, and will need to provide interoperability between systems running different versions. And when security holes are discovered, it will at times be necessary to perform swift upgrades, or block access to other systems running outdated software.

In my experience, one of the more painful parts of performing upgrades is upgrading the database. This usually involves a massive transactional script that creates and alters columns and tables and shuffles data around to fit the new format. Another method is to export all the data, alter the database schema, and re-import. In either case, you would need to support upgrades between any two versions of your database, which will swiftly become a maintenance nightmare.

Updating code is simple. You can just check out the latest version and point your webserver at the new directory. The bottleneck for upgrades is the database, so a successful DSN will probably use some kind of schemaless datastore that changes as little as possible between versions.

Security and reputation

When Diaspora released their eagerly-awaited source code, people were quick to point out the lack of security. Patrick McKenzie of Bingo Card Creator remarked:

The bottom line is currently there is nothing that you cannot do to someone’s Diaspora account, absolutely nothing.

I feel sorry for the Diaspora guys in this respect, because after the overwhelming response to their Kickstarter funding drive (even Mark Zuckerberg donated to the project), the pressure was on them to produce some usable software. Which they did very successfully; unfortunately, some security considerations had to take a back seat to achieve this. And despite their pointing out that it was a developer release, and that they were aware of security holes and bugs, the project received heavy criticism.

A general consensus arose that data security is not something that can be tacked on to a system, but needs to be built in from the start. While this sentiment has merit, I'm not sure it's entirely true. It's certainly much harder to add security later, but not impossible. With a bit of elbow grease, a lot of refactoring and a battery of tests, anything can be achieved.

So for the sake of ease of development and project reputation, a DSN should build in security from the beginning and pay close attention to security and privacy issues. Also:

  • Have a special 'VIP' route for collecting security-related bugs, and make it easy to use. Think: security@my-dsn-project.org
  • Respond swiftly to these issues. Get on top of the issue before The Register does.
  • Get people with security experience to work on the project
  • Develop a battery of security-related automated tests
  • Be open and transparent about security issues which affect the DSN
  • Clearly distinguish between versions considered secure and non-secure

TL;DR

I had planned to end this post by talking about the features it would be cool to have in a DSN, and also about more nuts-and-bolts issues, but I think I'll save them for another day. Here's a quick summary:

  1. Don't just mimic the features of Facebook, Myspace or LinkedIn. Do something new and compelling
  2. Provide simple, reliable control over personal data
  3. Make it as easy as Wordpress to install, set up and hack
  4. Provide interoperability between versions
  5. Make it easy to upgrade
  6. Build security in from the beginning. Make it easy to report security issues and handle them swiftly.

Sixty-nine byte haiku

I write just for you,
A sixty-nine byte haiku
It rhymes too! pad pad.


Rear View URLs

Rear-view mirror

The past couple of weeks I've been helping my neighbour set up an internet connection in his home. He's over 70 years old and this is his first-ever experience with email and navigating the web. It's been a real eye-opener to see him getting to grips with it all (to be honest, not very well so far) and my heart aches a bit as he struggles with basic internet concepts.

One of the first things I did was set him up with an email account, and we practiced sending each other messages. I showed him how to email his distant relatives in Brazil, but he's afraid to do so, because the telephone company might charge him extra. I've tried explaining that's not the case, but the psychological link between distance and price is hard to break.

When I returned for my second visit (to restart the router) I noticed he'd somehow managed to deactivate the location bar in Firefox, so I put it back. However, a few days later I returned to find the location bar as absent as before. He said that it was basically an over-technical distraction, so he had disabled it in the menu.

It seemed wrong to me, and at first I couldn't explain why. Eventually I told him that the location bar is basically like a rear-view mirror in a car. Sometimes you can end up on a site that isn't what it seems, and a quick glance at the address above the page will let you know if it's safe to proceed.

It seemed a nice comparison, but he didn't seem very convinced. Neither am I, come to think of it.


Getting the rowcount of a Sqlite resultset with SqlAlchemy

How database inconsistencies lead to a hack which makes my chest swell with pride.

I came across an interesting problem this morning when I was trying to get the rowcount of a Sqlite resultset. When using the MySQL engine, you can use the rowcount property to count the number of rows returned from a query, but strangely the same operation gives -1 when using a Sqlite database.

At first, I thought the problem was due to SqlAlchemy which I was using, and quickly made a script to isolate and compare:
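
The script was along these lines (a reconstruction: the MySQL DSN is a placeholder you'd point at a real database, and the API is the 0.5-era engine.execute style):

  import sqlalchemy
  from sqlalchemy import (Column, Integer, MetaData, String, Table,
                          create_engine)

  print("Testing mysql & sqlite rowcount with SqlAlchemy v%s"
        % sqlalchemy.__version__)

  engines = [('mysql', create_engine('mysql://user:pass@localhost/test')),  # placeholder DSN
             ('sqlite', create_engine('sqlite://'))]                        # in-memory

  print("Creating schemas")
  tables = {}
  for name, engine in engines:
      metadata = MetaData(bind=engine)
      tables[name] = Table('things', metadata,
                           Column('id', Integer, primary_key=True),
                           Column('name', String(50)))
      metadata.create_all()

  print("Inserting some data")
  for name, engine in engines:
      engine.execute(tables[name].insert(), name='widget')

  for name, engine in engines:
      result = engine.execute(tables[name].select())
      print("all_%s_rows.rowcount: %d" % (name, result.rowcount))
      result = engine.execute(tables[name].select())
      print("len(all_%s_rows.fetchall()): %d" % (name, len(result.fetchall())))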

Which results in:

  Testing mysql & sqlite rowcount with SqlAlchemy v0.5.8
  Creating schemas
  Inserting some data
  all_mysql_rows.rowcount: 1
  len(all_mysql_rows.fetchall()): 1
  all_sqlite_rows.rowcount: -1
  len(all_sqlite_rows.fetchall()): 1

After some digging around, I discovered that the problem is in fact due to pysqlite: Sqlite produces rows lazily, so pysqlite cannot know how many rows a SELECT returns until they have all been fetched, and the DB-API allows it to report -1 in that case.

Well, this was a bummer, because I wanted to write code that I could switch between Sqlite and MySQL - using MySQL in production/development, and Sqlite in-memory databases for unit-testing etc. Then it occurred to me: code for unit testing doesn't have to be extremely efficient, so perhaps I could monkey-patch SqlAlchemy's ResultProxy to make up for Sqlite's deficiency...

The hack
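
Roughly this (a sketch of the idea rather than the exact code; it assumes ResultProxy.rowcount is a plain property, which it is in the 0.5 series as far as I remember):

  from sqlalchemy.engine import base

  _orig_rowcount = base.ResultProxy.rowcount

  def _patched_rowcount(self):
      count = _orig_rowcount.__get__(self, base.ResultProxy)
      if count == -1:
          # pysqlite can't know the count without fetching, so fetch and count.
          # Fine for in-memory test databases, wasteful anywhere else.
          # (A full version would also patch fetchall() to replay these rows.)
          self._buffered_rows = self.fetchall()
          count = len(self._buffered_rows)
      return count

  base.ResultProxy.rowcount = property(_patched_rowcount)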


Web hooks for ejabberd and making XMPP bots

Making XMPP chat-bots as web applications

Over the summer last year I'd been mucking around a bit with XMPP, Erlang and ejabberd. I wanted to quickly make some XMPP chat-bots and, being a web-developer, it occurred to me it would be cool to have an HTTP backend to a chat server, so that I could treat incoming messages as simple HTTP requests. The idea was to:

  • have a standard XMPP server handle everything related to federation, S2S communication, encryption and authentication;
  • develop the bot as a traditional web app, that receives messages and spits out replies.

So I set about making an http-rpc module for ejabberd. Basically, the idea is that it would do the same as mod_rest but backwards: mod_rest allows you to POST stanzas to ejabberd via HTTP, but I wanted to post stanzas received by ejabberd to a restful webservice. I called this RPC gateway mod_motion (as in, the opposite of mod_rest :-).

Diagram of mod_motion

So here's a little guide to how I made mod_motion and how you can install it. First off, you will need to install the ejabberd server and some other packages:
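
On Debian or Ubuntu that means something like this (exact package names vary between releases):

  sudo apt-get install ejabberd erlang-dev build-essential subversion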

I don't want to go too much into setting up an ejabberd XMPP server, but if you are running your chat server publicly, you will need to create some DNS SRV records to make it 'discoverable' by other chat servers, as defined in RFC 3920. They should look something like this:

_xmpp-client._tcp  	SRV  	5 0 5222 yourdomain.com.
_xmpp-server._tcp 	SRV 	5 0 5269 yourdomain.com.

Now to the module. If you'd like to read a great introduction to making modules for ejabberd, have a look at these articles by Anders Conbere:

  1. Compiling Erlang
  2. Generic Modules
  3. HTTP Modules
  4. XMPP Bots

You'll also have to get hold of a copy of the ejabberd source in order to compile the module:
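
On Debian-ish systems the packaged source is the easiest to build against:

  apt-get source ejabberd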

You can download the module from my public SVN repository:

Mod_motion works by simply posting all message, presence and iq stanzas to a web server defined as BASE_URL in the module.

So presence stanzas will be posted to:

http://127.0.0.1:8080/presence/user@host

Likewise, messages will be posted to:

http://127.0.0.1:8080/message/user@host

And IQs get forwarded via HTTP POST to:

http://127.0.0.1:8080/iq/user@host

In each case, user@host is the JID of the remote user.

To install it you will just need to modify the modules section in /etc/ejabberd/ejabberd.cfg thus:
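
That is, add mod_motion to the module list (it takes no options here, since BASE_URL lives in the module itself):

  {modules,
   [
    %% ... existing modules ...
    {mod_motion, []}
   ]}.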

Now to compile and install the module with the script install.sh.

Here's a simple echobot, implemented with web.py; see also the examples folder.
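
A minimal version looks roughly like this. One assumption to flag: the reply is sent back into ejabberd by POSTing a stanza to mod_rest, which I'm assuming is installed on its default HTTP port.

  import urllib2
  import xml.dom.minidom

  import web

  # mod_motion POSTs incoming stanzas to /message/<jid>, /presence/<jid>
  # and /iq/<jid>. This bot only answers messages and ignores the rest.
  urls = ('/message/(.+)', 'EchoMessage',
          '/.*', 'Ignore')

  # Assumption: mod_rest is listening here, so we can POST replies into ejabberd.
  MOD_REST_URL = 'http://127.0.0.1:5280/rest'

  class EchoMessage:
      def POST(self, jid):
          stanza = xml.dom.minidom.parseString(web.data())
          bodies = stanza.getElementsByTagName('body')
          if bodies and bodies[0].firstChild is not None:
              text = bodies[0].firstChild.data
              # No XML escaping here - fine for a toy, not for production.
              reply = ('<message to="%s" type="chat"><body>%s</body></message>'
                       % (jid, text))
              urllib2.urlopen(MOD_REST_URL, reply)  # POST, since data is given
          return ''

  class Ignore:
      def POST(self, jid=None):
          return ''

  if __name__ == '__main__':
      web.application(urls, globals()).run()  # serves on port 8080 by default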

Here's the source of mod_motion:
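
Here is a skeleton of the idea, written against the ejabberd 2.x API from memory (the hook choice and helper names may well differ from the real module):

  -module(mod_motion).
  -behaviour(gen_mod).

  -export([start/2, stop/1, on_packet/3]).

  -define(BASE_URL, "http://127.0.0.1:8080").

  start(Host, _Opts) ->
      inets:start(),
      ejabberd_hooks:add(user_send_packet, Host, ?MODULE, on_packet, 50),
      ok.

  stop(Host) ->
      ejabberd_hooks:delete(user_send_packet, Host, ?MODULE, on_packet, 50),
      ok.

  %% Forward each stanza to BASE_URL/<kind>/<jid> as an HTTP POST, where
  %% <kind> is "message", "presence" or "iq".
  on_packet(From, _To, {xmlelement, Kind, _Attrs, _Els} = Packet) ->
      Jid = jlib:jid_to_string(jlib:jid_remove_resource(From)),
      Url = ?BASE_URL ++ "/" ++ Kind ++ "/" ++ Jid,
      Body = xml:element_to_string(Packet),
      http:request(post, {Url, [], "text/xml", Body}, [], []),
      ok.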

Credit: echo_bot.erl by Anders Conbere.


What Model-View-Controller really means

Model-View-Controller mental model

Model-View-Controller is an architectural pattern commonly used in software applications, which works like this:

  • The model provides a domain-specific representation of the data used by the application.
  • The view renders data to externally-usable formats, typically UI elements.
  • The controller accepts and handles foreign inputs (for example, user input), performs relevant operations on models and initiates a response.
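
In code, these three responsibilities can be as small as this sketch (plain Python, no framework):

  class Account(object):
      """Model: a domain-specific representation of the data."""
      def __init__(self, balance):
          self.balance = balance

      def deposit(self, amount):
          self.balance += amount

  def render_account(account):
      """View: renders the model in an externally-usable format."""
      return "Balance: %.2f" % account.balance

  def handle_deposit(account, user_input):
      """Controller: accepts foreign input, operates on the model,
      and initiates a response."""
      account.deposit(float(user_input))
      return render_account(account)

  print(handle_deposit(Account(10.0), "5"))  # -> Balance: 15.00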

If a recruiter asks you to explain MVC, you can recite these three points and ace the interview. But I think it kind of misses the whole point of MVC.

According to Trygve Reenskaug, who first described MVC in 1979: "The essential purpose of MVC is to bridge the gap between the human user's mental model and the digital model that exists in the computer...MVC was conceived as a general solution to the problem of users controlling a large and complex data set."

So MVC is as much about usability as it is about system architecture. Unlike other design patterns (such as those described by the Gang of Four), MVC is an 'outward looking' pattern that applies to an entire system. Its original purpose is to help users understand the workings of a system by providing a consistent mapping between the user's mental model and the domain- or business-logic.

It's hardly surprising then, that MVC has become such a popular pattern; it helps developers understand systems as much as it helps users.

There are about 1 million frameworks available that implement MVC along with related gadgetry: UI templating systems, object-relational mappers etc. Some frameworks provide strict enforcement of MVC's "rules", for example by prohibiting access to models from within the views. Whichever MVC system we use, we shouldn't lose sight of the original aim — to bridge the gap between the human user's mental model and the digital model. Something which no framework can do for you.

Update: at the behest of inn0 I encourage you to have a look at Trygve Reenskaug's article about the origins of MVC - it's a great read.


Data.gov.uk: Where's the Data?

The new data.gov.uk site has been launched, opening up government data for reuse by companies and individuals. Sounds great!

  1. Government publishes data using "open standards, open source and open data"
  2. Geeks from across the whole country get to work analysing, cross-referencing and building cool applications
  3. Everyone wins

The problem is it's still quite difficult to find really usable data. For example, take the first data set: 2008 Injury Road Traffic Collisions in Northern Ireland. What we actually get is a link to a landing page on the Police Service of Northern Ireland website, with links to everything from crime statistics to Workforce Composition Figures.

And it turns out they're all PDFs, a next-to-useless format for data processing.

Also, I have found many pages that are simply a placeholder for future data, such as the page on the Annual abstract of statistics which currently states "There is currently no text in this page."

Now I think it's a great start, and there are already some pretty cool apps available, but I think that data.gov.uk could do a better job of distinguishing between usable datasets, and placeholders or PDF reports.


Blogging with PyBlosxom

This blog started a few weeks ago as a standard Wordpress blog, and I quickly discovered I wanted something a bit more 'interesting' as a platform.

Sure, Wordpress ticks all the boxes and is incredibly easy to use, but part of me wanted something a little more ... nerdy

One of the things that I dislike about many blogging platforms is the very fact that they are web-based. It's not the web interface itself that annoys me so much as the workflow you are forced into. Rather than managing content through a WYSIWYG editor, I'd much rather edit my posts with my favourite text editor, on any computer, and manage everything remotely. I'd also like version control, so I can see the history of all my posts and additions.

I first experimented with Blosxom, which is based on the simple idea of dropping text files into a directory; a single perl script does the rest. However, I wanted something I could hack around with, and I knew that if I went the perl route my blog would soon be abandoned. Wait! I'm not a perl hater, I just wanted something a bit more ... pythonic :)

Enter PyBlosxom. Practically the same, but written in python. PyBlosxom doesn't seem to have as many users or plugins, but it does a good job all the same. I quickly ported the Carrington theme from Wordpress and set about configuring it.

I wrote a little script called genblog.sh that generates the whole site statically, and manages all the css and other static files. I can edit my blog posts wherever I am and simply commit them into svn. A simple "deployment" script is all it takes to update the site:
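
The deployment boils down to something like this (the paths and host here are placeholders):

  #!/bin/sh
  # pull the latest entries from svn, regenerate the site, push it live
  svn update ~/blog/entries
  ./genblog.sh                                   # static-renders everything
  rsync -az ~/blog/static/ adam@happy.cat:/var/www/blog/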

Additionally, PyBlosxom allows you to use different configurations by passing a command-line option to the pyblosxom.py script. This way I am able to generate a 'preview' of new posts before I publish them, for example:
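
Something along these lines (the flag spelling is illustrative - check pyblosxom.py --help for your version):

  python pyblosxom.py --config=preview_config.py --static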

You'll notice that the whole site is plain vanilla HTML. To be honest, disqus is much better than anything I could run locally myself, so there was no need to run PyBlosxom as a CGI script, but that is also an option.

I think it would be nice to see more development of things like PyBlosxom. Some people use github as a blog for the same reasons - a weblog with a hacker-friendly workflow. What do you use?


Web-scraping and Geo-locating Ticket Restaurants

For a number of years I have been a user of various restaurant ticket schemes. These are 'cheques' given out by employers to allow people to buy lunch in a participating restaurant. These schemes are quite popular in Spain and there are three major systems in use (along with others): Sodexo Pass, Ticket Restaurant and Cheque Gourmet.

One annoying problem is the difficulty of finding restaurants that participate in any of these schemes. It would be nice to have the different restaurants marked on a map, but each service offers only a fairly low-quality restaurant finder. In this web-scraping example, our aim is to scrape as much restaurant information as we can from the different sources and compile it all into a MySQL database. Useful information might include the name of the restaurant or bar, its address, phone number and geographical coordinates. Let's go!

NB: all the code for this exercise is available in my public SVN repository:
http://svn.happy.cat/public/restaurants/trunk/
You will need to install some dependencies: see install-deps.sh

The basic strategy involves 3 steps:

  1. Download raw data from public website
  2. Extract meaningful information
  3. Merge and load into useable data store

To do the downloading we use simple shell scripts and wget; for example (the URL below is made up - the real ones live in the repo scripts):
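
  #!/bin/sh
  # Fetch each results page in turn. The URL is illustrative only.
  for page in $(seq 1 50); do
      wget -O "sodexo-$page.html" "http://example.com/restaurantes?page=$page"
      sleep 2  # be polite: delay between requests
  done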

Easy peasy. In the Mobiticket case (a WAP service offered by Ticket Restaurant) we need two steps: first download an index, and then download the individual restaurant pages one by one. To avoid clobbering these services, we put a delay between each HTTP request.

Now comes the tricky part. To extract meaningful data from the HTML or WML that we downloaded, we resort to python and some simple regular expressions. We could probably have used BeautifulSoup or some other kind of XML processing along with XPath. I throw my hands up - I have no defence. In fact, if you are really getting serious about scraping, have a look at Scrapy.

Anyway, the files *-extract-to-tab.py simply churn through the data and spit out tabular data. For example:
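
The gist of these scripts is a regex-driven churn like this (the pattern and field names are illustrative, not the real ones):

  import re
  import sys

  # one <td>-pair per restaurant; illustrative pattern only
  ROW_RE = re.compile(r'<td class="name">(.*?)</td>\s*'
                      r'<td class="address">(.*?)</td>', re.S)

  html = sys.stdin.read()
  for match in ROW_RE.finditer(html):
      fields = [re.sub(r'\s+', ' ', f).strip() for f in match.groups()]
      print('\t'.join(fields))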

So we run our scripts like this:
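
That is, roughly (the exact script names are in the repo):

  python sodexo-extract-to-tab.py     < sodexo.html  > sodexo.tab
  python mobiticket-extract-to-tab.py < pages.list   > mobiticket.tab
  python gourmet-extract-to-tab.py    < gourmet.html > gourmet.tab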

And we should have three tabular files which we can import into our database.

At this stage we will need to set up a database table to store the data and create a config file. See the example config, and set up the table with the restaurants.sql script.

That should give us a nice big table of some 66,725 restaurants. Olé!

But wouldn't it be nice to be able to put these on a map? Fortunately, we've already been able to pinpoint around 10,000 restaurants, because Mobiticket included a map in their application, so we could scrape the latitude and longitude. But this accounts for less than 18% of the restaurants we know about. For the rest, Google provides a geocoding service which we can use to obtain the coordinates for a given street address.

To use this service you need an API key which you should put in your config.py. Then run:
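
  python geo-locate.py   # script name is a guess - see the repo for the real one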

and go and have a cup of tea. In fact you'll have to wait quite a while, because your database contains around 45,000 restaurants that need to be geo-located, and Google places a limit of 15,000 geocode requests in a 24 hour period from a single IP address. For this reason we have to put a 6 second delay between each API request.
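
The heart of the script is a throttled loop along these lines (this uses the old Google Maps CSV geocoder of the day; endpoint and response format quoted from memory):

  import time
  import urllib

  from config import API_KEY  # your Google Maps API key, from config.py

  def geocode(address):
      # The CSV geocoder answers "status,accuracy,latitude,longitude"
      url = ('http://maps.google.com/maps/geo?q=%s&output=csv&key=%s'
             % (urllib.quote(address), API_KEY))
      status, _accuracy, lat, lng = urllib.urlopen(url).read().split(',')
      if status != '200':
          return None
      return float(lat), float(lng)

  for address in open('addresses.txt'):
      print(geocode(address.strip()))
      time.sleep(6)  # 86400 / 6 = 14,400 requests/day, under the 15,000 limit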

Anyway, I hope you enjoyed this little detour into web scraping. As you can see, the basic pattern is:

  • download data from various sources
  • process and normalize
  • collate into a central store

Like I mentioned, there are frameworks like Scrapy that make all this easy; personally I like to split the work into various smaller steps, so that I can introduce 'save points' - in Scrapy, you'd do this in the item pipeline. There's a great tutorial to get you going.