I write just for you,
A sixty-nine byte haiku
It rhymes too! pad pad.
Sixty-nine byte haiku
Posted on Thu, 25 Mar 2010
Rear View URLs
The past couple of weeks I've been helping my neighbour set up an internet connection in his home. He's over 70 years old and this is his first ever experience with email and navigating the web. It's been a real eye-opener to see him getting to grips with it all (to be honest, not very well so far) and my heart aches a bit as he struggles with basic internet concepts.
One of the first things I did was set him up an email account, and we practiced sending each other messages. I showed him how to email his distant relatives in Brazil, but he's afraid to do so, because the telephone company might charge him extra. I've tried explaining that's not the case, but the psychological link between distance and price is hard to break.
When I returned for my second visit (to restart the router) I noticed he'd somehow managed to deactivate the location bar in Firefox, so I put it back. However, a few days later I returned to find the location bar as absent as before. He said that it was basically an over-technical distraction, so he disabled it in the menu.
It seemed wrong to me, and at first I couldn't explain why. Eventually I told him that the location bar is basically like a rear-view mirror in a car. Sometimes you can end up on a site that isn't what it seems, and a quick glance at the address above the page will let you know if it's safe to proceed.
It seemed a nice comparison, but he didn't seem very convinced. Neither am I, come to think of it.
Posted on Thu, 04 Mar 2010
Getting the rowcount of a Sqlite resultset with SqlAlchemy
How database inconsistencies lead to a hack which makes my chest swell with pride.
I came across an interesting problem this morning when I was trying to get the rowcount of a Sqlite resultset. When using the MySQL engine, you can use the rowcount property to count the number of rows returned from a query, but strangely the same operation gives -1 when using a sqlite database.
At first, I thought the problem was due to SqlAlchemy which I was using, and quickly made a script to isolate and compare:
Which results in:
Testing mysql & sqlite rowcount with SqlAlchemy v0.5.8
Creating schemas
Inserting some data
all_mysql_rows.rowcount: 1
len(all_mysql_rows.fetchall()): 1
all_sqlite_rows.rowcount: -1
len(all_sqlite_rows.fetchall()): 1
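The sqlite half of that result can be reproduced without SqlAlchemy or MySQL at all. This is a minimal sketch using only Python's standard sqlite3 module; the table and column names are made up for illustration and are not taken from the comparison script.

```python
import sqlite3

# In-memory database with a single row; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

cursor = conn.execute("SELECT * FROM users")
select_rowcount = cursor.rowcount        # the driver reports -1 for a SELECT
fetched_count = len(cursor.fetchall())   # counting the fetched rows still works

print("rowcount:", select_rowcount)
print("len(fetchall()):", fetched_count)
```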
After some digging around, I discovered that the problem is in fact due to pysqlite, and I quote:
Well this was a bummer, because I wanted to write code that I could switch between Sqlite and MySQL - using MySQL in production/development, and sqlite in-memory databases for unit-testing etc. Then it occurred to me: code for unit testing doesn't have to be extremely efficient, so perhaps I could monkey-patch SqlAlchemy's ResultProxy to make up for sqlite's deficiency...
The hack
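As a rough illustration of the idea (not the actual ResultProxy patch), the sketch below trades efficiency for a usable count by reading every row up front. It uses the standard sqlite3 module rather than SqlAlchemy, and the BufferedResult class and the users table are hypothetical names invented for this example.

```python
import sqlite3


class BufferedResult:
    """Hypothetical wrapper: reads every row up front so the count is known."""

    def __init__(self, cursor):
        self._rows = cursor.fetchall()    # acceptable for small test data sets
        self.rowcount = len(self._rows)   # stands in for the driver's -1

    def fetchall(self):
        # Hand over the buffered rows once, like a real cursor would.
        rows, self._rows = self._rows, []
        return rows

    def __iter__(self):
        return iter(self.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

result = BufferedResult(conn.execute("SELECT * FROM users"))
print(result.rowcount)
```

Buffering everything in memory would be a poor choice for production queries, which is why it only makes sense for the unit-testing case described above.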
Posted on Fri, 12 Feb 2010
Web hooks for ejabberd and making XMPP bots
Making XMPP chat-bots as web applications
Over summer last year I'd been mucking around a bit with XMPP, Erlang and ejabberd. I wanted to quickly make some XMPP chat-bots and, being a web-developer, it occurred to me it would be cool to have an HTTP backend to a chat server, so that I could treat incoming messages as simple HTTP requests. The idea was to:
- have a standard XMPP server handle everything related to federation, S2S communication, encryption and authentication;
- develop the bot as a traditional web app, that receives messages and spits out replies.
So I set about making an http-rpc module for ejabberd. Basically, the idea is that it would do the same as mod_rest but backwards: mod_rest allows you to POST stanzas to ejabberd via HTTP, but I wanted to post stanzas received by ejabberd to a restful webservice. I called this RPC gateway mod_motion (as in, the opposite of mod_rest :-).

So here's a little guide to how I made mod_motion and how you can install it. First off, you will need to install the ejabberd server and some other packages:
I don't want to go too much into setting up an ejabberd XMPP server, but if you are running your chat server publicly, you will need to create some DNS SRV records to make it 'discoverable' by other chat servers, as defined in RFC 3920. They should look something like this:
_xmpp-client._tcp SRV 5 0 5222 yourdomain.com.
_xmpp-server._tcp SRV 5 0 5269 yourdomain.com.
Now to the module. If you'd like to read a great introduction to making modules for ejabberd, have a look at these articles by Anders Conbere:
You'll also have to get hold of a copy of the ejabberd source in order to compile the module:
You can download the module from my public SVN repository:
Mod_motion works by simply posting all messages, presence and iq stanzas to a web server defined as BASE_URL in the module.
So presence stanzas will be posted to:
http://127.0.0.1:8080/presence/user@host
Likewise, messages will be posted to:
http://127.0.0.1:8080/message/user@host
And IQs get forwarded via HTTP POST to:
http://127.0.0.1:8080/iq/user@host
In each case, user@host is the JID of the remote user.
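The mapping from stanza type and JID to URL can be sketched in a few lines. This is only an illustration of the URL scheme described above: the hook_url function and the STANZA_TYPES tuple are hypothetical names, and the escaping of unusual characters in the JID is an assumption, not something mod_motion is known to do.

```python
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8080"   # same example address as in the text
STANZA_TYPES = ("message", "presence", "iq")


def hook_url(stanza_type, jid, base_url=BASE_URL):
    """Return the URL a stanza of the given type from `jid` would be posted to."""
    if stanza_type not in STANZA_TYPES:
        raise ValueError("unknown stanza type: %r" % stanza_type)
    # Keep "@" readable, as in the example URLs; escaping the rest is an assumption.
    return "%s/%s/%s" % (base_url.rstrip("/"), stanza_type, quote(jid, safe="@"))


for kind in STANZA_TYPES:
    print(hook_url(kind, "user@host"))
```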
To install it you will just need to modify the modules section in /etc/ejabberd/ejabberd.cfg thus:
Now compile and install the module with the script install.sh.
Here's a simple echobot, implemented with web.py; see also the examples folder.
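For readers without web.py to hand, the same shape of bot can be sketched as a plain WSGI application using only the standard library. This is not the web.py echobot itself. It assumes, without confirmation from the module, that the POST body carries the incoming message and that whatever the web app returns is used as the reply; echo_app and serve are hypothetical names.

```python
from wsgiref.simple_server import make_server


def echo_app(environ, start_response):
    """Echo the body of any POST to /message/<jid> back to the caller."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/", 1)
    is_message = len(parts) == 2 and parts[0] == "message"
    if environ.get("REQUEST_METHOD") != "POST" or not is_message:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    # Assumption: the request body is the incoming message, returned unchanged.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


def serve(host="127.0.0.1", port=8080):
    """Run the bot on the address used in the example URLs above."""
    make_server(host, port, echo_app).serve_forever()
```

Calling serve() starts the bot on port 8080; presence and iq requests simply get a 404 in this sketch.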
Here's the source of mod_motion:
Credit: echo_bot.erl by Anders Conbere.
Posted on Mon, 01 Feb 2010
What Model-View-Controller really means
Model-View-Controller is an architectural pattern commonly used in software applications, which works like this:
- The model provides a domain-specific representation of the data used by the application.
- The view renders data to externally-usable formats, typically UI elements.
- The controller accepts and handles foreign inputs (for example, user input), performs relevant operations on models and initiates a response.
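The three roles above can be sketched with a deliberately tiny example. The counter domain and every class and function name here are invented for illustration; real frameworks divide the work in more elaborate ways.

```python
class CounterModel:
    """Model: the domain data and the operations that are valid on it."""

    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1


def render_counter(model):
    """View: turn model state into something the user can read."""
    return "Count: %d" % model.count


class CounterController:
    """Controller: accept input, update the model, initiate a response."""

    def __init__(self, model):
        self.model = model

    def handle(self, command):
        if command == "increment":
            self.model.increment()
        # Unknown commands leave the model untouched and just re-render it.
        return render_counter(self.model)


controller = CounterController(CounterModel())
print(controller.handle("increment"))
```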
If a recruiter asks you to explain MVC, you can recite these three points and ace the interview. But I think it kind of misses the whole point of MVC.
According to Trygve Reenskaug, who first described MVC in 1979: "The essential purpose of MVC is to bridge the gap between the human user's mental model and the digital model that exists in the computer...MVC was conceived as a general solution to the problem of users controlling a large and complex data set."
So MVC is as much about usability as it is about system architecture. Unlike other design patterns (such as those described by the Gang of Four), MVC is an 'outward looking' pattern, that applies to an entire system. Its original purpose is to help users to understand the working of a system by providing a consistent mapping between the user's mental model, and the domain- or business-logic.
It's hardly surprising then, that MVC has become such a popular pattern; it helps developers understand systems as much as it helps users.
There are about 1 million frameworks available that implement MVC along with related gadgetry: UI templating systems, Object-relational mappers etc. Some frameworks provide strict enforcement of MVC's "rules", for example by prohibiting access to models from within the views. Whichever MVC-system we use, we shouldn't lose sight of the original aim — to bridge the gap between the human user's mental model and the digital model. Something which no framework can do for you.
Update: at the behest of inn0 I encourage you to have a look at Trygve Reenskaug's article about the origins of MVC - it's a great read.
Posted on Fri, 22 Jan 2010
Data.gov.uk: Where's the Data?
The new data.gov.uk site has been launched, which opens up government data for reuse by companies and individuals. Sounds great!
- Government publishes data using "open standards, open source and open data"
- Geeks from across the whole country get to work analysing, cross-referencing and building cool applications
- Everyone wins
The problem is it's still quite difficult to find really usable data. For example, take the first data set: 2008 Injury Road Traffic Collisions in Northern Ireland. What we actually get is a link to a landing page on the Police Service of Northern Ireland website with links to statistics on everything from crime statistics to Workforce Composition Figures.
And it turns out they're all PDFs, a next-to-useless format for data processing.
Also, I have found many pages that are simply a placeholder for future data, such as the page on the Annual abstract of statistics which currently states "There is currently no text in this page."
Now I think it's a great start, and there are already some pretty cool apps available, but I think that data.gov.uk could do a better job of distinguishing between usable datasets, and placeholders or PDF reports.
Posted on Thu, 21 Jan 2010
Blogging with PyBlosxom
This blog started a few weeks ago as a standard Wordpress blog, and I quickly discovered I wanted something a bit more 'interesting' as a platform.
Sure, Wordpress ticks all the boxes and is incredibly easy to use, but part of me wanted something a little more ... nerdy.
One of the things that I dislike about many blogging platforms is the very fact that they are web-based. It's not the web interface that annoys me so much as the workflow you are forced into. Rather than managing content through a WYSIWYG editor, I'd much rather edit my posts with my favourite text editor, on any computer, and remotely manage everything. I'd also like version control, and to be able to see the history of all my posts and additions.
I first experimented with Blosxom which is based on the simple idea of dropping text files into a directory, and a single perl script does the rest. However, I wanted something I could hack around with and I knew that if I went the perl route my blog would soon be abandoned. Wait! I'm not a perl hater, I just wanted something a bit more ... pythonic :)
Enter PyBlosxom. Practically the same, but written in python, PyBlosxom doesn't seem to have as many users or plugins, but it does a good job all the same. I quickly ported the Carrington theme from Wordpress and set about configuring it.
I wrote a little script called genblog.sh that generates the whole site statically, and manages all the css and other static files. I can edit my blog posts wherever I am and simply commit them into svn. A simple "deployment" script is all it takes to update the site:
Additionally, PyBlosxom allows you to use different configurations by passing a command-line option to the pyblosxom.py script. This way I am able to generate a 'preview' of new posts before I publish them, for example:
You'll notice that the whole site is plain vanilla HTML. To be honest, Disqus is much better than anything I could run locally myself, so there was no need to run PyBlosxom as a CGI script, but that is also an option.
I think it would be nice to see more development of things like PyBlosxom. Some people use github as a blog for the same reasons - a weblog with a hacker-friendly workflow. What do you use?
Posted on Wed, 20 Jan 2010
Web-scraping and Geo-locating Ticket Restaurants
For a number of years I have been a user of various restaurant ticket schemes. These are 'cheques' given out by employers to allow people to buy lunch in a participating restaurant. These schemes are quite popular in Spain and there are three major systems in use (along with others): Sodexo Pass, Ticket Restaurant and Cheque Gourmet.
One annoying problem is the difficulty of finding restaurants that participate in any of these schemes. It would be nice to have different restaurants marked on a map, but each service offers a fairly low-quality restaurant finder. In this web-scraping example, our aim is to scrape as much restaurant information as we can from different sources, and compile it all into a MySQL database. Useful information might include the name of the restaurant or bar, its address, phone number and geographical coordinates. Let's go!
NB: all the code for this exercise is available in my public
SVN repository:
http://svn.happy.cat/public/restaurants/trunk/
You will need to install some dependencies: see install-deps.sh
The basic strategy involves 3 steps:
- Download raw data from public website
- Extract meaningful information
- Merge and load into useable data store
Easy peasy. In the Mobiticket case (a WAP service offered by Ticket Restaurant) we need two steps: first to download an index, and then to download the individual restaurant pages one by one. To avoid clobbering these services, we put a delay between each HTTP request.
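The delay between requests can be sketched as follows. This is an illustration of the throttling idea only, not the download scripts from the repository: the function names and the two-second default are assumptions, and the fetch and sleep functions are passed in as parameters so the loop can be tried out without touching the network.

```python
import time
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, delay_seconds=2.0, fetch_page=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to spare the server."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)   # no need to wait before the first request
        pages[url] = fetch_page(url)
    return pages
```

For the two-step case, the same function can be called once for the index and then again with the list of restaurant page URLs extracted from it.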
Now comes the tricky part. To extract meaningful data from the HTML or WML that we downloaded, we resort to using Python and some simple regular expressions. We could probably have used BeautifulSoup or some other kind of XML processing along with XPath. I throw my hands up - I have no defence. In fact, if you are really getting serious about scraping, have a look at Scrapy.
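To give a flavour of the regular-expression approach, here is a minimal sketch. The markup in SAMPLE_HTML is invented for the example, as are the restaurant names and the pattern itself; the real pages are structured differently and need their own expressions.

```python
import re

# Hypothetical markup; the real pages will differ.
SAMPLE_HTML = """
<div class="restaurant"><h2>Bar Pepe</h2><p class="addr">Calle Mayor 1, Madrid</p><p class="tel">910000000</p></div>
<div class="restaurant"><h2>Casa Lola</h2><p class="addr">Gran Via 22, Madrid</p><p class="tel">910000001</p></div>
"""

RESTAURANT_RE = re.compile(
    r'<h2>(?P<name>.*?)</h2>\s*'
    r'<p class="addr">(?P<address>.*?)</p>\s*'
    r'<p class="tel">(?P<phone>.*?)</p>',
    re.DOTALL,
)


def extract_rows(html):
    """Yield one tab-separated line (name, address, phone) per restaurant found."""
    for match in RESTAURANT_RE.finditer(html):
        fields = (match.group("name"), match.group("address"), match.group("phone"))
        # Tabs or newlines inside a field would corrupt the table, so squash whitespace.
        yield "\t".join(" ".join(field.split()) for field in fields)


for row in extract_rows(SAMPLE_HTML):
    print(row)
```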
Anyway, the files *-extract-to-tab.py simply churn through the data and spit out tabular data. For example:
So we run our scripts like this:
And we should have three tabular files which we can import into our database.
At this stage we will need to set up a database table to store the data and create a config file. See the example config and set up the table with the restaurants.sql script.
That should give us a nice big table of some 66,725 restaurants. Olé!
But wouldn't it be nice to be able to put these on a map? Fortunately, we've been able to pinpoint around 10,000 restaurants because Mobiticket included a map in their application, so we could scrape the latitude and longitude. But this accounts for less than 18% of the restaurants we know about. However, Google provides a geocoding service which we can use to look up the coordinates for a given street address.
To use this service you need an API key which you should put in your config.py. Then run:
and go and have a cup of tea. In fact you'll have to wait quite a while, because your database contains around 45,000 restaurants that need to be geo-located, and Google places a limit of 15,000 geocode requests in a 24-hour period from a single IP address. For this reason we have to put a 6-second delay between each API request.
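The 6-second figure and the length of the wait both follow from the numbers above. The short calculation below only restates that arithmetic; the constant names are mine, and the 45,000 figure is the approximate count quoted in the text.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
DAILY_LIMIT = 15000   # geocode requests allowed per 24 hours, as quoted above
PENDING = 45000       # restaurants still lacking coordinates (approximate)

min_delay = SECONDS_PER_DAY / DAILY_LIMIT   # smallest gap that respects the limit
delay = math.ceil(min_delay)                # rounded up to whole seconds
total_hours = PENDING * delay / 3600        # time to work through the backlog

print("minimum delay: %.2f s" % min_delay)
print("delay used: %d s" % delay)
print("total time: %.1f hours" % total_hours)
```

So the minimum gap is 5.76 seconds, which rounds up to 6, and the full run takes about 75 hours, a little over three days.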
Anyway, I hope you enjoyed this little detour into web scraping. As you can see, the basic pattern is:
- download data from various sources
- process and normalize
- collate into a central store
Like I mentioned, there are frameworks like Scrapy that make it easy; personally I like to split the work into various smaller steps, so that I can introduce 'save points'. In Scrapy remember to do this in the item pipeline. There's a great tutorial to get you going.
Posted on Tue, 12 Jan 2010
What this blog is about
So after a point-by-point rebuttal of the article by the unfortunate Sr. Civantos, here’s a sneak preview of what we’re in for:
Chart created with Create A Graph.
Posted on Mon, 21 Dec 2009
First post: Christmas, spam and blogs
With Christmas upon us, email circulars are reaching a crescendo, but one particular example of blog-spam caught my eye. It displays all the classic traits of low quality journalism:
- recycled ‘news’ from a foreign source;
- total lack of fact-checking, to the point of comedy;
- controversial or outrageous storyline;
- seasonal or topical content;
- large chorus of ignorant adherents chanting in the comments.
The piece was written by one Daniel Civantos, who treats with great delicacy and care the thorny subject of religious freedom. Apparently an 8-year-old boy was expelled from school and sent to a psychologist for drawing Jesus on the cross. So let’s investigate Daniel’s journalistic capabilities…
- Our investigative hero kicks off describing the outrage of “a father from Taunton (United Kingdom)“, and helpfully links to the original article in the Taunton Daily Gazette: a newspaper published in Taunton, Massachusetts (USA). Whoops! Still, only 5122km off the mark [1].
- Said father’s 8-year-old son was reportedly sent home from school. Not true, as the story has since been debunked by boston.com.
- The drawing in question supposedly depicted Jesus on the cross with “X”s for eyes. However, it has since been learned that the child claimed that the drawing represented himself.
The father himself sums up the situation:
“It hurts me that they did this to my kid,” Chester Johnson, the boy’s father, told the Globe. “They can’t mess with our religion; they owe us a small lump sum for this.”
(So this had nothing at all to do with money, then.)
I suppose that if I can do a better job of informing half the readership of our super-sleuth Daniel, I will be satisfied that I have done a good job.
More to follow!
[1] http://www.geobytes.com/CityDistanceTool.htm
Posted on Mon, 21 Dec 2009
