
Web hooks for XMPP / ejabberd

Making XMPP chat-bots as web applications

Over summer last year I'd been mucking around a bit with XMPP, Erlang and ejabberd. I wanted to quickly make some XMPP chat-bots and, being a web developer, it occurred to me it would be cool to have an HTTP backend to a chat server, so that I could treat incoming messages as simple HTTP requests. The idea was to:

  • have a standard XMPP server handle everything related to federation, S2S communication, encryption and authentication;
  • develop the bot as a traditional web app that receives messages and spits out replies.

So I set about making an HTTP-RPC module for ejabberd. Basically, the idea is that it would do the same as mod_rest but backwards: mod_rest allows you to POST stanzas to ejabberd via HTTP, but I wanted to post stanzas received by ejabberd to a RESTful web service. I called this RPC gateway mod_motion (as in, the opposite of mod_rest :-).

Diagram of mod_motion

So here's a little guide to how I made mod_motion and how you can install it. First off, you will need to install the ejabberd server and some other packages:

I don't want to go too much into setting up an ejabberd XMPP server, but if you are running your chat server publicly, you will need to create some DNS SRV records to make it 'discoverable' by other chat servers, as defined in RFC 3920. They should look something like this:

_xmpp-client._tcp  	SRV  	5 0 5222 yourdomain.com.
_xmpp-server._tcp 	SRV 	5 0 5269 yourdomain.com.

Now to the module. If you'd like to read a great introduction to making modules for ejabberd, have a look at these articles by Anders Conbere:

  1. Compiling Erlang
  2. Generic Modules
  3. HTTP Modules
  4. XMPP Bots

You'll also have to get hold of a copy of the ejabberd source in order to compile the module:

You can download the module from my public SVN repository:

Mod_motion works by simply posting all message, presence and IQ stanzas to a web server defined as BASE_URL in the module.

So presence stanzas will be posted to:

http://127.0.0.1:8080/presence/user@host

Likewise, messages will be posted to:

http://127.0.0.1:8080/message/user@host

And IQs get forwarded via HTTP POST to:

http://127.0.0.1:8080/iq/user@host

In each case, user@host is the JID of the remote user.
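
As a rough sketch of how this routing looks from the web application's side, the helpers below build the URL for a stanza and split an incoming request path back into its parts. The function names are hypothetical and not part of mod_motion; only the URL layout is taken from the description above.

```python
from urllib.parse import quote, unquote

BASE_URL = "http://127.0.0.1:8080"  # plays the same role as BASE_URL in the module
STANZA_TYPES = ("presence", "message", "iq")


def hook_url(stanza_type, jid, base_url=BASE_URL):
    """Build the URL that a stanza of the given type would be posted to."""
    if stanza_type not in STANZA_TYPES:
        raise ValueError("unknown stanza type: %r" % stanza_type)
    # Keep '@' readable, as in the URLs above; escape anything else unsafe.
    return "%s/%s/%s" % (base_url.rstrip("/"), stanza_type, quote(jid, safe="@"))


def parse_hook_path(path):
    """Split a request path such as '/message/user@host' into (type, jid)."""
    parts = path.strip("/").split("/", 1)
    if len(parts) != 2 or parts[0] not in STANZA_TYPES:
        raise ValueError("not a hook path: %r" % path)
    return parts[0], unquote(parts[1])
```

A web app can call parse_hook_path on each incoming request to decide which handler should deal with the POSTed stanza.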

To install it you will just need to modify the modules section in /etc/ejabberd/ejabberd.cfg thus:

Now to compile and install the module with the script install.sh.

Here's a simple echobot, implemented with web.py; see also the examples folder.

Here's the source of mod_motion:

Credit: echo_bot.erl by Anders Conbere.


What Model-View-Controller really means

Model-View-Controller is an architectural pattern commonly used in software applications, which works like this:

  • The model provides a domain-specific representation of the data used by the application.
  • The view renders data to externally-usable formats, typically UI elements.
  • The controller accepts and handles foreign inputs (for example, user input), performs relevant operations on models and initiates a response.
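
To make the three roles concrete, here is a deliberately tiny sketch. The counter example and all the names in it are made up for illustration and are not tied to any particular framework.

```python
class CounterModel:
    """Model: the domain data and the operations allowed on it."""

    def __init__(self):
        self.count = 0

    def increment(self, amount=1):
        self.count += amount


def counter_view(model):
    """View: render the model into an externally usable format (plain text)."""
    return "Count: %d" % model.count


class CounterController:
    """Controller: accept outside input, update the model, initiate a response."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def handle(self, command):
        if command == "increment":
            self.model.increment()
        # Unknown commands leave the model untouched; the view is always rendered.
        return self.view(self.model)
```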

If a recruiter asks you to explain MVC, you can recite these three points and ace the interview. But I think it kind of misses the whole point of MVC.

According to Trygve Reenskaug, who first described MVC in 1979: "The essential purpose of MVC is to bridge the gap between the human user's mental model and the digital model that exists in the computer...MVC was conceived as a general solution to the problem of users controlling a large and complex data set."

So MVC is as much about usability as it is about system architecture. Unlike other design patterns (such as those described by the Gang of Four), MVC is an 'outward looking' pattern that applies to an entire system. Its original purpose is to help users understand the workings of a system by providing a consistent mapping between the user's mental model and the domain or business logic.

It's hardly surprising then, that MVC has become such a popular pattern; it helps developers understand systems as much as it helps users.

There are about 1 million frameworks available that implement MVC along with related gadgetry: UI templating systems, object-relational mappers etc. Some frameworks provide strict enforcement of MVC's "rules", for example by prohibiting access to models from within the views. Whichever MVC system we use, we shouldn't lose sight of the original aim: to bridge the gap between the human user's mental model and the digital model. That is something no framework can do for you.

Update: at the behest of inn0 I encourage you to have a look at Trygve Reenskaug's article about the origins of MVC - it's a great read.


Data.gov.uk: Where's the Data?

The new data.gov.uk site has been launched, which opens up government data for reuse by companies and individuals. Sounds great!

  1. Government publishes data using "open standards, open source and open data"
  2. Geeks from across the whole country get to work analysing, cross-referencing and building cool applications
  3. Everyone wins

The problem is it's still quite difficult to find really usable data. For example, take the first data set: 2008 Injury Road Traffic Collisions in Northern Ireland. What we actually get is a link to a landing page on the Police Service of Northern Ireland website, with links to everything from crime statistics to Workforce Composition Figures.

And it turns out they're all PDFs, a next-to-useless format for data processing.

Also, I have found many pages that are simply a placeholder for future data, such as the page on the Annual abstract of statistics which currently states "There is currently no text in this page."

Now I think it's a great start, and there are already some pretty cool apps available, but I think that data.gov.uk could do a better job of distinguishing between usable datasets and placeholders or PDF reports.


Blogging with PyBlosxom

This blog started a few weeks ago as a standard Wordpress blog, and I quickly discovered I wanted something a bit more 'interesting' as a platform.

Sure, Wordpress ticks all the boxes and is incredibly easy to use, but part of me wanted something a little more ... nerdy

One of the things that I dislike about many blogging platforms is the very fact that they are web-based. It's not the web interface that annoys me so much as the workflow you are forced into. Rather than managing content through a WYSIWYG editor, I'd much rather edit my posts with my favourite text editor, on any computer, and manage everything remotely. I'd also like version control, and to be able to see the history of all my posts and additions.

I first experimented with Blosxom, which is based on the simple idea of dropping text files into a directory and letting a single Perl script do the rest. However, I wanted something I could hack around with, and I knew that if I went the Perl route my blog would soon be abandoned. Wait! I'm not a Perl hater, I just wanted something a bit more ... pythonic :)

Enter PyBlosxom. Practically the same, but written in Python, PyBlosxom doesn't seem to have as many users or plugins, but it does a good job all the same. I quickly ported the Carrington theme from Wordpress and set about configuring it.

I wrote a little script called genblog.sh that generates the whole site statically, and manages all the css and other static files. I can edit my blog posts wherever I am and simply commit them into svn. A simple "deployment" script is all it takes to update the site:

Additionally, PyBlosxom allows you to use different configurations by passing a command-line option to the pyblosxom.py script. This way I am able to generate a 'preview' of new posts before I publish them, for example:

You'll notice that the whole site is plain vanilla HTML. To be honest, Disqus is much better than anything I could run locally myself for comments, so there was no need to run PyBlosxom as a CGI script, but that is also an option.

I think it would be nice to see more development of things like PyBlosxom. Some people use github as a blog for the same reasons - a weblog with a hacker-friendly workflow. What do you use?


Web-scraping and Geo-locating Ticket Restaurants

For a number of years I have been a user of various restaurant ticket schemes. These are 'cheques' given out by employers to allow people to buy lunch in a participating restaurant. These schemes are quite popular in Spain and there are three major systems in use (along with others): Sodexo Pass, Ticket Restaurant and Cheque Gourmet.

One annoying problem is the difficulty of finding restaurants that participate in any of these schemes. It would be nice to have the different restaurants marked on a map, but each service offers only a fairly low-quality restaurant finder. In this web-scraping example, our aim is to scrape as much restaurant information as we can from different sources, and compile it all into a MySQL database. Useful information might include the name of the restaurant or bar, its address, phone number and geographical coordinates. Let's go!

NB: all the code for this exercise is available in my public SVN repository:
http://svn.happy.cat/public/restaurants/trunk/
You will need to install some dependencies: see install-deps.sh

The basic strategy involves 3 steps:

  1. Download raw data from public website
  2. Extract meaningful information
  3. Merge and load into usable data store

To do the downloading we use simple shell scripts and wget; for example:

Easy peasy. In the Mobiticket case (a WAP service offered by Ticket Restaurant) we need two steps: first we download an index, and then we download the individual restaurant pages one by one. To avoid clobbering these services, we put a delay between each HTTP request.
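
The scripts in the repository use the shell and wget; purely as an illustration of the "delay between requests" idea, here is the same pattern sketched in Python. The function names and the two-second default are my assumptions, not values taken from the original scripts.

```python
import time
from urllib.request import urlopen


def fetch_url(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_politely(urls, delay_seconds=2.0, fetch=fetch_url, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages
```

Passing fetch and sleep in as parameters keeps the loop easy to try out without touching the network.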

Now comes the tricky part. To extract meaningful data from the HTML or WML that we downloaded, we resort to Python and some simple regular expressions. We could probably have used BeautifulSoup or some other kind of XML processing along with XPath. I throw my hands up - I have no defence. In fact, if you are really getting serious about scraping, have a look at Scrapy.
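
To give a flavour of the regular-expression approach, here is a minimal sketch. The sample markup, the pattern and the function names are all invented for illustration; the real pages have their own structure and the real scripts their own patterns.

```python
import re

# Hypothetical markup, made up for this sketch.
SAMPLE_HTML = """
<div class="r"><b>Bar Pepe</b><span class="addr">Calle Mayor 1, Madrid</span><span class="tel">910000000</span></div>
<div class="r"><b>Casa Lola</b><span class="addr">Gran Via 20, Madrid</span><span class="tel">910000001</span></div>
"""

RESTAURANT_RE = re.compile(
    r'<b>(?P<name>[^<]+)</b>'
    r'<span class="addr">(?P<address>[^<]+)</span>'
    r'<span class="tel">(?P<phone>[^<]+)</span>'
)


def extract_rows(html):
    """Return one (name, address, phone) tuple per restaurant found."""
    return [
        (m.group("name").strip(), m.group("address").strip(), m.group("phone").strip())
        for m in RESTAURANT_RE.finditer(html)
    ]


def to_tab(rows):
    """Format rows as tab-separated lines, one restaurant per line."""
    return "\n".join("\t".join(row) for row in rows)
```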

Anyway, the files *-extract-to-tab.py simply churn through the data and spit out tabular data. For example:

So we run our scripts like this:

And we should have three tabular files which we can import into our database.

At this stage we will need to set up a database table to store the data and create a config file. See the example config and setup the table with the restaurants.sql script.

That should give us a nice big table of some 66,725 restaurants. Olé!

But wouldn't it be nice to be able to put these on a map? Fortunately, we've been able to pinpoint around 10,000 restaurants because Mobiticket included a map in their application, so we could scrape the latitude and longitude. But this accounts for less than 18% of the restaurants we know about. However, Google provides a geocoding service which we can use to look up the coordinates for a given street address.

To use this service you need an API key which you should put in your config.py. Then run:

and go and have a cup of tea. In fact you'll have to wait quite a while, because your database contains around 45,000 restaurants that need to be geo-located, and Google places a limit of 15,000 geocode requests in a 24 hour period from a single IP address. For this reason we have to put a 6 second delay between each API request.
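
The 6 second figure follows directly from the quota, and the same arithmetic gives a rough idea of how long the cup of tea needs to last. This is a back-of-the-envelope sketch that uses only the numbers quoted above.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
DAILY_LIMIT = 15000   # geocode requests allowed per 24 hours
PENDING = 45000       # approximate number of restaurants still to locate

min_delay = SECONDS_PER_DAY / DAILY_LIMIT   # 5.76 seconds between requests
delay = math.ceil(min_delay)                # rounded up to a whole 6 seconds
total_hours = PENDING * delay / 3600        # 75.0 hours, a little over three days

print(min_delay, delay, total_hours)
```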

Anyway, I hope you enjoyed this little detour into web scraping. As you can see, the basic pattern is:

  • download data from various sources
  • process and normalize
  • collate into a central store

Like I mentioned, there are frameworks like Scrapy that make it easy; personally I like to split the work into various smaller steps, so that I can introduce 'save points'. In Scrapy remember to do this in the item pipeline. There's a great tutorial to get you going.


What this blog is about

So after a point-by-point rebuttal of the article by the unfortunate Sr. Civantos, here’s a sneak preview of what we’re in for:

Chart created with Create A Graph.


First post: Christmas, spam and blogs

With Christmas upon us, email circulars are reaching a crescendo, but one particular example of blog-spam caught my eye. It displays all the classic traits of low quality journalism:

  • recycled ‘news’ from a foreign source;
  • total lack of fact-checking, to the point of comedy;
  • controversial or outrageous storyline;
  • seasonal or topical content;
  • large chorus of ignorant adherents chanting in the comments.

The piece was written by one Daniel Civantos, who treats with great delicacy and care the thorny subject of religious freedom. Apparently an 8-year-old boy was expelled from school and sent to a psychologist for drawing Jesus on the cross. So let’s investigate Daniel’s journalistic capabilities…

  1. Our investigative hero kicks off describing the outrage of “a father from Taunton (United Kingdom)”, and helpfully links to the original article in the Taunton Daily Gazette: a newspaper published in Taunton, Massachusetts (USA). Whoops! Still, only 5122km off the mark [1].
  2. Said father’s 8-year-old son was reportedly sent home from school.
    Not true, as the story has since been debunked by boston.com.
  3. The drawing in question supposedly depicted Jesus on the cross with “X”s for eyes. However, it has since been learned that the child claimed that the drawing represented himself.

The father himself sums up the situation:

“It hurts me that they did this to my kid,” Chester Johnson, the boy’s father, told the Globe. “They can’t mess with our religion; they owe us a small lump sum for this.”

(So this had nothing at all to do with money, then.)

I suppose that if I can do a better job of informing even half the readership of our super-sleuth Daniel, I will be satisfied.

More to follow!

[1] http://www.geobytes.com/CityDistanceTool.htm


'Chocolate' VIM colorscheme

Here's the VIM colorscheme I'm using:

It's called chocolate because it's based on the W3C core style with the same name. Download chocolate.vim here. If you like chocolate you may also like vividchalk.