Web-scraping and Geo-locating Ticket Restaurants

For a number of years I have been a user of various restaurant ticket schemes. These are 'cheques' given out by employers to allow people to buy lunch in a participating restaurant. These schemes are quite popular in Spain and there are three major systems in use (along with others): Sodexo Pass, Ticket Restaurant and Cheque Gourmet.

One annoying problem is the difficulty of finding restaurants that participate in any of these schemes. It would be nice to have the different restaurants marked on a map, but each service offers only a fairly low-quality restaurant finder. In this web-scraping example, our aim is to scrape as much restaurant information as we can from the different sources and compile it all into a MySQL database. Useful information might include the name of the restaurant or bar, its address, phone number and geographical coordinates. Let's go!

NB: all the code for this exercise is available in my public SVN repository:
http://svn.happy.cat/public/restaurants/trunk/
You will need to install some dependencies: see install-deps.sh

The basic strategy involves 3 steps:

  1. Download raw data from the public websites
  2. Extract meaningful information
  3. Merge and load into a usable data store

To do the downloading we use simple shell scripts and wget; for example:
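(Only a sketch; the URL and file path below are placeholders, and the real download scripts are in the repository.)

    wget --output-document=data/sodexo/listing-01.html \
        "http://example.com/restaurant-search?province=madrid&page=1"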

Easy peasy. In the Mobiticket case (a WAP service offered by Ticket Restaurant) we need two steps: first download an index, then download the individual restaurant pages one by one. To avoid clobbering these services, we put a delay between each HTTP request.
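In sketch form, that two-step fetch looks something like this (the index URL and the id pattern are invented):

    # Step 1: grab the index of restaurant ids (placeholder URL).
    wget --output-document=data/mobiticket/index.wml \
        "http://example.com/mobiticket/index.wml"

    # Step 2: fetch each restaurant page, sleeping between requests
    # so we don't clobber the service.
    for id in $(grep -o 'id=[0-9]*' data/mobiticket/index.wml | cut -d= -f2); do
        wget --output-document="data/mobiticket/$id.wml" \
            "http://example.com/mobiticket/detail?id=$id"
        sleep 2
    done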

Now comes the tricky part. To extract meaningful data from the HTML or WML that we downloaded, we resort to Python and some simple regular expressions. We could probably have used BeautifulSoup or some other kind of XML processing along with XPath. I throw my hands up - I have no defence. In fact, if you are really getting serious about scraping, have a look at Scrapy.

Anyway, the files *-extract-to-tab.py simply churn through the data and spit out tabular data. For example:
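(An illustrative line; the values are invented, but each row carries the name, address, phone number and, where available, the coordinates, separated by tabs.)

    Restaurante Ejemplo	Calle Mayor 1, 28013 Madrid	912345678	40.4168	-3.7038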

So we run our scripts like this:
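(The script and output file names here are indicative; check the repository for the real ones.)

    python sodexo-extract-to-tab.py data/sodexo/*.html > sodexo.tab
    python mobiticket-extract-to-tab.py data/mobiticket/*.wml > mobiticket.tab
    python chequegourmet-extract-to-tab.py data/chequegourmet/*.html > chequegourmet.tab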

And we should have three tabular files which we can import into our database.

At this stage we will need to set up a database table to store the data and create a config file. See the example config and set up the table with the restaurants.sql script.
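Roughly like this, assuming a database called tickets, a table called restaurants and the tab files from the previous step (the real schema is in restaurants.sql):

    # Create the table.
    mysql -u myuser -p tickets < restaurants.sql

    # Load each tab-separated file; LOAD DATA expects tab-separated
    # fields by default, which is what the extract scripts produce.
    for f in sodexo.tab mobiticket.tab chequegourmet.tab; do
        mysql --local-infile=1 -u myuser -p \
            -e "LOAD DATA LOCAL INFILE '$f' INTO TABLE restaurants" tickets
    done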

That should give us a nice big table of some 66,725 restaurants. Olé!

But wouldn't it be nice to be able to put these on a map? Fortunately, we've been able to pinpoint around 10,000 restaurants already, because Mobiticket included a map in their application, so we could scrape the latitude and longitude. But this accounts for less than 18% of the restaurants we know about. However, Google provides a geocoding service which we can use to look up the coordinates for a given street address.

To use this service you need an API key, which you should put in your config.py. Then run:
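(The script name here is indicative; see the repository for the real one. Since the run takes a long time, it is worth leaving it going in the background.)

    nohup python geocode-restaurants.py > geocode.log 2>&1 &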

and go and have a cup of tea. In fact you'll have to wait quite a while: your database contains around 45,000 restaurants that still need to be geo-located, and Google places a limit of 15,000 geocode requests in a 24 hour period from a single IP address. For this reason we have to put a 6 second delay between each API request (86,400 seconds divided by 15,000 requests works out to about 5.8 seconds), which means the full run takes roughly three days.

Anyway, I hope you enjoyed this little detour into web scraping. As you can see, the basic pattern is:

  • download data from various sources
  • process and normalize
  • collate into a central store

As I mentioned, there are frameworks like Scrapy that make this kind of thing easy; personally I like to split the work into smaller steps, so that I can introduce 'save points'. In Scrapy, remember to do this in the item pipeline. There's a great tutorial to get you going.

