So I was moaning about the Government and the release of lists of meetings with external organisations. Well, what about some action? I've written a scraper that aggregates all the existing data and sticks it in a sinister database. At the moment, the Cabinet Office, DEFRA, and the Scottish Office have coughed up the files and are all included. I'm going to add more departments as they become available. ScraperWiki seems to be a bit sporky this evening; the whole thing has run to completion, although for some reason you can't see all the data, and I've twice added the link to the UK Open Government Licence without it being saved.
A couple of technical points. First, I'd like to thank this guy who wrote an alternative to the wonderful DictReader class in Python's csv module. DictReader is lovely because it lets you open a CSV (or indeed anything-separated-value) file and keep the rows of data linked to their column headers as Python dictionaries. Unfortunately, it won't handle Unicode, or indeed anything except UTF-8 and plain ASCII. Which is a problem if you're Chinese, or, as it happens, if you want to read documents produced by Windows users, as they tend to use Really Strange characters for trivial things like apostrophes (\x92, can you believe it?). His version, however, will process whatever encoding you give it and still give you dictionaries. Thanks!
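I haven't reproduced his module here, but the core idea can be sketched in a few lines of modern Python 3, where the decoding step is explicit. The column names in the sample data are made up for illustration:

```python
import csv
import io

def unicode_dict_reader(raw_bytes, encoding="utf-8", **kwargs):
    # Decode the raw bytes first, then hand the resulting text to
    # csv.DictReader, so every row comes back as a dict keyed by the
    # header row regardless of the source encoding.
    text = raw_bytes.decode(encoding)
    return csv.DictReader(io.StringIO(text), **kwargs)

# Windows-1252 bytes: \x92 is the "smart" right single quote
data = b"Minister,Purpose of meeting\r\nMr X,Discuss the minister\x92s plans\r\n"
rows = list(unicode_dict_reader(data, encoding="cp1252"))
```

The point is that you pick the encoding once, up front, and everything downstream sees ordinary dictionaries with properly decoded text.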
I also discovered something fun about ScraperWiki itself. It's surprisingly clever under the bonnet - I was aware of various smart things with User Mode Linux and heavy parallelisation going on, and I recall Julian Todd talking about his plans to design a new scaling architecture based on lots of SQLite databases in RAM as read-slaves. Anyway, I had kept some URIs in a list, which I was then planning to loop through, retrieving the data and processing it. One of the URIs, DEFRA's, ended like so: oct2010.csv.
Obviously, I liked the idea of generating the filename programmatically, in the expectation of future releases of data. For some reason, though, the parsing kept failing as soon as it got to the DEFRA page. Weirdly, the parser would run into a chunk of HTML and, obviously enough, choke. But there was no HTML on the page. Bizarre. Eventually I thought to look in the ScraperWiki debugger's Sources tab. To my considerable surprise, all the URIs were being loaded at once, in parallel, before the processing of the first file began. This was entirely different from the flow of control in my program, and as a result, the HTTP request was issued before the filename had been generated, so DEFRA was 404ing. And because the csv module takes a file object rather than a string, I was using urllib.urlretrieve() rather than urlopen() or scraperwiki.scrape() - which meant the 404 error page was saved to disk and fed to the parser as if it were the CSV. Hence the HTML.
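This wasn't my actual fix, but the lesson generalises: if you fetch the page as a string first (say via urlopen().read() or scraperwiki.scrape()), you can sanity-check that you haven't been handed an error page before the csv module ever sees it. A minimal defensive sketch:

```python
import csv
import io

def rows_from_csv_text(text):
    # A 404 or other error page comes back as HTML, not CSV; fail
    # loudly rather than letting the parser choke on the markup.
    if text.lstrip().lower().startswith(("<!doctype", "<html")):
        raise ValueError("got HTML instead of CSV - check the generated URL")
    return list(csv.DictReader(io.StringIO(text)))
```

Had I been working with strings rather than urlretrieve()'d files, the failure would at least have been obvious immediately.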
So, ScraperWiki does a silent optimisation and loads all your data sources in parallel at startup. Quite cool, but I have to say some documentation of this feature would be nice, as multithreading is usually meant to be voluntary :-)
TODO, meanwhile: at the moment, all the organisations that take part in a given meeting are lumped together. I want to break them out, to facilitate counting the heaviest lobbyists and feeding visualisation tools. Also, I'd like to clean up the "Purpose of meeting" field so as to be able to do the same for subject matter.
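As a sketch of the counting half of that TODO - assuming, as a guess about the data, that the organisations are comma-separated within the Name of External Org column (some releases may well use semicolons or "and"):

```python
from collections import Counter

def heaviest_lobbyists(rows, n=10):
    # Split the lumped-together organisation field and tally how many
    # meetings each organisation turns up in.
    counts = Counter()
    for row in rows:
        for org in row["Name of External Org"].split(","):
            org = org.strip()
            if org:
                counts[org] += 1
    return counts.most_common(n)
```

The hard part in practice won't be the counting but normalising the names, since "BP", "BP plc" and "British Petroleum" would all count separately here.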
Update: Slight return. Fixed the unique keying requirement by creating a unique meeting id.
Update Update: Would anyone prefer it if the data output schema were link-oriented rather than event-oriented? At the moment it preserves the underlying structure of the data releases, which have one row per meeting. It might be better, when I come to expand the Name of External Org field, to have a row per relationship, i.e. per edge in the network. This would help a lot with visualisation. In that case, I'd create a non-unique meeting identifier, so that the meetings could be recreated by grouping on that key, and instead put the unique constraint on an identifier for each link.
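A sketch of what that expansion might look like, assuming hypothetical Minister, Date and Name of External Org columns and a comma-separated organisation list - the real release schemas may differ:

```python
import hashlib

def meeting_to_edges(row):
    # Non-unique meeting id, derived deterministically from the source
    # row, shared by every link row so that grouping on it recreates
    # the original meeting.
    raw = "|".join([row["Minister"], row["Date"], row["Name of External Org"]])
    meeting_id = hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]
    orgs = [o.strip() for o in row["Name of External Org"].split(",") if o.strip()]
    # One row per minister-organisation link, each with its own unique key
    return [
        {"link_id": "%s-%d" % (meeting_id, i), "meeting_id": meeting_id,
         "minister": row["Minister"], "org": org}
        for i, org in enumerate(orgs)
    ]
```

Each output row is one edge in the network; the unique constraint sits on link_id, while meeting_id is deliberately repeated.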
Update Update Update: So I made one.