Sunday, November 14, 2010

killing data.gov.uk, and thinking aloud about mapping the lobbysphere

So the government thinks this is clever. They also think it constitutes a "searchable online database". It is not searchable, nor is it a database. It is a collection of links to department web sites, some of which actually lead to useful documents, some of which lead to utterly pointless intermediary pages, some of which lead to documents in a sensible format, some of which lead to documents in pointlessly wrong formats, and some of which lead to PDF files. It provides no clue how often this data will be released or when or where. The URIs sometimes suggest that they might be predictable, sometimes they are just random alphanumeric sequences. Basically, what he said.

Meanwhile, very few of these documents have made it onto data.gov.uk, the government's data web site (pro-tip: the hint is in the name) which provides all that stuff out of the box. This is not just disappointing - this is actively regressive. Is it official policy to break data.gov.uk?

Anyway, I've been fiddling with NetworkX, the network-graph library for Python from Los Alamos National Laboratory. Sadly it doesn't have a method networkx.earth_shattering_kaboom(). I've eventually decided that the visualisation paradigm I wanted was looking me in the eye all along - kc claffy's Skitter graph, used by CAIDA to map the Internet's peering architecture.

The algorithm is fairly simple - nodes are located in terms of polar coordinates, on a circular chart. In the original, the concept is that you are observing from directly above the north or south pole. This gives you two dimensions - angle, or in other words, how far around the circle you are, and radius, your location on the line from the centre to the edge. claffy et al. used the longitude of each Autonomous System's WHOIS technical contact address for their angles, and the inverse of each node's linkdegree for the radius. Linkdegree is a metric of how deeply connected any given object in the network is; taking the inverse (i.e. 1/linkdegree) meant that the more links you have, the more central you are.
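To make the geometry concrete, here is a minimal sketch of that placement rule in plain Python. The function name `skitter_position` is made up for illustration, and it assumes the angle arrives in degrees (as a longitude would) and that linkdegree is a positive number; it is not CAIDA's code, just the polar-to-Cartesian step as described above.

```python
import math


def skitter_position(angle_deg, linkdegree):
    """Place a node on a Skitter-style chart.

    angle_deg  -- position around the circle, in degrees (hypothetical input,
                  e.g. a longitude)
    linkdegree -- how connected the node is; must be positive

    Returns (x, y), with the radius taken as 1/linkdegree so that
    well-connected nodes sit closer to the centre.
    """
    if linkdegree <= 0:
        raise ValueError("linkdegree must be positive")
    radius = 1.0 / linkdegree
    theta = math.radians(angle_deg)
    return (radius * math.cos(theta), radius * math.sin(theta))


def distance_from_centre(point):
    """Euclidean distance of an (x, y) point from the origin."""
    return math.hypot(point[0], point[1])
```

A node with linkdegree 10 ends up a tenth as far from the centre as one with linkdegree 1, whatever their angles.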

My plan is to define the centre as the prime minister, and to plot the ministries at the distance from him given by the weighting I'd already given them - basically, the prime minister is 1 and the rest are progressively less starting with Treasury and working down - and an arbitrary angle. I'm going to sort them by weight, so that importance falls in a clockwise direction, for purely aesthetic reasons. Then, I'll plot the lobbies. As they are the unknown factors, they all start with the same, small node weighting. Then add the edges - the links - which will have weights given by the weight of the ministry involved divided by the number of outside participants at that meeting, so a one-on-one is the ideal case.
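As a rough sketch of that plan, the snippet below lays out the ministries and computes the edge weights for one meeting. Everything concrete in it is an assumption: the weights in `MINISTRY_WEIGHTS` are invented placeholders, the function names are hypothetical, the ministries are simply spaced evenly around the circle, and the radius is taken as 1 minus the weight so that the prime minister (weight 1) lands at the centre, which is only one reading of "distance given by the weighting".

```python
# Placeholder weights for illustration only; the real weighting is not listed here.
MINISTRY_WEIGHTS = {
    "Prime Minister": 1.0,
    "Treasury": 0.9,
    "Home Office": 0.8,
    "Defence": 0.7,
}


def ministry_layout(weights, centre="Prime Minister"):
    """Return {name: (angle_deg, radius)} for each ministry.

    Angles are measured clockwise and assigned in descending order of
    weight, so importance falls away in a clockwise direction.
    Radius is assumed to be 1 - weight, putting the centre node at 0.
    """
    others = sorted(
        (name for name in weights if name != centre),
        key=lambda name: weights[name],
        reverse=True,
    )
    layout = {centre: (0.0, 0.0)}
    if not others:
        return layout
    step = 360.0 / len(others)  # even spacing: an arbitrary choice
    for i, name in enumerate(others):
        layout[name] = (i * step, 1.0 - weights[name])
    return layout


def meeting_edges(ministry, participants, weights):
    """Return (lobby, ministry, edge_weight) for one meeting.

    Each edge gets the ministry's weight divided by the number of outside
    participants, so a one-on-one meeting carries the full weight.
    """
    if not participants:
        return []
    share = weights[ministry] / len(participants)
    return [(lobby, ministry, share) for lobby in participants]
```

With these placeholder numbers, a one-on-one with the Treasury yields an edge of weight 0.9, while each of three lobbies sharing a Treasury meeting gets 0.3.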

When we come to draw the graph, the lobbies will be plotted with the mean angle of the ministries they have meetings with, and the inverse of their linkdegree, with the node size scaled by its traffic. Traffic in this case basically means how many meetings it had. Therefore, it should be possible to see both how effective the lobbying was, from the node's position, and how much effort was expended, from its size. The edges will be coloured by date, so as to make change over time visible. If it works, I'll also provide some time series things - unfortunately, if the release frequency is quarterly, as it may be, this won't be very useful.
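The lobby placement could be sketched like this, again in plain Python with hypothetical names. Two assumptions are worth flagging: linkdegree is taken here to be the sum of a lobby's edge weights, and the "mean angle" is computed as a circular mean rather than a plain average, since averaging 350 and 10 degrees arithmetically would give 180, on the opposite side of the chart.

```python
import math
from collections import defaultdict


def circular_mean_deg(angles):
    """Mean of angles in degrees, respecting wrap-around at 360."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0


def lobby_nodes(meetings, ministry_angles):
    """Work out where each lobby goes and how big to draw it.

    meetings        -- iterable of (lobby, ministry, edge_weight) tuples
    ministry_angles -- {ministry: angle_deg}

    Returns {lobby: {"angle": ..., "radius": ..., "size": ...}} where
    angle is the circular mean of the ministries met (one entry per meeting),
    radius is 1 / (sum of edge weights), assumed to stand in for linkdegree,
    and size is the number of meetings (the "traffic").
    """
    angles = defaultdict(list)
    degree = defaultdict(float)
    traffic = defaultdict(int)
    for lobby, ministry, weight in meetings:
        angles[lobby].append(ministry_angles[ministry])
        degree[lobby] += weight
        traffic[lobby] += 1
    return {
        lobby: {
            "angle": circular_mean_deg(angles[lobby]),
            "radius": 1.0 / degree[lobby],
            "size": traffic[lobby],
        }
        for lobby in traffic
    }
```

Because every meeting contributes its ministry's angle, a lobby that sees one ministry repeatedly gets pulled towards it, which seems in keeping with the intent but is a design choice rather than anything settled.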

Anyway, as always, to-do no.1 is to finish the web scraping - the Internet's dishes. And think of a snappy name.

1 comment:

Anonymous said...

(captcha = minsilit, the ministry of sycophancy, innumeracy, lunacy, irrationality and toadying)

Clearly people who are going to look through the whole of this data and come to quantitative fact-based solutions are not the target audience. What is there should be more than adequate for knee-jerks and hysteria.
