Sunday, June 12, 2011

meet Project Lobster

My lobbying project has been entered in the Open Data Challenge! Someone posted this to the MySociety list, with rather fewer than the advertised 36 hours left. I was at a wedding and didn't read it at the time. After my partner and I had tried to invent a tap routine to the back end of Prince's "Alphabet Street", got up at 8am to make it for the sadistic bed & breakfast breakfast, and gone back to help clean up and drink any unaccountably unconsumed champagne, the only thing left to look forward to was the end of the day. Then I remembered the message and realised I had to get the entry filed before midnight.

So it was filed in the Apps category - there's an Ideas category but that struck me as pathetic, and after all there is some running code. I pushed on to try and get something out under the Visualisation category but ManyEyes was a bit broken that evening and anyway its network diagram view starts to suck after a thousand or so vertices.

As a result, the project now has a name and I have some thin chance of snagging an actual Big Society cheque for a few thousand euros and a trip to Brussels. (You've got to take the rough with the smooth.)

The most recent experiment with the Lobster Project - see, it's got a name! It's got you in its grips before you know it, lets you think you're king when you're really a prawn... whoops, wrong shellfish - was to try out a new centrality metric, networkx.algorithms.centrality.betweenness_centrality. This is defined as the fraction of the shortest paths between all the pairs of nodes in the network that pass through a given node. As you have probably guessed, this is quite an inefficient metric to compute: the T1700 lappy took over a minute to crunch it, compared to 7 seconds to run the rest of the processing script without it. Perhaps the new KillPad would do better, but the difference is big enough that it's obviously my fault.
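For the record, NetworkX computes this efficiently with Brandes' algorithm; the brute-force sketch below is nothing to do with the project's actual code, it just pins down the definition on a made-up four-node chain:

```python
from collections import deque
from itertools import combinations

def shortest_paths(graph, s, t):
    """Enumerate every shortest path from s to t, via a BFS that
    records all predecessors at the shortest distance."""
    dist, preds = {s: 0}, {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w], preds[w] = dist[u] + 1, [u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                preds[w].append(u)
    if t not in dist:
        return []
    def build(node):  # walk the predecessor lists back to s
        if node == s:
            return [[s]]
        return [p + [node] for u in preds[node] for p in build(u)]
    return build(t)

def betweenness(graph, v):
    """Average, over all pairs (s, t) not involving v, of the
    fraction of shortest s-t paths that pass through v."""
    others = [n for n in graph if n != v]
    total, pairs = 0.0, 0
    for s, t in combinations(others, 2):
        paths = shortest_paths(graph, s, t)
        if paths:
            total += sum(v in p for p in paths) / len(paths)
            pairs += 1
    return total / pairs if pairs else 0.0

# Toy graph: the chain a - b - c - d
chain = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
print(betweenness(chain, 'b'))  # 2 of the 3 pairs route through b: 0.666...
```

On the chain, b sits on the a-c and a-d shortest paths but not the c-d one, hence two thirds; an end node like a scores zero.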

Worth bothering with?

As far as I can see, though, it's also not very useful. The results are correlated (R^2 = 0.64) with the infinitely faster weighted graph degree. (It also confirms that Francis Maude is the secret ruler of the world.)

The NX functions I'm really interested in are the ones for clique discovery and blockmodelling. It's obvious that with getting on for 3,000 links and more to come, any visualisation is going to need a lot of reduction. Blockmodelling basically chops your network into groups of nodes you provide and aggregates the links between those groups - it's one way, for example, to get department-level results.
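NX will do the blockmodelling for you, but the aggregation idea is simple enough to sketch in plain Python; all the ministers, lobbyists and department names below are made up:

```python
from collections import defaultdict

def blockmodel(edges, node_to_group):
    """Collapse a weighted edge list into group-to-group totals.
    edges: iterable of (u, v, weight); node_to_group: node -> group."""
    block = defaultdict(float)
    for u, v, w in edges:
        gu, gv = node_to_group[u], node_to_group[v]
        # Sort the pair so the aggregation is undirected
        block[tuple(sorted((gu, gv)))] += w
    return dict(block)

# Hypothetical meeting links: (minister, lobbyist, number of meetings)
edges = [
    ('Minister A', 'Lobbyist X', 3),
    ('Minister A', 'Lobbyist Y', 1),
    ('Minister B', 'Lobbyist X', 2),
]
groups = {
    'Minister A': 'Cabinet Office', 'Minister B': 'Cabinet Office',
    'Lobbyist X': 'Finance', 'Lobbyist Y': 'Energy',
}
print(blockmodel(edges, groups))
# {('Cabinet Office', 'Finance'): 5.0, ('Cabinet Office', 'Energy'): 1.0}
```

Five meetings' worth of links between the Cabinet Office block and the finance lobby block, one to the energy lobby - that's the department-level view.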

But I'd be really interested to use empirical clique discovery to feed into blockmodelling - the API for the one generates a Python list of cliques, which are themselves lists of nodes, and the other accepts a list of nodes or a list of lists (of nodes). Another interesting option would be to blockmodel by edge attribute, which would be a way of deriving results for the content of meetings via the "Purpose of meeting" field. However, that would require creating a list of unique meeting subjects, iterating over it to build lists of nodes with at least one edge carrying that subject, and then shoving the resulting list-of-lists into the blockmodeller.

That's a lorra lorra iteratin' by anybody's standards, even if, this being Python, most of it will end up rolled into a couple of seriously convoluted list comps. Oddly enough, it would be far easier in a query language or an ORM, but I've not heard of anything that lets you do SQL queries against an NX graph.
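For what it's worth, a defaultdict collapses most of that into a single pass over the edges rather than one pass per subject (names below again hypothetical):

```python
from collections import defaultdict

# Hypothetical edge records: (minister, lobbyist, "Purpose of meeting")
meetings = [
    ('Minister A', 'Lobbyist X', 'Banking reform'),
    ('Minister B', 'Lobbyist X', 'Banking reform'),
    ('Minister A', 'Lobbyist Y', 'Wind farms'),
]

# One pass: collect the nodes touched by each meeting subject
nodes_by_subject = defaultdict(set)
for u, v, subject in meetings:
    nodes_by_subject[subject].update((u, v))

# The list-of-lists the blockmodeller would want
partition = [sorted(nodes) for nodes in nodes_by_subject.values()]
print(partition)
```

One caveat: the resulting groups can overlap (Minister A turns up under both subjects here), and a blockmodeller that insists on a strict partition of the node set may object to that.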

Having got this far, I notice that I've managed to blog my enthusiasm back up.

Anyway, I think it's perhaps time for a meetup on this next week with Who's Rob-bying.
