Comments on The Yorkshire Ranter: lobby: update

The categorical nature (dividing lobbyists by indu...

2011-04-20T12:58:47.959+01:00

The categorical nature of the data (lobbyists divided by industrial sector) makes Correspondence Analysis particularly applicable. There are also a bunch of machine learning methods that would work nicely, and there's a standard set of bioinformatics data exploration tools I'd like to try throwing the data at to see what happens.
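For anyone who wants to see what correspondence analysis looks like in practice, here is a minimal sketch using NumPy's SVD. The sector-by-ministry table is made up purely for illustration (it is not the real lobbying data), and the function name `correspondence_analysis` is my own, not a library call. It assumes every row and column has at least one meeting.

```python
import numpy as np

def correspondence_analysis(table):
    """Basic correspondence analysis of a contingency table via SVD."""
    table = np.asarray(table, dtype=float)
    P = table / table.sum()            # correspondence matrix (proportions)
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)          # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]      # principal row coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]   # principal column coordinates
    inertia = sv ** 2                  # inertia carried by each axis
    return row_coords, col_coords, inertia

# Hypothetical counts of meetings: rows are sectors, columns are ministries.
sectors = ["Banking", "Energy", "Pharma"]
ministries = ["Treasury", "Energy", "Health"]
table = [
    [30, 2, 1],
    [5, 25, 2],
    [3, 1, 20],
]

row_coords, col_coords, inertia = correspondence_analysis(table)
share = inertia / inertia.sum()        # fraction of total inertia per axis
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the usual correspondence analysis map, where sectors sit near the ministries they lobby disproportionately.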

As for visualization, there are some pretty pictures that fall out of PCA and correspondence analysis.

My problem with large network graphs is they rapidly turn into "ridiculograms", uninformatively dense meshes. This is common with protein-protein interaction graphs where there is a lot of noise in the data.

Will download the data, please post when it's tidied; long weekend coming up to play with it!

Gawp: PCA is a cool approach. NetworkX's clust...

2011-04-19T22:39:15.187+01:00

Gawp: PCA is a cool approach. NetworkX's clustering algorithms might provide something similar (list). As far as visualisation goes, I've been looking at strategies. I like the idea of a radial-graph look, and the CAIDA Skitter graphs are an inspiration, but that does mean reimplementing it for an undirected graph.

You can get the data out of ScraperWiki, but I need to backport some data cleaning from the project into the scraper.

Anon: AppEngine doesn't have this lib: Network...

2011-04-19T22:24:30.819+01:00

Anon: AppEngine doesn't have this lib: NetworkX.

You can generate a set of weighted values by lobby...

2011-04-18T14:18:10.876+01:00

You can generate a set of weighted values by lobbyist for ministries they lobby. This would allow clustering of lobbyists by activity; banks should cluster together, for example.
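As a rough sketch of what "weighted values by lobbyist" could mean, the snippet below turns a list of meeting records into one vector per lobbyist, where each entry is the share of that lobbyist's meetings held with a given ministry. The records, the names and the choice of "share of meetings" as the weight are all assumptions for illustration; the real data may call for a different weighting.

```python
from collections import defaultdict

# Hypothetical records: (lobbyist, ministry, number of meetings).
meetings = [
    ("BankA", "Treasury", 8), ("BankA", "Business", 2),
    ("BankB", "Treasury", 4), ("BankB", "Business", 1),
    ("OilCo", "Energy", 6), ("OilCo", "Treasury", 1),
]

def weighted_vectors(records):
    """Return (ministries, {lobbyist: [share of meetings per ministry]})."""
    totals = defaultdict(lambda: defaultdict(float))
    for lobbyist, ministry, count in records:
        totals[lobbyist][ministry] += count
    ministries = sorted({ministry for _, ministry, _ in records})
    vectors = {}
    for lobbyist, row in totals.items():
        total = sum(row.values())
        if total == 0:
            continue  # skip lobbyists with no recorded meetings
        vectors[lobbyist] = [row.get(m, 0.0) / total for m in ministries]
    return ministries, vectors

ministries, vectors = weighted_vectors(meetings)
```

In this toy data BankA and BankB end up with identical vectors even though BankA holds twice as many meetings, which is the kind of grouping you would hope to see for the banks.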

Principal Component Analysis would probably be the best fit for this, as you would be able to cluster by direction in lobbyspace versus position, which normalizes for intensity of lobbying. It will also show what percentage of lobbying effort is explained by the first 2 PCA vectors; this might lead to some nice clustering right there. If, say, 50% of lobbying is on the first vector, investigating that vector will tell you a lot about the standard distribution of lobbying efforts. Analysis by sector will tell you something too.
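A minimal sketch of that idea with NumPy follows. I am reading "direction versus position" as scaling each lobbyist's row to unit length before running PCA, which is my assumption about the intent; the counts are invented, and `pca_on_directions` is a hypothetical helper name. It assumes no lobbyist has an all-zero row.

```python
import numpy as np

# Hypothetical meeting counts: rows are lobbyists, columns are ministries.
counts = np.array([
    [2.0, 0.0, 8.0],   # bank, heavy lobbying
    [1.0, 0.0, 4.0],   # bank, same pattern at half the intensity
    [0.0, 6.0, 1.0],   # energy company
    [1.0, 9.0, 2.0],   # another energy company
])

def pca_on_directions(counts, n_components=2):
    """PCA after scaling each row to unit length, so only direction matters."""
    unit = counts / np.linalg.norm(counts, axis=1, keepdims=True)
    centered = unit - unit.mean(axis=0)
    # The SVD of the centered data gives the principal axes as rows of vt.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)       # variance fraction per component
    scores = centered @ vt[:n_components].T   # lobbyists in PCA space
    return scores, vt[:n_components], explained

scores, components, explained = pca_on_directions(counts)
```

Here `explained[:2]` is the fraction of variance carried by the first two vectors, and the rows of `components` show which ministries drive each one. The two banks land on the same point in `scores` despite lobbying at different intensities.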

A similar analysis on the ministries might work too: a weighted vector for each ministry of who is lobbying them. Which ministries are most similarly lobbied? PCA of this would show what spectrum of ministries is lobbied and in what proportion.
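One simple way to ask "which ministries are most similarly lobbied" is cosine similarity between the ministry vectors, sketched below in plain Python. Cosine similarity is my choice for illustration rather than anything the data dictates, and the profiles are invented.

```python
import math

# Hypothetical profiles: for each ministry, meetings held per lobbyist.
lobbying = {
    "Treasury": {"BankA": 8, "BankB": 4, "OilCo": 1},
    "Business": {"BankA": 2, "BankB": 1},
    "Energy": {"OilCo": 6},
}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar_pairs(profiles):
    """Return (similarity, ministry, ministry) tuples, most similar first."""
    names = sorted(profiles)
    pairs = [
        (cosine(profiles[x], profiles[y]), x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
    ]
    return sorted(pairs, reverse=True)

ranked = most_similar_pairs(lobbying)
```

With these toy numbers, Business and Treasury come out on top because the same banks lobby both in the same proportions, while Business and Energy share no lobbyists at all.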

Much of this is probably predictable, but it will give a nice visualization and anomalies are often informative.

Do you have a data set handy? I'd love to have a go at it...

Rather than using an offline cron job and uploadi...

2011-04-18T10:58:57.090+01:00

Rather than using an offline cron job and uploading, why not use the new pipeline library to massage all the data on AppEngine:

http://goo.gl/MDrE5

Have you published the source for this somewhere?

Because Scraperwiki is for scraping.

2011-04-17T18:38:43.571+01:00

Because Scraperwiki is for scraping.

What's the thinking behind taking it off scrap...

2011-04-17T18:35:58.948+01:00

What's the thinking behind taking it off scraperwiki?