So, there's this rumour-surrounded gadget that GIYUS wants people to install on their computers as part of the War on Terror. Obviously, I wondered exactly how it worked; did it analyse the Web sites you visit semantically, so as to target its talking points precisely? Did it use some sort of social recommendation mechanism? Also, I was wondering if there was any way of characterising the network traffic it generated and estimating how many people are using it.
And actually, it's kind of disappointing; no folksonomy, no textual analysis, not even crude keyword matching. It just grabs an RSS feed from ws.collactive.com, passing in the string "GIYUS", presumably to ensure it gets the right one, checks whether any items in it aren't already cached, and if so, fires a graphical alert containing the message. It's basically an e-mail list gussied up in Web 2.0 finery, with the feature that it's marginally less trivial to forward the content to nonsubscribers. It doesn't even appear to spy on your browsing history.
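For the curious, the whole client-side logic fits in a few lines. This is a sketch of what the extension appears to do, not its actual code; the function names are mine, and I'm assuming a plain RSS 2.0 feed:

```python
import xml.etree.ElementTree as ET

def new_items(feed_xml, seen):
    """Return titles of feed items we haven't alerted on yet.

    A sketch of the extension's apparent behaviour: fetch the feed,
    skip anything already cached, surface the rest.
    """
    fresh = []
    for item in ET.fromstring(feed_xml).iter("item"):
        guid = item.findtext("guid") or item.findtext("link")
        if guid and guid not in seen:
            seen.add(guid)  # "cache" the item so it only fires once
            fresh.append(item.findtext("title"))
    return fresh

# The real client presumably fetches the feed from ws.collactive.com
# (passing "GIYUS") on a timer and pops a graphical alert per fresh item.
```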
Of course, there could be some server-side magic involved. You can usually get a rough idea of location from an IP address, and a rough idea is probably best in terms of hit rate (you've a much better chance of getting your geotargeting right for "North London" than for "Archway"). And you can draw some conclusions from browser credentials - OS, screen resolution, browser type and version, etc. For example, perhaps you'd want to serve the red-meat "civilian deaths are all a fake" stuff to MSIE 5/6 users in the US heartland and the Decent Left stuff to Mac users in North London. So I considered actually installing the extension; but then I realised I didn't actually want a simulated Melanie Phillips on my sofa any more than I wanted the real thing. However, it's possible to view the feed on the Web anyway, so I checked.
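To be concrete about what that server-side magic might look like: something like this entirely hypothetical dispatcher, working off nothing but what the request itself gives away (country from a GeoIP lookup on the IP, platform from the User-Agent header). Nothing confirms GIYUS does any of this:

```python
def pick_talking_point(country, user_agent):
    """Hypothetical server-side targeting from request metadata alone.

    A sketch of what *could* be done, not what GIYUS is known to do.
    """
    if "MSIE" in user_agent and country == "US":
        return "red-meat"      # the "civilian deaths are all a fake" feed
    if "Macintosh" in user_agent and country == "GB":
        return "decent-left"   # the North London Mac-user feed
    return "generic"           # the fallback for everyone else
```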
But they may not even be doing that; I'm on a weird niche ISP, with a Linux machine, in North London, and the feed I see at http://ws.giyus.org/points/list is deeply generic.
Surely, though, it's possible to do better than this? I envisage a sort of Web force multiplier that would analyse the texts you read as you browse and compute some kind of digest hash, and do the same for every link you send anyone else, stashing the hash of each link on a remote server. As you browse, it compares the hash of the current page with the ones in the DB, and returns a list of possibly appropriate arguments - the strength of this being that they could be data, poetry, code, pictures, video, or indeed anything. We could incorporate some sort of social element, too, to keep a check on quality.
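As a first stab at that digest, you could take a page's most frequent significant words as a crude fuzzy hash and compare the sets. Everything here is my own sketch, including the toy stop-word list:

```python
import re
from collections import Counter

# Toy stop list; a real one would be far longer.
STOP = {"the", "and", "a", "an", "of", "to", "in", "is", "that", "it", "for"}

def digest(text, k=20):
    """A crude fuzzy hash of a page: its k most frequent non-stop-words.

    Unlike a real crypto hash, two pages on the same topic should
    produce heavily overlapping digests.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP)
    return frozenset(w for w, _ in counts.most_common(k))

def similarity(d1, d2):
    """Jaccard overlap between two digests: 1.0 means identical word sets."""
    return len(d1 & d2) / len(d1 | d2) if (d1 | d2) else 0.0

# The server would stash digest() of every link you send, then look up
# similarity() against whatever page you're currently reading.
```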
Who here knows about corpus analysis? Most of the academic papers my casual search found gave me that "dog listening to music" feeling. What I need is something like a rather bad crypto hash function - one where two texts with different content would produce non-randomly different hashes. Obviously we'd filter the text with a list of stop words, like search engines do, so as to strip out the thes and ands. We could, for example, use the distribution of words in Wikipedia as a common baseline, and measure how the distribution of significant words in the target texts differs from it.
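One concrete way to measure that difference is a Kullback-Leibler-style divergence between word distributions. This is a toy sketch under my own assumptions - a made-up stop list, and whatever baseline text you feed it standing in for the Wikipedia-wide counts:

```python
import math
from collections import Counter

def word_dist(text, stop):
    """Relative frequency of each non-stop-word in a text."""
    words = [w for w in text.lower().split() if w not in stop]
    n = len(words) or 1
    return {w: c / n for w, c in Counter(words).items()}

def divergence(target, baseline, floor=1e-6):
    """Kullback-Leibler divergence of the target distribution from the baseline.

    Words the baseline barely contains (floor stands in for a zero count)
    contribute heavily -- those are the "significant" words.
    """
    return sum(p * math.log(p / baseline.get(w, floor))
               for w, p in target.items())
```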