It was WhoseKidAreYou taking up the time; I'm learning steadily about SPARQL and various JavaScript things, notably XPath, which I have to agree is a pretty cool way of dismantling, remantling, and generally fiddling with HTML/XML documents, even compared to BeautifulSoup. (You can search for a pattern - for example, anything with the class attribute "byline" - and then index into the results by a filesystem-like / notation, which is handy when the material you need is inside a sensibly named entity but wrapped in random tags, a surprisingly common antipattern.) I've also identified 11 newspapers' patterns for bylines, and in the cases where the metadata is in the meta tags, like it should be, I've also identified where the byline block appears in the text.
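To make the pattern-then-path idea concrete, here is a rough sketch in Python rather than in the Greasemonkey script itself. It uses the standard library's ElementTree, which only understands a subset of XPath and needs well-formed markup, so treat it as an illustration of the technique and not as the WKAY code. The sample page, the p/a path and the function name find_byline_name are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page: the byline is sensibly named
# but buried inside wrapper tags that carry no meaning.
SAMPLE = """<html>
  <body>
    <div id="wrapper">
      <span>
        <div class="byline">
          <p><a href="#">A.N. Other</a></p>
          <p>10 June</p>
        </div>
      </span>
    </div>
  </body>
</html>"""


def find_byline_name(markup):
    """Return the author name from the first class="byline" element, or None."""
    root = ET.fromstring(markup)
    # Step 1: search the whole tree for anything whose class attribute
    # is exactly "byline" (ElementTree does exact matching only).
    blocks = root.findall(".//*[@class='byline']")
    if not blocks:
        return None
    # Step 2: index into the result with a filesystem-like path.
    # The p/a layout is an assumption about this sample page.
    link = blocks[0].find("p/a")
    return link.text if link is not None else None


print(find_byline_name(SAMPLE))
```

Real newspaper pages are rarely well-formed XML and often put several class names in one attribute, so a browser's XPath engine or a forgiving HTML parser would be needed in practice.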
For example - the Torygraph puts the name of the author in a <meta name="author" content="A.N. Other">, and they then have a <div class="byline">, but the byline div also includes the timestamp, so we're getting the byline from the meta tags to save post-processing it and then identifying the div for later.
Fair enough; the next problem is the SPARQL query, which seems to be remarkably tickly and easy to break. The problem with this semantic web stuff is that it's so damn semantic; everything wants very closely specifying. In theory, it should be possible to grab a whole variety of data on the overentitled brat in question - employment, publications, criminal record, whatever. Which is nice.
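For a sense of what "very closely specifying" means, here is a minimal sketch of the kind of query involved, again in Python and not the actual WKAY query. It assumes the public DBpedia endpoint at https://dbpedia.org/sparql, the dbo:parent property from the DBpedia ontology, and that the byline matches the article's English rdfs:label exactly; any of those can fail for a given person. The function names and the format parameter are assumptions for the example, and nothing here touches the network.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public DBpedia SPARQL service.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_parent_query(name, lang="en"):
    """Build a SPARQL query asking DBpedia for the parents of a named person."""
    # Escape backslashes and double quotes so the name is a valid literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    # The label must match exactly, language tag included; a missing
    # full stop or a middle initial is enough to return nothing.
    return f'''PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?parent WHERE {{
  ?person rdfs:label "{escaped}"@{lang} .
  ?person dbo:parent ?parent .
}} LIMIT 10'''


def build_request_url(name):
    """Return a GET URL for the query (the format parameter is an assumption)."""
    params = {
        "query": build_parent_query(name),
        "format": "application/sparql-results+json",
    }
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


print(build_parent_query("A.N. Other"))
```

The fragility is visible in the first triple pattern: it only matches when the byline text and the Wikipedia article title agree character for character.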
The downside is, though, that DBpedia is dependent on decent infoboxes in Wikipedia articles to work. So if you want to help with WKAY (I like the acronym - sounds like a Mexican radio station in a Jack Kerouac novel) and you aren't coding, why not go and contribute relatives to Wikipedia?
Actually, I don't think the Wikimedia Foundation will let you do that, even if The Register likes to call them a cult. I mean, contribute other people's relatives. No. No. No slavery or grave-robbing, please. I mean, go and edit prominent idiots' Wikipedia entries and record whose kids they are, and pretty up the info boxes.
Speaking of info boxes, at the moment they are the best paradigm I can think of for displaying the data when we get it. It's pretty trivial to template HTML in Greasemonkey and to replace elements on the page with it, but I want it to look good. If Dan Lockton wants to join the Ggroup, that would be very helpful.