Sunday, August 08, 2010

scraping the barrel

I've finally got around to answering my own question here. The scraper is a work in progress at the moment; the original PDF is rendered by pdftohtml into a tiresomely semi-structured (i.e. worse than no structure) tagpile. I was trying to tackle this through recursion, but I might instead try Python's continue keyword, or pre-tokenise the document based on the number of blank lines between blocks and then deal with the blocks.
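
For what it's worth, the blank-line idea might look something like the sketch below. It is only a rough illustration, not the actual scraper: the function name split_blocks and the min_blank_lines threshold are made up for the example, and it assumes the pdftohtml tags have already been stripped so that we are looking at plain text lines.

```python
def split_blocks(text, min_blank_lines=2):
    """Group lines into blocks, starting a new block whenever at least
    min_blank_lines consecutive blank lines have been seen.

    Hypothetical helper: the threshold of 2 is a guess and would need
    tuning against the real document.
    """
    blocks = []
    current = []
    blank_run = 0
    for line in text.splitlines():
        if line.strip() == "":
            blank_run += 1
            continue  # blank lines are only counted, never stored
        if blank_run >= min_blank_lines and current:
            # A long enough gap closes the block we were building.
            blocks.append(current)
            current = []
        blank_run = 0
        current.append(line.strip())
    if current:
        blocks.append(current)
    return blocks


sample = "Title\nrow 1\n\nrow 2\n\n\nNext\nrow a\n"
for block in split_blocks(sample):
    print(block)
```

A single blank line is treated as noise inside a block, while a longer gap starts a new one. Whether the real document is that consistent is another matter.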

This all depends on the thing actually having any underlying structure, of course - it may be assembled by copy-and-paste, so anything I do will blow up every month. The things I do for England...

