Sunday, March 28, 2010

oiling the steel to sharpen the blade to shave the yak

Progress update on fixing the Vfeed.

Dubai Airport has done something awful to their Web site. Where once flights were organised in table rows with class names like "data-row2", now exactly half the flights are like that. The flights have been split between separate arrival, departure, and cargo-only pages, each page shows only the latest dozen or so movements, and the rows that aren't "data-row2" don't have any class attributes, just seemingly random HTML colours.

And the airline names have disappeared, replaced by their logos as GIFs. Unhelpful, but then, why should they want to help me?

Anyway, I've solved the parsing issue with the following horrible hack:
output = [[td.string or td.img["src"] for td in tr.findAll(True) if td.string or td.img] for tr in soup.findAll('tr', bgcolor=lambda value: value in ('White', '#F7F7DE'))]

As it happened, I later realised I didn't need to bother grabbing the logo filenames in order to extract airline identifiers from them, so the td.img["src"] bit can be dropped.
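
For anyone without Beautiful Soup to hand, the same row filter can be sketched with the standard library's html.parser. This is a stand-in, not the code the Vfeed runs: the names RowScraper and scrape_rows, the sample markup, and the flights in it are all invented for illustration, and it keeps only the cell text, with the logo filenames dropped.

```python
from html.parser import HTMLParser

# Row colours to accept, taken from the one-liner above.
WANTED_COLOURS = {"White", "#F7F7DE"}


class RowScraper(HTMLParser):
    """Collect the text of each cell in rows that have a wanted bgcolor."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per matching row
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # attrs arrives as a list of (name, value) pairs
            wanted = dict(attrs).get("bgcolor") in WANTED_COLOURS
            self._row = [] if wanted else None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            text = "".join(self._cell).strip()
            if text:  # skip empty cells, e.g. ones holding only a logo
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    parser = RowScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up markup in the shape described above, not real airport data.
SAMPLE = """
<table>
<tr class="data-row2"><td>EK001</td><td>London</td></tr>
<tr bgcolor="White"><td><img src="ek.gif"></td><td>EK002</td><td>London</td></tr>
<tr bgcolor="#F7F7DE"><td>UAE9901</td><td>Frankfurt</td></tr>
<tr bgcolor="Red"><td>ignored</td></tr>
</table>
"""

rows = scrape_rows(SAMPLE)
```

Like the one-liner, this only picks up the rows marked by bgcolor; the "data-row2" rows would need a second pass keyed on the class attribute.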

But it looks like I'm going to need to do the lookup from ICAO or IATA identifiers to airline names myself, which is necessary to avoid having to remake the whitelist, the database, and the stats script. Fortunately, there's a list on Wikipedia. The good news is that I've come up with a way of differentiating the ICAO and the IATA identifiers in the flight numbers. ICAO identifiers are always three alphabetical characters; IATA identifiers are two alphanumeric characters, and aren't necessarily globally unique. In a flight number, either kind can be followed by a number of variable length.

So if the third character in the flight number is a digit, the first two characters must be an IATA identifier; if it is a letter, the first three must be an ICAO identifier.
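
A minimal sketch of that rule, together with the lookup step, might look like this. The function names and the two tiny lookup tables are hypothetical; a real table would be built from the Wikipedia list, and flight numbers that fit neither pattern are simply rejected.

```python
# Illustrative entries only; the real tables would come from the Wikipedia list.
AIRLINES_BY_IATA = {"EK": "Emirates", "BA": "British Airways"}
AIRLINES_BY_ICAO = {"UAE": "Emirates", "BAW": "British Airways"}


def split_flight_number(flight):
    """Return (kind, airline code, numeric part) for a flight number."""
    flight = flight.strip().upper()
    if len(flight) < 3:
        raise ValueError("flight number too short: %r" % flight)
    if flight[2].isdigit():
        # Third character is a digit: two-character IATA identifier.
        return "IATA", flight[:2], flight[2:]
    if flight[:3].isalpha():
        # Third character is a letter: three-letter ICAO identifier.
        return "ICAO", flight[:3], flight[3:]
    raise ValueError("unrecognised flight number: %r" % flight)


def airline_name(flight):
    """Look up the airline name, or return None if the code is unknown."""
    kind, code, _ = split_flight_number(flight)
    table = AIRLINES_BY_IATA if kind == "IATA" else AIRLINES_BY_ICAO
    return table.get(code)
```

Checking the third character, rather than looking for the first digit, matters because an IATA identifier can itself contain a digit: "U2123" splits as "U2" plus "123".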


Laban said...

What's the language? I have an idea for a web-scraping application (basically a book search tool that checks Amazon, eBay, AbeBooks etc.), but no clue as to what to use.

The last time I did such a thing was 10 years back, only wanted the data from one site, downloaded the entire site, then merged the html into one file (with a dos command I think) stripped the tags and parsed the data with VB to produce csv output. Worked, but clunky.

Can't do that for multiple sites - I need to scrape the screens. How ? Last time I asked you about the airport data you said the code was 'of my own devisin' !

Laban said...

Do you think that web change has anything to do with your research, btw ?

yorksranter said...

The Vfeed is implemented in Python (like the code snippet above). Python has a fantastic third-party library for parsing HTML and XML documents called Beautiful Soup, which will screenscrape pretty much anything into useful data structures.

For example, that snippet finds all tr tags that have the attribute bgcolor with either of the values "White" or "#F7F7DE", then finds all the tags within each tr (the td cells, in practice) that have either a single string as their content or an img tag, and returns the content or the image filename as a list of Python list objects. That you can do this in a one-liner, admittedly a tortuous one, is one of the reasons to use Python - that's a nested list comprehension with a lambda function passed as a keyword argument.

If you need a clientside solution, you'd be looking at JavaScript and XPath.

Depending on how complex the job is, you might be able to get away with a Yahoo Pipe.

As far as DXB's motives go, I don't think so - it looks like the Pointy-Headed Boss wanted the colours changing one morning and they pushed out a really hacky fix.

Gridlock said...

Maybe these guys could give you a hand, or become a customer...

