Dubai Airport has done something awful to their Web site. Where once flights were organised in table rows with class names like "data-row2", now exactly half the rows are like that. The flights have been split between separate arrival, departure, and cargo-only pages, each of which shows only the latest dozen or so movements, and the rows that aren't "data-row2" have no class attribute at all, just arbitrary HTML colours.
And the airline names have disappeared, replaced by their logos as GIFs. Unhelpful, but then, why should they want to help me?
Anyway, I've solved the parsing issue with the following horrible hack.
output = [[td.string or td.img["src"] for td in tr.findAll(True) if td.string or td.img] for tr in soup.findAll('tr', bgcolor=lambda value: value in ('White', '#F7F7DE'))]
As it happened, I later realised I didn't need to bother grabbing the logo filenames in order to extract airline identifiers from them, so the td.img["src"] bit can be dropped.
But it looks like I'm going to need to do the lookup from ICAO or IATA identifiers to airline names myself, which is necessary to avoid having to remake the whitelist, the database, and the stats script. Fortunately, there's a list on Wikipedia. The good news is that I've come up with a way of telling the ICAO and the IATA identifiers apart in the flight numbers. ICAO identifiers are always three alphabetical characters; IATA identifiers are two alphanumeric characters, and aren't necessarily globally unique. In a flight number, either kind can be followed by a number of variable length.
So if the third character in the flight number is a digit, the first two must be an IATA identifier; if it's a letter, the first three must be an ICAO identifier.
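For what it's worth, the third-character test could be sketched roughly like this. It's only a minimal sketch: the function name split_flight_number and the tuple it returns are made up for illustration, and it assumes the flight number arrives as a plain string, possibly with a space between the identifier and the number.

```python
def split_flight_number(flight):
    """Split a flight number into (kind, identifier, number).

    Hypothetical helper, not part of the original script.
    Assumes the third-character rule holds: a digit there means a
    two-character IATA identifier, a letter means a three-letter ICAO one.
    """
    # Normalise: drop surrounding and embedded spaces, force upper case.
    flight = flight.strip().upper().replace(" ", "")
    if len(flight) < 3:
        raise ValueError("too short to classify: %r" % flight)
    if flight[2].isdigit():
        # Third character is a digit, so the first two are the IATA identifier.
        return ("IATA", flight[:2], flight[2:])
    if flight[:3].isalpha():
        # Three leading letters, so this is an ICAO identifier.
        return ("ICAO", flight[:3], flight[3:])
    # Anything else doesn't fit either pattern.
    raise ValueError("unrecognised flight number: %r" % flight)


if __name__ == "__main__":
    for example in ("EK201", "UAE201", "9W 123"):
        print(split_flight_number(example))
```

The identifier that comes back could then be looked up in whichever column of the Wikipedia list matches its kind. Since IATA identifiers aren't globally unique, that lookup may turn up more than one airline.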