Sunday, September 23, 2007

Pathetic Python Blogging

Dear Lazyweb - can anyone work out why I can't get useful data out of this page with BeautifulSoup and Python 2.5?

The information is in an HTML table, enclosed by td tags nested in tr tags, and governed by three CSS classes, "flight-data", "data-head" and "data-row2". The latter pair are used only within the first. So you would think something like this would work:
for item in soup.findAll('td', {'class': 'flight-data'}):
The ellipsis is there to make the indentation obvious in this post. Here soup is, naturally, an instance of BeautifulSoup that's been fed the webpage as a file-like object. But it doesn't work: it does grab some of the data, but it also grabs much of the webpage as raw HTML, including the header and a gaggle of JavaScript. And it's slow, dammit. I can't be too far off beam, because I'm successfully parsing another very similar website using a near-identical parse command.

I've tried various interlocking restrictions, and searching for both data-head and data-row2, but these usually find nothing.


Unknown said...

[[td.string for td in tr.findAll('td') if td.string] for tr in soup.findAll('tr', {'class': 'data-row2'})]

Unknown said...

oh yes, as for speed, first:

soup = soup.find('table', {'id': 'dgArrivals', 'class': 'flight-data'})

('id' alone is enough, though)

if you need more speed, you'd want to use lxml.
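
The two suggestions above boil down to: narrow down to the one table by its id first, then walk only the rows with class "data-row2" and keep the non-empty cells. Here is a minimal sketch of that idea, assuming nothing about the real page beyond the id and class names mentioned in this thread. It deliberately uses the standard library's html.parser rather than BeautifulSoup or lxml, so it is a stand-in for the technique, not the commenter's code; the sample markup, RowGrabber and extract_rows are all made up for illustration, and it does not handle nested tables.

```python
from html.parser import HTMLParser

# Made-up sample markup: only the id and the class names come from the thread.
SAMPLE_HTML = """
<html><head><script>var junk = 1;</script></head>
<body>
<table id="nav"><tr class="data-row2"><td>ignore me</td></tr></table>
<table id="dgArrivals" class="flight-data">
  <tr class="data-head"><td>Flight</td><td>From</td><td>Status</td></tr>
  <tr class="data-row2"><td>XX123</td><td>Somewhere</td><td>Landed</td></tr>
  <tr class="data-row2"><td>YY456</td><td></td><td>Delayed</td></tr>
</table>
</body></html>
"""


class RowGrabber(HTMLParser):
    """Collect cell text from rows of one class inside one table (hypothetical helper)."""

    def __init__(self, table_id, row_class):
        super().__init__()
        self.table_id = table_id
        self.row_class = row_class
        self.in_table = False
        self.in_row = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            # Step 1: only look inside the table we care about.
            self.in_table = True
        elif tag == "tr" and self.in_table:
            # Step 2: only rows carrying the wanted class start a new result row.
            classes = (attrs.get("class") or "").split()
            if self.row_class in classes:
                self.in_row = True
                self.rows.append([])
        elif tag == "td" and self.in_row:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "tr":
            self.in_row = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside a wanted cell (headers, scripts, other tables) is ignored.
        if self.in_cell:
            self.rows[-1][-1] += data


def extract_rows(html, table_id="dgArrivals", row_class="data-row2"):
    parser = RowGrabber(table_id, row_class)
    parser.feed(html)
    parser.close()
    # Drop empty cells, much as the "if td.string" filter does in the one-liner.
    return [[cell.strip() for cell in row if cell.strip()] for row in parser.rows]


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The point of the sketch is the order of operations: restricting the search to one table before matching rows is what keeps the header, the scripts and any look-alike rows elsewhere on the page out of the result.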

Alex said...

Hey, I tried that; it eventually sporked the Python interpreter, but not before producing reams of unparsed HTML.

Alex said...

OK; slight change; try again - works!!

