The information is in an HTML table, enclosed by td tags nested in tr tags, and governed by three CSS classes, "flight-data", "data-head" and "data-row2". The latter pair are used only within the first. So you would think something like this would work:
for item in soup.findAll('td', {'class': 'flight-data'}):
    output.append(item)
Here soup is an instance of BeautifulSoup that's been fed the webpage as a file-like object, and output is a list. But it doesn't work; it does grab some of the data, but it also grabs much of the webpage as raw HTML, including the header and a gaggle of JavaScript. And it's slow, dammit.
I can't be too far off beam, because I'm successfully parsing another very similar website using a near-identical parse command. I've tried various interlocking restrictions, and searching for both data-head and data-row2, but these usually find nothing.
4 comments:
[[td.string for td in tr.findAll('td') if td.string] for tr in soup.findAll('tr', {'class': 'data-row2'})]
Oh yes, as for speed, first narrow the soup down to the table itself:
soup = soup.find('table', {'id': 'dgArrivals', 'class': 'flight-data'})
(The 'id' alone is enough, though.)
If you need more speed, you'd want to use lxml.
Hey, I tried that; it eventually sporked the Python interpreter, but not before producing reams of unparsed HTML.
OK; slight change; try again - works!!
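For anyone landing here later, the two suggestions above can be combined into one rough sketch: narrow the soup to the table first, then pull the cell text out of each data-row2 row. The sample markup below is made up for illustration (the real page isn't reproduced in this post), extract_rows is a hypothetical helper name, and the sketch assumes the newer bs4 package with its find_all spelling and the built-in "html.parser" backend rather than lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the arrivals page; the real markup may differ.
SAMPLE_HTML = """
<html><head><script>var x = 1;</script></head>
<body>
<table id="dgArrivals" class="flight-data">
  <tr class="data-head"><td>Flight</td><td>From</td><td>Status</td></tr>
  <tr class="data-row2"><td>BA123</td><td>London</td><td>Landed</td></tr>
  <tr class="data-row2"><td>AF456</td><td>Paris</td><td></td></tr>
</table>
</body></html>
"""


def extract_rows(html):
    """Return the non-empty cell strings of each data-row2 row in the table."""
    soup = BeautifulSoup(html, "html.parser")
    # Restrict the search to the one table so scripts and headers are ignored.
    table = soup.find("table", {"id": "dgArrivals"})
    if table is None:
        return []
    return [
        # td.string is None for empty cells, so those are skipped.
        [td.string for td in tr.find_all("td") if td.string]
        for tr in table.find_all("tr", {"class": "data-row2"})
    ]


rows = extract_rows(SAMPLE_HTML)
```

Searching inside the table rather than the whole soup keeps the JavaScript and page header out of the results, and returning an empty list when the table is missing avoids an AttributeError if the page layout changes.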