Fastest, easiest, and best way to parse an HTML table?

╄→гoц情女王★ 提交于 2021-02-04 14:53:09

问题


I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, python or JavaScript.

This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.

BeautifulSoup is the first thing that comes to mind. Another possibility is copying/pasting it in TextMate and then running regular expressions.

What do you suggest?

This is the script that I ended up writing, but as I said, I'm looking for a more general solution.

from BeautifulSoup import BeautifulSoup
import urllib2


url = 'http://www.datamystic.com/timezone/time_zones.html';
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if(len(tds)==4):
        countrycode = tds[1].string
        timezone = tds[2].string
        if(type(countrycode) is not type(None) and type(timezone) is not type(None)):
            print "\'%s\' => \'%s\'," % (countrycode.strip(), timezone.strip())

Comments and suggestions for improvement to my python code welcome, too ;)


回答1:


For your general problem: try lxml.html from the lxml package (think of it as the stdlibs xml.etree on steroids: the same xml api, but with html support, xpath, xslt etc...)

A quick example for your specific case:

from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
           [td.text_content().strip() for td in row.findall('td')] 
           for row in table.findall('tr')
       ]

This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)

BUT: More specifically for your particular use case: there are better way to get at timezone database information than scraping that particular webpage (aside: note that the web page actually mentions that you are not allowed to copy its contents). There are even existing libraries that already use this information, see for example python-dateutil.




回答2:


Avoid regular expressions for parsing HTML, they're simply not appropriate for it, you want a DOM parser like BeautifulSoup for sure...

A few other alternatives

  • SimpleHTMLDom PHP
  • Hpricot & Nokogiri Ruby
  • Web::Scraper Perl/CPAN

All of these are reasonably tolerant of poorly formed HTML.




回答3:


I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile that is bundled with PHP and then use XPath to grep the data you need.

This is not the fastest way, but the most readable (in my opinion) in the end. You can use Regex, which will probably be a little faster, but would be bad style (hard to debug, hard to read).

EDIT: Actually this is hard because the page you mentioned is not valid HTML (see validator.w3.org). Especially tags with no opening/closing tag are heavily in the way.

It looks though like xmlstarlet ( http://xmlstar.sourceforge.net/ (great tool)) is able to repair the problem (run xmlstarlet fo -R ). xmlstarlet can also do xpath and xslt script which can help you in extracting your data with a simple shell script.




回答4:


While we were building SerpAPI we tested many platform/parser.

Here is the benchmark result for Python.

For more, here is a full article on Medium: https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd




回答5:


The efficiency of a regex is superior to a DOM parser.

Look at this comparison:

http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team

You can find many more searching the web.



来源:https://stackoverflow.com/questions/4893298/fastest-easiest-and-best-way-to-parse-an-html-table

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!