BeautifulSoup returning incorrect text

南方客 2021-01-21 10:28

I'm trying to scrape the site below for live tennis scores. When the match is over the elements I'm scraping change and I can get the score, but during the match when I searc

1 answer
  •  傲寒
    傲寒 (original poster)
    2021-01-21 11:07

    The webpage builds its content with JavaScript. If you download the URL with urllib, that JavaScript is never executed, so much of the HTML you see in the browser is never generated.
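
    To see the effect without touching the network, here is a minimal sketch that uses only the standard library's html.parser module. The page snippet, the "score" id, and the td_texts helper are made up for illustration and are not taken from the real site. A parser reads the markup exactly as served, so the cell that the script would fill in stays empty:

```python
from html.parser import HTMLParser

# Hypothetical page: the score cell is empty in the served HTML and is
# only filled in by the script when a browser actually runs it.
RAW_HTML = """
<html><body>
<table><tr><td id="score"></td></tr></table>
<script>document.getElementById("score").textContent = "6-4";</script>
</body></html>
"""


class CellText(HTMLParser):
    """Collect the text found inside <td> elements."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Script text arrives here too, but it is outside any <td>.
        if self.in_td:
            self.cells[-1] += data


def td_texts(html):
    """Return the text of every <td> in html, without running any script."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.cells


print(td_texts(RAW_HTML))  # the only cell is an empty string
```

    The same thing happens with BeautifulSoup or lxml on HTML fetched by urllib: the parser is fine, the input is just missing the generated parts.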

    One way to execute the JavaScript is to use Selenium. Another way is to use PyQt4:

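    import sys
    from PyQt4 import QtWebKit
    from PyQt4 import QtCore
    from PyQt4 import QtGui
    
    class Render(QtWebKit.QWebPage):
        """Load a URL in an off-screen WebKit page and keep the loaded frame."""
        def __init__(self, url):
            # A QApplication must exist before the page object is set up.
            self.app = QtGui.QApplication(sys.argv)
            QtWebKit.QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QtCore.QUrl(url))
            # Block here until _loadFinished() quits the event loop.
            self.app.exec_()
    
        def _loadFinished(self, result):
            # The page has finished loading: keep the frame, stop the loop.
            self.frame = self.mainFrame()
            self.app.quit()
    
    url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
    r = Render(url)
    # Convert the result of toHtml() to a text string.
    # unicode() exists only on Python 2; on Python 3 use str() instead.
    content = unicode(r.frame.toHtml())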
    

    Once you have content (after the JavaScript has been executed) you can parse it with an HTML parser (like BeautifulSoup or lxml).

    For example, using lxml:

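    import lxml.html as LH
    
    def clean(text):
        # Remove non-breaking spaces (u'\xa0') from the cell text.
        return text.replace(u'\xa0', u'')
    
    doc = LH.fromstring(content)
    result = []
    # Select only rows that have a td with class "left summary-horizontal".
    for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
        row = []
        for elt in tr.xpath('td'):
            row.append(clean(elt.text_content()))
        # Skip the first cell and join the rest into one line per row.
        result.append(u', '.join(row[1:]))
    print(u'\n'.join(result))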
    

    yields

    Chardy J. (Fra), 2, 6, 77, , , , 
    Zeballos H. (Arg), 0, 4, 63, , , , 
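
    The 77 and 63 in the third set look odd at first. A plausible reading, and it is only an assumption because the page markup is not shown here, is that the set score and the tiebreak score sit in nested elements (for example a sup inside the td), and text_content() concatenates the text of all descendants. The sketch below reproduces that effect with the standard library's xml.etree.ElementTree, whose itertext() serves the same purpose here; the row markup and the row_texts helper are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real page's structure is not shown in the answer.
ROW = (
    '<tr>'
    '<td class="left summary-horizontal">Chardy J. (Fra)</td>'
    '<td>2</td><td>6</td><td>7<sup>7</sup></td><td>\u00a0</td>'
    '</tr>'
)


def clean(text):
    # Same idea as clean() in the lxml example: drop non-breaking spaces.
    return text.replace('\xa0', '')


def row_texts(row_xml):
    """Return the cleaned text of each <td> in a single <tr> string."""
    tr = ET.fromstring(row_xml)
    # itertext() yields the element's own text and that of its descendants,
    # so the "7" and the nested "7" end up joined as "77".
    return [clean(''.join(td.itertext())) for td in tr.findall('td')]


print(', '.join(row_texts(ROW)))
```

    If the two numbers are needed separately, the nested element would have to be queried on its own rather than through the concatenated text.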
    

    Using Selenium and PhantomJS (so that a GUI browser doesn't pop up), this is what the equivalent code would look like:

    import selenium.webdriver as webdriver
    import contextlib
    import os
    import lxml.html as LH
    
    # define path to the phantomjs binary
    phantomjs = os.path.expanduser('~/bin/phantomjs')
    url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
    # closing() calls driver.close() when the with block exits
    with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
        driver.get(url)
        # page_source is the page's HTML as currently held by the browser
        content = driver.page_source
        doc = LH.fromstring(content)
        result = []
        for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
            row = []
            for elt in tr.xpath('td'):
                row.append(elt.text_content())
            result.append(u', '.join(row[1:]))
        print(u'\n'.join(result))
    

    Both the Selenium/PhantomJS solution and the PyQt4 solution take about the same amount of time to run.
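
    PhantomJS is no longer maintained and current Selenium releases have dropped its driver, so on a recent setup the same idea would likely use a headless Firefox or Chrome instead. A minimal sketch, assuming Selenium 4 and a local Firefox install; fetch_rendered_html and headless_firefox are hypothetical helper names, not part of Selenium:

```python
def fetch_rendered_html(url, driver_factory):
    """Open url with a driver from driver_factory and return its page source.

    driver_factory is any zero-argument callable returning an object with
    get(), page_source and quit(), such as a Selenium WebDriver.
    """
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading fails.
        driver.quit()


def headless_firefox():
    """Create a Firefox driver that runs without opening a window."""
    # Imported lazily so fetch_rendered_html() can be used (and tested)
    # on machines where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


# Usage (needs Selenium and Firefox):
# content = fetch_rendered_html(
#     'http://www.scoreboard.com/game/6LeqhPJd/#game-summary',
#     headless_firefox)
```

    Passing the driver in as a factory keeps the browser choice in one place; the returned content can be handed to lxml or BeautifulSoup as before.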
