BeautifulSoup returning incorrect text

南方客 2021-01-21 10:28

I'm trying to scrape the below site for live tennis scores. When the match is over, the elements I'm scraping change and I can get the score, but during the match when I searc

1 answer

傲寒 (original poster)

2021-01-21 11:07

The webpage uses JavaScript. If you download the URL with urllib, the JavaScript is never executed, so much of the HTML you see in the browser is never generated.
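To make that concrete, here is a minimal sketch using only the standard library's html.parser. The page in RAW_HTML is made up for illustration (it is not the real scoreboard markup), and CellText is a hypothetical helper. The parser reads the markup exactly as it was downloaded and never runs the script, so the score cell stays empty:

```python
from html.parser import HTMLParser

# Hypothetical page: the score cell is empty in the raw HTML and would only
# be filled in by the script once a browser executes it.
RAW_HTML = """
<table>
  <tr><td class="name">Player A</td><td id="score"></td></tr>
</table>
<script>
  document.getElementById("score").textContent = "6";
</script>
"""


class CellText(HTMLParser):
    """Collect the text found inside each <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Script source also arrives here, but it is outside any <td>.
        if self._in_td:
            self.cells[-1] += data


parser = CellText()
parser.feed(RAW_HTML)
print(parser.cells)  # the second cell is an empty string
```

BeautifulSoup and lxml behave the same way in this respect: they parse the text they are given and do not execute scripts.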

One way to execute the JavaScript is to use Selenium. Another way is to use PyQt4:

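# Note: this snippet targets Python 2 and PyQt4 (it uses the unicode builtin).
import sys
from PyQt4 import QtWebKit
from PyQt4 import QtCore
from PyQt4 import QtGui

class Render(QtWebKit.QWebPage):
    def __init__(self, url):
        # A QApplication must exist before any web page object is created.
        self.app = QtGui.QApplication(sys.argv)
        QtWebKit.QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QtCore.QUrl(url))
        # Block here until _loadFinished calls quit().
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the rendered frame, then stop the event loop.
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
r = Render(url)
# HTML of the page after the JavaScript has run
content = unicode(r.frame.toHtml())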

Once you have content (after the JavaScript has been executed) you can parse it with an HTML parser (like BeautifulSoup or lxml).

For example, using lxml:

import lxml.html as LH

def clean(text):
    # Strip non-breaking spaces (U+00A0).
    return text.replace(u'\xa0', u'')

doc = LH.fromstring(content)
result = []
# Select every row that contains a cell with the summary class.
for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
    row = []
    for elt in tr.xpath('td'):
        row.append(clean(elt.text_content()))
    # Skip the first cell; join the rest into one line.
    result.append(u', '.join(row[1:]))
print(u'\n'.join(result))

yields

Chardy J. (Fra), 2, 6, 77, , , , 
Zeballos H. (Arg), 0, 4, 63, , , ,
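The trailing empty fields in that output are presumably cells for sets that had not been played, left empty once clean() strips the non-breaking space. If you would rather not print them, one option is to drop empty cells before joining. This is only a sketch: format_row is a hypothetical helper, and the cells list is sample data shaped like the row above, not values read from the live page.

```python
def clean(text):
    # Same helper as above: strip non-breaking spaces.
    return text.replace('\xa0', '')


def format_row(cells):
    """Join the cleaned cells of one row, skipping the first cell and empties."""
    cleaned = [clean(c).strip() for c in cells[1:]]
    return ', '.join(c for c in cleaned if c)


# Hypothetical cell texts for one row (assumed layout: one leading cell,
# the player name, the scores, then placeholders for unplayed sets).
cells = ['\xa0', 'Chardy J. (Fra)', '2', '6', '77',
         '\xa0', '\xa0', '\xa0', '\xa0']
print(format_row(cells))
```

Whether to drop the empty fields depends on what you do next: if you later split each line back into columns by position, keeping the placeholders may be the better choice.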

Using Selenium and PhantomJS (so that a GUI browser doesn't pop up), this is what the equivalent code would look like:

import selenium.webdriver as webdriver
import contextlib
import os
import lxml.html as LH

# define path to the phantomjs binary
phantomjs = os.path.expanduser('~/bin/phantomjs')
url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
# contextlib.closing calls driver.close() when the block exits.
with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    # HTML of the page after the JavaScript has run
    content = driver.page_source
    doc = LH.fromstring(content)
    result = []
    for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
        row = []
        for elt in tr.xpath('td'):
            row.append(elt.text_content())
        # Skip the first cell; join the rest into one line.
        result.append(u', '.join(row[1:]))
    print(u'\n'.join(result))

Both the Selenium/PhantomJS solution and the PyQt4 solution take about the same amount of time to run.
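One caveat if you try this today: PhantomJS is no longer maintained, and Selenium 4 removed webdriver.PhantomJS. A headless Chrome driver is a common substitute. The sketch below is an assumption about how that swap might look, not part of the original answer. It assumes Selenium and a Chrome installation are available, and make_headless_chrome and fetch_rendered_html are hypothetical names. The driver factory is passed in as a parameter so the fetching logic can be exercised without a real browser.

```python
def make_headless_chrome():
    # Imported lazily so this module still loads without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no browser window pops up
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, make_driver=make_headless_chrome):
    """Load url in a browser and return the HTML after scripts have run."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always end the browser session, even if loading fails.
        driver.quit()


# Usage (requires Selenium and Chrome):
# content = fetch_rendered_html(
#     'http://www.scoreboard.com/game/6LeqhPJd/#game-summary')
```

The returned string can be passed to LH.fromstring exactly as in the examples above. If the scores are loaded asynchronously after the initial page load, page_source may be read too early and an explicit wait may be needed before reading it.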
