I'm trying to scrape the below site for live tennis scores. When the match is over, the elements I'm scraping change and I can get the score, but during the match when I searc
The webpage uses JavaScript. If you download the URL with urllib, the JavaScript is not executed, so much of the HTML you see in the browser is never generated.
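To make the point concrete, here is a minimal sketch (Python 3, standard library only) of what a static parser sees. The markup in `RAW_HTML` is hypothetical, not taken from the site: it stands in for a page whose score cell is filled in by a script. The names `CellText` and `static_cell_text` are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: the score cell stays empty until a browser runs the script.
RAW_HTML = """
<html><body>
<table><tr><td id="score"></td></tr></table>
<script>document.getElementById("score").textContent = "6";</script>
</body></html>
"""


class CellText(HTMLParser):
    """Collect the text found inside the <td id="score"> element."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("id") == "score":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Script source also arrives here, but by then we have left the cell.
        if self.in_cell:
            self.text.append(data)


def static_cell_text(html):
    """Return the score cell's text as a non-JavaScript client would see it."""
    parser = CellText()
    parser.feed(html)
    return "".join(parser.text)


print(repr(static_cell_text(RAW_HTML)))  # prints ''
```

The parser never runs the script, so the cell comes back empty even though a browser would show a score there.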
One way to execute the JavaScript is to use Selenium. Another way is to use PyQt4:
    import sys
    from PyQt4 import QtWebKit
    from PyQt4 import QtCore
    from PyQt4 import QtGui

    class Render(QtWebKit.QWebPage):
        def __init__(self, url):
            self.app = QtGui.QApplication(sys.argv)
            QtWebKit.QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QtCore.QUrl(url))
            # Block here until _loadFinished calls app.quit()
            self.app.exec_()

        def _loadFinished(self, result):
            self.frame = self.mainFrame()
            self.app.quit()

    url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
    r = Render(url)
    # unicode() is Python 2; use str() on Python 3
    content = unicode(r.frame.toHtml())
Once you have content (after the JavaScript has been executed), you can parse it with an HTML parser such as BeautifulSoup or lxml.
For example, using lxml:
    import lxml.html as LH

    def clean(text):
        return text.replace(u'\xa0', u'')

    doc = LH.fromstring(content)
    result = []
    for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
        row = []
        for elt in tr.xpath('td'):
            row.append(clean(elt.text_content()))
        result.append(u', '.join(row[1:]))
    print(u'\n'.join(result))
yields
Chardy J. (Fra), 2, 6, 77, , , ,
Zeballos H. (Arg), 0, 4, 63, , , ,
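To try the row-selection logic without the live page, here is a rough sketch that runs on a hand-written snippet. The markup in `SAMPLE` is hypothetical and much simpler than the real page, and the sketch uses the standard library's `xml.etree.ElementTree` instead of lxml (Python 3), so the nested XPath predicate is replaced by an explicit check. `summary_rows` is a name invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the rows the XPath above targets.
SAMPLE = """
<table>
  <tr><td class="left summary-horizontal">\u00a0</td><td>Chardy J. (Fra)</td><td>2</td><td>6</td></tr>
  <tr><td class="header">ignored</td><td>x</td></tr>
  <tr><td class="left summary-horizontal">\u00a0</td><td>Zeballos H. (Arg)</td><td>0</td><td>4</td></tr>
</table>
"""


def clean(text):
    """Drop non-breaking spaces, as in the lxml version."""
    return text.replace("\u00a0", "")


def summary_rows(markup):
    """Return one comma-joined string per row that has a summary cell."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        # Equivalent of the predicate tr[td[@class="left summary-horizontal"]]
        if not any(td.get("class") == "left summary-horizontal" for td in cells):
            continue
        values = [clean("".join(td.itertext())) for td in cells]
        rows.append(", ".join(values[1:]))  # skip the first (summary) cell
    return rows


print("\n".join(summary_rows(SAMPLE)))
```

ElementTree only accepts well-formed XML, so for real-world HTML the lxml version is the one to use; this is only meant to show which rows are kept and why the first cell is skipped.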
Using Selenium and PhantomJS (so that a GUI browser doesn't pop up), this is what the equivalent code would look like:
    import selenium.webdriver as webdriver
    import contextlib
    import os
    import lxml.html as LH

    # define path to the phantomjs binary
    phantomjs = os.path.expanduser('~/bin/phantomjs')
    url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
    with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
        driver.get(url)
        content = driver.page_source

    doc = LH.fromstring(content)
    result = []
    for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
        row = []
        for elt in tr.xpath('td'):
            row.append(elt.text_content())
        result.append(u', '.join(row[1:]))
    print(u'\n'.join(result))
Both the Selenium/PhantomJS solution and the PyQt4 solution take about the same amount of time to run.
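One caveat, stated as an assumption rather than something verified against this site: with a live match, the page source may be read before the scripts have finished filling in the scores. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for this purpose. The sketch below (Python 3) only shows the underlying idea with a plain polling loop; `wait_for_source` and its parameters are hypothetical names, and it works with any object that exposes a `page_source` attribute.

```python
import time


def wait_for_source(driver, ready, timeout=10.0, poll=0.5):
    """Return driver.page_source once ready(source) is true.

    Raises TimeoutError if the condition is still false after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if ready(source):
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError("page did not reach the expected state")
        time.sleep(poll)
```

It would be called after `driver.get(url)`, for example as `content = wait_for_source(driver, lambda s: 'summary-horizontal' in s)`, where the substring test is just a cheap stand-in for "the score rows are present".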