Scraping new ESPN site using xpath [Python]

难免孤独 2021-02-11 08:23

I am trying to scrape the new ESPN NBA scoreboard. Here is a simple script which should return the start times for all games on 4/5/15:

import requests
import lx         


        
1 answer
  •  礼貌的吻别
    2021-02-11 09:18

    The page is highly dynamic: it relies on asynchronous XHR requests and JavaScript logic. requests is not a browser; it downloads only the initial HTML page, and that HTML contains no span elements with class="time".
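To make the difference concrete, here is a minimal sketch that counts span elements with class="time" in two HTML strings using only the standard library. Both strings are hypothetical stand-ins (not the real ESPN markup): one for what requests would download, one for what a browser would show after JavaScript has run. The ClassCounter and count_elements names are also just illustrative.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that match a tag name and carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag, css_class):
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for the initial HTML that requests receives.
STATIC_HTML = """
<html><body>
  <div id="scoreboard-page" data-data="{}"></div>
  <script src="app.js"></script>
</body></html>
"""

# Hypothetical stand-in for the DOM after JavaScript has inserted a game time.
RENDERED_HTML = STATIC_HTML.replace(
    "</div>", '<span class="time">1:00 PM ET</span></div>'
)

print(count_elements(STATIC_HTML, "span", "time"))
print(count_elements(RENDERED_HTML, "span", "time"))
```

Running the same count against response.text from requests is a quick way to check whether the elements you are after exist before any JavaScript runs.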

    One option is to drive a real browser with selenium. Here is an example using the PhantomJS headless browser:

    >>> from selenium import webdriver
    >>> 
    >>> url = "http://scores.espn.go.com/nba/scoreboard?date=20150405"
    >>> 
    >>> driver = webdriver.PhantomJS()
    >>> driver.get(url)
    >>> 
    >>> elements = driver.find_elements_by_css_selector("span.time")
    >>> for element in elements:
    ...     print(element.text)
    ... 
    
    1:00 PM ET
    3:30 PM ET
    6:00 PM ET
    7:00 PM ET
    7:30 PM ET
    9:00 PM ET
    9:30 PM ET 
    

    Alternatively, you can look for the desired data in the data-data attribute of the div with id="scoreboard-page":

    import json
    from pprint import pprint
    
    import lxml.html
    import requests
    
    response = requests.get('http://scores.espn.go.com/nba/scoreboard?date=20150405')
    doc = lxml.html.fromstring(response.content)
    
    data = doc.xpath("//div[@id='scoreboard-page']/@data-data")[0]
    data = json.loads(data)
    
    pprint(data)
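
Once the JSON is decoded, the start times still have to be located inside it. The layout of the real payload is not shown here, so the sketch below avoids assuming one: it walks the nested dicts and lists and collects every value stored under a given key. The sample payload and the "date" key are hypothetical, chosen only to illustrate the helper; inspect the pprint output to find the actual key names.

```python
import json


def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending, since the value may itself contain the key.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Hypothetical payload; the real data-data structure may differ.
sample = json.loads(
    '{"events": [{"id": "1", "date": "2015-04-05T17:00Z"},'
    ' {"id": "2", "date": "2015-04-05T19:30Z"}]}'
)

print(list(find_values(sample, "date")))
```

With the real page you would pass the decoded data object instead of sample.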
    
