Question
I am trying to extract some information from a variety of pages and struggling a bit. This shows my challenge:
import requests
from lxml import html
url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
response = requests.get(url)
print(response.content)
If you copy the output into Notepad, you cannot find the value "9.20" anywhere in the output (the Team A odds in the bottom right of the webpage). However, if you open the webpage, do a Save-As and then import it back into Python like this, you can locate and extract the 9.20 value:
with open(r'HUL 1-7 TOT _ Hull - Tottenham _ Match Summary.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
output = tree.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()')  # the XPath for the Team A odds (the 9.20 value)
output  # ['9.20']
I am not sure why this workaround works; that is beyond me. What I would like to do is save a webpage to my local drive, open it in Python as above, and carry on from there. But how do I replicate the Save-As in Python? This does not work:
import urllib.request
response = urllib.request.urlopen(url)
webContent = response.read().decode('utf-8')
# the with block flushes and closes the file automatically
with open('HUL 1-7 TOT _ Hull - Tottenham _ Match Summary.html', 'w') as f:
    f.write(webContent)
It gives me a webpage but it is a fraction of the original page...?
Answer 1:
As @Pedro Lobito said, the page content is generated by JavaScript, so the HTML that requests or urllib downloads does not yet contain the odds. For this reason you need a module that can run JavaScript. I would choose requests_html or selenium.
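A quick way to confirm this diagnosis is to check whether the element you want exists in the raw HTML at all. The sketch below uses only the standard library. The helper name `has_element_with_id` and the two sample strings are made up for illustration; they are not taken from the real page.

```python
from html.parser import HTMLParser


class _IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_with_id(html_text, target_id):
    """Return True if html_text contains a tag whose id equals target_id."""
    finder = _IdFinder(target_id)
    finder.feed(html_text)
    return finder.found


# Hypothetical samples: what a server might send versus what the browser
# holds after its scripts have run.
raw_html = (
    '<html><body><div id="app"></div>'
    '<script src="app.js"></script></body></html>'
)
rendered_html = (
    '<html><body><div id="app">'
    '<table id="default-odds"><tbody><tr><td>9.20</td></tr></tbody></table>'
    '</div></body></html>'
)

print(has_element_with_id(raw_html, "default-odds"))       # False
print(has_element_with_id(rendered_html, "default-odds"))  # True
```

If the same check returns False for `response.text` from a plain `requests.get(url)`, the element is most likely being added by JavaScript after the page loads.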
Requests_html
from requests_html import HTMLSession
url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
session = HTMLSession()
response = session.get(url)
response.html.render()
result = response.html.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()')
print(result)
#['9.20']
Selenium
from selenium import webdriver
from lxml import html
url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
dr = webdriver.Chrome()
try:
    dr.get(url)
    # If the page source is read before the page has finished loading, add an
    # explicit wait here (https://selenium-python.readthedocs.io/waits.html).
    # It needs WebDriverWait, expected_conditions (as EC) and By imported:
    # WebDriverWait(dr, 10).until(
    #     EC.presence_of_element_located((By.ID, "myDynamicElement"))
    # )
    tree = html.fromstring(dr.page_source)
    output = tree.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()')  # the XPath for the Team A odds (the 9.20 value)
    print(output)  # ['9.20']
finally:
    dr.quit()  # always shut the browser down, even if an error occurred
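To replicate the browser's Save-As step from the question, you can write the rendered source to disk and reopen it later. This is a minimal sketch, assuming Selenium and a Chrome driver are installed. The function names `write_html`, `read_html` and `save_rendered_page`, and the file name `match.html`, are my own choices and not part of any library.

```python
from pathlib import Path


def write_html(text, path):
    """Write page source to disk as UTF-8 and return the path."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path


def read_html(path):
    """Read a previously saved page back as a string."""
    return Path(path).read_text(encoding="utf-8")


def save_rendered_page(url, path):
    """Open url in Chrome and save the DOM as the browser rendered it.

    Selenium is imported inside the function (an assumption of this sketch)
    so that the two helpers above can be used without it installed.
    """
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return write_html(driver.page_source, path)
    finally:
        driver.quit()


# Usage (needs a browser, so it is left as a comment):
# saved = save_rendered_page(url, "match.html")
# tree = html.fromstring(read_html(saved))
```

From there the XPath above can be applied to the saved file as before. If the saved file is still missing the odds, the source was probably read before the scripts finished, and an explicit wait before `driver.page_source` may be needed.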
Source: https://stackoverflow.com/questions/53544345/save-troublesome-webpage-and-import-back-into-python