When I try to retrieve the page source from the website below, I get completely different (and much shorter) text than when I view the same page source in a web browser.

https://www.whoscored.com/Matches/1006233/live
There are a couple of issues here. The root cause is that the website you are trying to scrape knows you're not a real person and is blocking you. Lots of websites do this simply by checking headers to see whether a request comes from a browser or from a robot. However, this site appears to use Incapsula, which is designed to provide more sophisticated protection. You can try to set up your request differently to fool the security on the page by setting headers, but I doubt this will work.
import requests

def get_page_source(n):
    # Build the match URL and request it with a browser-like User-Agent,
    # hoping the site serves the same page it would give a real browser.
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    return response.text

n = 1006233
text = get_page_source(n)
print(text)
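To see whether you actually got blocked, one rough heuristic (an assumption on my part, not documented behavior) is to look for Incapsula's marker in the returned HTML, since its challenge pages typically mention the vendor. Continuing the script above:

# Heuristic only: Incapsula challenge pages usually embed the vendor name
# in the HTML they serve instead of the real page content.
if 'Incapsula' in text:
    print('Blocked: received an Incapsula challenge page, not the match page')
else:
    print('Got ' + str(len(text)) + ' characters of HTML')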
It looks like the site also uses captchas, which are designed to prevent web scraping. If a site is trying this hard to prevent scraping, it's likely because the data it provides is proprietary. I would suggest finding another site that provides this data, or trying an official API.
Check out this answer (https://stackoverflow.com/a/17769971/701449) from a while back. It looks like whoscored.com uses the OPTA API to provide its data, so you may be able to skip the middleman and go straight to the source. Good luck!
Below is one way of getting around this issue. The first time you run the script you might have to type in the captcha in the window opened by the webdriver, but after that you should be good to go. You can then use BeautifulSoup to navigate the returned HTML, as shown after the code below.
from selenium import webdriver

def get_page_source(n):
    # Drive a real browser so the site's JavaScript checks run as they would
    # for a human visitor; solve the captcha by hand on the first run.
    wd = webdriver.Chrome("/Users/karlanka/Downloads/Chromedriver")
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    wd.get(url)
    html_page = wd.page_source
    wd.quit()
    return html_page
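Once get_page_source() returns the HTML (I've added the return statement above, which the original snippet was missing), BeautifulSoup can navigate it. A minimal sketch, using only generic selectors since I don't know whoscored.com's actual markup:

from bs4 import BeautifulSoup

html_page = get_page_source(1006233)
soup = BeautifulSoup(html_page, 'html.parser')

# Quick sanity check that we received the real page rather than a captcha.
print(soup.title.string)

# Generic navigation example: list every link on the page. Real selectors
# would depend on the site's actual (and changeable) markup.
for a in soup.find_all('a', href=True):
    print(a['href'])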
You should try setting the "User-Agent" in the HTTP request headers.
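For example, with the requests library (a minimal sketch; the User-Agent string is just one plausible browser identifier, not a required value):

import requests

# Send a browser-like User-Agent instead of the default 'python-requests/x.y',
# which many sites use to detect and block scripts.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36'}
response = requests.get('https://www.whoscored.com/Matches/1006233/live', headers=headers)
print(response.status_code)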