Question
"""THIS IS MY CODE """
import requests
from bs4 import BeautifulSoup
import random
from selenium import webdriver
url ="http://www.yopmail.com/en/?smith"
request = requests.get(url)
soup = BeautifulSoup(request.text, 'html5lib')
print(soup)
"""IT RETURNING THIS OUTPUT """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</head>
<body onload="document.getElementById('f').submit();">
<form action="." id="f" method="post">
<input id="yp" name="yp" type="hidden" value="XAQHlAwL5ZwL1ZQZlAGH3ZGV"/>
<input id="login" name="login" type="hidden" value="smith"/>
<input id="id" name="id" type="hidden" value=""/>
</form>
<noscript><br/><br/> <strong>Your browser does not support javascript or it may be disabled</strong></noscript>
</body></html>
""" I WANT WHOLE SRC CODE INSTEAD OF THIS"""
Answer 1:
This happens because requests only downloads the raw HTML and never executes the JavaScript that builds the page. You can install requests-html and import HTMLSession from requests_html. Supported features:
- Full JavaScript support!
- CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection–pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async Support
Example:
pip install requests-html

from requests_html import HTMLSession
from requests_html import AsyncHTMLSession   # only needed if you use the async API

url2search = "https://******"

session = HTMLSession()
r = session.get(url2search)

# Render the page so its JavaScript is executed before you read the HTML
r.html.render()
Note: the first time you run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once. You may also need to install a few Linux packages to get pyppeteer working.
More details on this link.
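Putting the pieces together for the page from the question, a minimal sketch might look like the one below. It reuses the OP's URL and BeautifulSoup step; r.html.html holds the page source after rendering, and the sleep value passed to render() is an assumption that may need tuning so the auto-submitted form has time to complete.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "http://www.yopmail.com/en/?smith"   # URL from the question

session = HTMLSession()
r = session.get(url)

# Execute the page's JavaScript (including the onload form submit);
# sleep=2 is a guess to give the resulting navigation time to finish
r.html.render(sleep=2)

# r.html.html is the source after rendering, so the original BeautifulSoup
# call now sees the full document instead of the stub form
soup = BeautifulSoup(r.html.html, 'html5lib')
print(soup.prettify())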
Answer 2:
I would rather have written this as a comment than an answer, since I'm only giving you a hint, but I don't have enough reputation to write comments. So here are my two cents:
Notice the lines
<body onload="document.getElementById('f').submit();">
<form action="." id="f" method="post">
in that HTML source of yours. It might be a very basic protection against scraping attempts like the one you are making, and it might be sufficient to change your requests.get call to requests.post, turning the GET-style parameter /?smith in the URL into a POST parameter instead; a sketch of this idea follows below.
But just as well you might encounter even more code afterwards that requires JavaScript. Check the other answer by Basu_C in that case.
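For completeness, here is a minimal sketch of that POST idea, assuming yopmail will accept the replayed form. The hidden field names (yp, login, id) come straight from the output shown in the question; the site may still require extra cookies or headers, so treat this as an experiment rather than a guaranteed fix.

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Fetch the stub page and pull out the form that the onload JavaScript would submit
stub = session.get("http://www.yopmail.com/en/?smith")
form = BeautifulSoup(stub.text, 'html5lib').find('form', id='f')

# Collect every hidden input (yp, login, id) exactly as the browser would post it
payload = {inp['name']: inp.get('value', '') for inp in form.find_all('input')}

# The form's action is "." with method="post", so replay it against the same path
inbox = session.post("http://www.yopmail.com/en/", data=payload)
print(inbox.text)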
Source: https://stackoverflow.com/questions/60416507/python-requests-not-getting-full-page