Question
"""THIS IS MY CODE """
import requests
from bs4 import BeautifulSoup
import random
from selenium import webdriver
url ="http://www.yopmail.com/en/?smith"
request = requests.get(url)
soup = BeautifulSoup(request.text, 'html5lib')
print(soup)
"""IT RETURNING THIS OUTPUT """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</head>
<body onload="document.getElementById('f').submit();">
<form action="." id="f" method="post">
<input id="yp" name="yp" type="hidden" value="XAQHlAwL5ZwL1ZQZlAGH3ZGV"/>
<input id="login" name="login" type="hidden" value="smith"/>
<input id="id" name="id" type="hidden" value=""/>
</form>
<noscript><br/><br/> <strong>Your browser does not support javascript or it may be disabled</strong></noscript>
</body></html>
""" I WANT WHOLE SRC CODE INSTEAD OF THIS"""
Answer 1:
This happens because requests only downloads the raw HTML and never executes the JavaScript that builds the page. You can install requests-html and import HTMLSession from requests_html. Supported features:
- Full JavaScript support!
- CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection–pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async Support
Example:
pip install requests-html

from requests_html import HTMLSession
from requests_html import AsyncHTMLSession   # only needed if you use the async API

url2search = "https://******"

session = HTMLSession()
r = session.get(url2search)

# Render the page so its JavaScript is executed before you read the HTML
r.html.render()
Note: the first time you run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once. You may also need to install a few Linux packages to get pyppeteer working.
More details on this link.
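Putting the pieces together for the page from the question, a minimal sketch might look like the one below. It reuses the OP's URL and BeautifulSoup step; r.html.html holds the page source after rendering, and the sleep value passed to render() is an assumption that may need tuning so the auto-submitted form has time to complete.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "http://www.yopmail.com/en/?smith"   # URL from the question

session = HTMLSession()
r = session.get(url)

# Execute the page's JavaScript (including the onload form submit);
# sleep=2 is a guess to give the resulting navigation time to finish
r.html.render(sleep=2)

# r.html.html is the source after rendering, so the original BeautifulSoup
# call now sees the full document instead of the stub form
soup = BeautifulSoup(r.html.html, 'html5lib')
print(soup.prettify())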
Answer 2:
I would rather have written this as a comment than an answer, since I'm only giving you a hint, but I don't have enough reputation to write comments. So here are my two cents:
Notice the lines
<body onload="document.getElementById('f').submit();">
<form action="." id="f" method="post">
in that HTML source of yours. It might be a very basic protection against scraping attempts like the one you are making, and it might be sufficient to change your requests.get call to requests.post, turning the GET-style parameter /?smith in the URL into a POST parameter instead; a sketch of this idea follows below.
But just as well you might encounter even more code afterwards that requires JavaScript. Check the other answer by Basu_C in that case.
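For completeness, here is a minimal sketch of that POST idea, assuming yopmail will accept the replayed form. The hidden field names (yp, login, id) come straight from the output shown in the question; the site may still require extra cookies or headers, so treat this as an experiment rather than a guaranteed fix.

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Fetch the stub page and pull out the form that the onload JavaScript would submit
stub = session.get("http://www.yopmail.com/en/?smith")
form = BeautifulSoup(stub.text, 'html5lib').find('form', id='f')

# Collect every hidden input (yp, login, id) exactly as the browser would post it
payload = {inp['name']: inp.get('value', '') for inp in form.find_all('input')}

# The form's action is "." with method="post", so replay it against the same path
inbox = session.post("http://www.yopmail.com/en/", data=payload)
print(inbox.text)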
Source: https://stackoverflow.com/questions/60416507/python-requests-not-getting-full-page