How to scrape a page that first redirects to another page

Question

I am trying to scrape some text from https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms, but when the link loads through the web driver it automatically redirects to a login page. After I log in, it goes straight to the page I want to scrape, but Beautiful Soup just keeps scraping the login page.

How do I make Beautiful Soup scrape the page I want and not the login page?

I have already tried putting a time.sleep() before the scraping call to give myself time to log in, but that didn't work either.

import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms").text, 'html.parser')
while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)

Answer 1:

What you probably need is a persistent session with requests, which keeps the cookies set at login and sends them with every later request. This answer probably covers exactly what you need. The general idea is simple:

1. Open a session and send a request to the website.
2. Send the login POST request so it logs you in.
3. Query the URL with the same session.

You will need to understand how the login POST request is structured and what data is passed (username, email, etc.), and create a dictionary with that data.
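One way to find out what the login form expects is to list its hidden fields, since values such as csrfmiddlewaretoken are usually embedded there rather than typed by the user. The sketch below is only an illustration under that assumption: the helper names and the sample markup are hypothetical, and it uses the standard library parser so it runs without extra installs.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Keep only hidden inputs that actually have a name.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical login form markup, for illustration only.
sample_html = """
<form method="post" action="/login/">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

print(hidden_fields(sample_html))
```

In practice you would pass in the text of the login page fetched with the same session, then copy the collected values into the login data. With the field names known, the full flow might look like this: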

import requests
from bs4 import BeautifulSoup

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'

# A Session keeps cookies between requests, so the login carries over.
session = requests.Session()

# Fill in the empty values; the field names must match the site's login form.
login_data = {
    'username': '',
    'csrfmiddlewaretoken': '',
    'password': '',
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

# This will redirect you to the login page and it might load some initial cookies.
session.get(url)

r = session.post('https://<theurl>/login.py', data=login_data)

if r.status_code == 200:  # if the request was accepted
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code
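
A status code of 200 does not necessarily mean the login worked, because many sites answer a failed login with the login page again. A rough extra check is to look at where the request actually ended up. The helper below is a hypothetical sketch, and the "/login" path prefix is an assumption about the site, not something confirmed here.

```python
from urllib.parse import urlsplit


def landed_on_login(final_url, login_path="/login"):
    """Return True if the final URL's path starts with the assumed login path."""
    return urlsplit(final_url).path.startswith(login_path)


# Only the path is inspected, so a "next" query parameter does not matter.
print(landed_on_login("https://www.memrise.com/login/?next=/course/2021573/"))
print(landed_on_login("https://www.memrise.com/course/2021573/french-1-145/"))
```

With requests, res.url holds the final URL after redirects and res.history lists the intermediate redirect responses, so calling landed_on_login(res.url) after session.get(url) tells you whether you are about to parse the login page instead of the course page.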

Source: https://stackoverflow.com/questions/57273335/how-to-scrape-a-page-if-it-is-redirected-to-another-before

Tags

python

html

web-scraping

beautifulsoup