How to scrape a page if it is redirected to another before

删除回忆录丶 提交于 2021-02-10 12:18:26

问题


I am trying to scrape some text off of https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms, but as you can see when it loads up the link through web-driver it automatically redirects it to a log in page. After I log in, it then goes straight to the page I want to scrape, but Beautiful Soup just keeps scraping the log in page.

How do I make it so Beautiful Soup scrapes the page I want it to and not the login page?

I have already tried putting a time.sleep() before it scrapes to give me time to log in but that didn't work either.

soup = BeautifulSoup(requests.get("https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms").text, 'html.parser')
while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)

回答1:


What you probably need is a persistent session with requests. This answer probably covers exactly what you need. The general idea is simple:

  1. You open a session and send a request to the website
  2. Send the login post request so it logs you in
  3. Query the url with the same session.

You will need to understand how the login post request is structured and what data is passed (username, email, etc) and create a json with that data.

import requests

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'

session = requests.session()

login_data = {
    'username': ,
    'csrfmiddlewaretoken': ,
    'password': ,
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

session.get(url) #this will redirect you and it might load some initial cookies info

r = session.post('https://<theurl>/login.py', login_data)

if r.status_code == 200: #if accepted the request
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code


来源:https://stackoverflow.com/questions/57273335/how-to-scrape-a-page-if-it-is-redirected-to-another-before

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!