Wait for page to load before getting data with requests.get in Python 3

Backend · unresolved · 4 answers · 860 views
陌清茗 2020-11-28 10:40

I have a page that I need to get the source of to use with BS4, but the middle of the page takes about a second to load its content, and requests.get captures the source before that content has loaded.

4 Answers
  • 2020-11-28 11:00

    I found a way to do that!

    r = requests.get('https://github.com', timeout=(3.05, 27))
    

    Here timeout takes two values: the first is the connect timeout (how long to wait while establishing the connection) and the second is the read timeout (how long to wait for the server to send a response). If you know roughly how long the page takes to populate, set the read timeout comfortably above that so the request is not cut off early.
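    To see what the read timeout actually does, here is a self-contained sketch using only the standard library instead of requests; the slow local server and the fetch helper are illustrative, not part of the original answer:

```python
import http.server
import threading
import time
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Responds to every GET after a half-second delay, standing in for a slow page."""
    def do_GET(self):
        time.sleep(0.5)
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

def fetch(url, timeout):
    """Return the response body, or None if the server exceeded `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:  # socket timeouts and URLError are both OSError subclasses
        return None

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

patient = fetch(url, timeout=3.0)  # read timeout above the delay: body arrives
hasty = fetch(url, timeout=0.1)    # read timeout below the delay: gives up
server.shutdown()
print(patient, hasty)
```

    Note that the timeout is a cap on waiting, not a pause: it will not make requests wait around for content that the server never sends in the first place.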

  • 2020-11-28 11:05

    It doesn't look like a waiting problem; it looks like the element is generated by JavaScript, and requests can't handle dynamically generated content. A suggestion is to use selenium together with PhantomJS (note that PhantomJS is no longer maintained; a headless Chrome or Firefox driver works the same way) to get the page source, then use BeautifulSoup for your parsing. The code shown below will do exactly that:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    url = "http://legendas.tv/busca/walking%20dead%20s03e02"
    browser = webdriver.PhantomJS()
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    a = soup.find('section', 'wrapper')
    

    Also, there's no need to use .findAll if you are looking for only one element; .find returns the first match.
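    As a quick illustration of that point, a standalone snippet (assuming bs4 is installed; the HTML here is made up):

```python
from bs4 import BeautifulSoup

html = "<section class='wrapper'><a>one</a><a>two</a></section>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")          # first matching element only
all_links = soup.find_all("a")  # list of every match
print(first.text, len(all_links))
```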

  • 2020-11-28 11:10

    In Python 3, you can also use the standard-library urllib module instead of requests, i.e.:

    import urllib.request
    import urllib.error

    try:
        with urllib.request.urlopen(url) as response:
            html = response.read().decode('utf-8')  # use whatever encoding the page declares
    except urllib.error.HTTPError as e:  # note: HTTPError lives in urllib.error, not urllib.request
        if e.code == 404:
            print(f"{url} is not found")
        elif e.code == 503:
            print(f"{url} base webservices are not available")
            # can add authentication here
        else:
            print('http error', e)
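    The pattern above can be wrapped in a small helper and exercised without any network access, since urllib handles data: URLs out of the box (fetch_html is an illustrative name, not part of the original answer):

```python
import urllib.error
import urllib.request

def fetch_html(url):
    """Return decoded page text, or None if the request failed with an HTTP error."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        print(f"{url} returned HTTP {e.code}")
        return None

# A data: URL keeps the demo self-contained -- no network needed.
html = fetch_html("data:text/html;charset=utf-8,<p>hello</p>")
print(html)
```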
    
  • 2020-11-28 11:21

    Just to show my way of doing it; maybe it will be of value to someone:

    import time
    import requests

    max_retries = 5   # some int
    retry_delay = 2   # some int (seconds between attempts)
    ready = False
    for n in range(max_retries):
        try:
            response = requests.get('https://github.com')
            if response.ok:
                ready = True
                break
        except requests.exceptions.RequestException:
            print("Website not available...")
        time.sleep(retry_delay)

    if not ready:
        print("Problem")
    else:
        print("All good")
    
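    The same loop can also be factored into a reusable helper. A sketch of that idea; the function name and the fake response used to exercise it are hypothetical, and the network call is injected so the logic can be tried without touching github.com:

```python
import time

def fetch_with_retries(get, url, max_retries=5, retry_delay=1.0):
    """Call get(url) up to max_retries times, sleeping between failed attempts.
    Returns the first successful response, or None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            response = get(url)
            if response.ok:
                return response
        except Exception:
            print("Website not available...")
        if attempt < max_retries - 1:
            time.sleep(retry_delay)
    return None

# Exercise the helper with a stand-in that fails twice, then succeeds.
class FakeResponse:
    ok = True

calls = {"n": 0}
def flaky_get(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not yet")
    return FakeResponse()

resp = fetch_with_retries(flaky_get, "https://github.com", retry_delay=0)
print(resp.ok, calls["n"])
```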