Python - save requests or BeautifulSoup object locally

I have some code that is quite long, so it takes a long time to run. I want to simply save either the requests object (in this case "name") or the BeautifulSoup object (i

Storing requests locally and restoring them as BeautifulSoup objects later on

If you are iterating through the pages of a website, you can store each page as shown here. First create a folder named soupCategory in the same folder as your script.

Use any recent browser user agent string for the headers:

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

def getCategorySoup():
    # One session for all pages, with retries on failed connections.
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    totalPages = 1525  # put your number of pages here
    for i in range(1, totalPages + 1):  # pages are numbered 1..totalPages
        url = basic_url + str(i)
        # Go through the session (not requests.get) so the retry adapter applies.
        r = session.get(url, headers=headers)
        pageName = "./soupCategory/" + str(i) + ".html"
        with open(pageName, mode='w', encoding='UTF-8') as f:
            f.write(r.text)
        print(pageName, end=" ")
    t1 = time.time()
    total = t1 - t0
    print("Total time for getting", totalPages, "category pages is", round(total), "seconds")

Later on you can create the BeautifulSoup object, as @merlin2011 mentioned, with:

with open("./soupCategory/1.html", encoding="UTF-8") as f:
  soup = BeautifulSoup(f, "html.parser")
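
To go through all the saved pages rather than just the first one, something like the following sketch could work. The function name iter_saved_pages is hypothetical, and it assumes the files were written as UTF-8 and named "<number>.html" in ./soupCategory, as in the download loop above. It only reads the files back; parsing is left to the caller.

```python
import re
from pathlib import Path


def iter_saved_pages(folder="./soupCategory"):
    """Yield (page_number, html_text) for each '<number>.html' file, in numeric order."""
    pages = []
    for path in Path(folder).glob("*.html"):
        # Skip files whose name is not a plain page number.
        if re.fullmatch(r"\d+", path.stem):
            pages.append((int(path.stem), path))
    # Sort by the integer page number so that 10.html comes after 2.html.
    for number, path in sorted(pages):
        yield number, path.read_text(encoding="UTF-8")
```

Each yielded html string can then be passed to BeautifulSoup(html, "html.parser") inside a for loop, so only one page is held in memory at a time.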

借酒劲吻你

2021-01-18 02:47

Since name.content is just HTML, you can just dump this to a file and read it back later.

Usually the bottleneck is not the parsing, but instead the network latency of making requests.

from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so the file must be opened in binary mode
with open("/tmp/A.html", "wb") as f:
  f.write(name.content)


# read it back in
with open("/tmp/A.html", "rb") as f:
  soup = BeautifulSoup(f, "html.parser")
  # do something with soup
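
The same dump-and-reload idea can be wrapped in a small helper so that each URL is downloaded only once. This is a minimal sketch, not part of the original answer: cached_fetch, the fetch parameter, and the ./html_cache folder are all hypothetical names. The helper takes the download function as an argument, so it has no dependency on requests itself.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="./html_cache"):
    """Return the raw bytes for url, downloading only if no cached copy exists.

    fetch is any callable that takes a URL and returns bytes
    (an assumption of this sketch, not a requests API).
    """
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = folder / name
    if path.exists():
        return path.read_bytes()
    content = fetch(url)
    path.write_bytes(content)
    return content
```

With requests, a call could look like html = cached_fetch(url, lambda u: requests.get(u).content), followed by BeautifulSoup(html, "html.parser"). The first run pays the network cost; later runs read from disk.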

Here is some anecdotal evidence that the bottleneck is the network.

from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

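# time.clock() was removed in Python 3.8; perf_counter() measures elapsed time
t1 = time.perf_counter()
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content, "html.parser")
t3 = time.perf_counter()

# request time, then parse time, in seconds
print(t2 - t1, t3 - t2)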

Output from a run on a Thinkpad X1 Carbon with a fast campus network (request time, then parse time, in seconds):

0.11 0.02
