Python - save requests or BeautifulSoup object locally

一整个雨季 2021-01-18 01:52

I have some code that is quite long, so it takes a long time to run. I want to simply save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally, so that I can load it back later instead of re-running the requests every time.

2 Answers
  • 2021-01-18 02:37

    Storing requests locally and restoring them as BeautifulSoup objects later on

    If you are iterating through the pages of a web site, you can store each page with requests as shown below. Create a folder soupCategory in the same folder as your script.

    Use a recent user agent string for the headers:

    import time
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}
    
    def getCategorySoup():
        # session with retries, so transient connection errors do not abort the crawl
        session = requests.Session()
        retry = Retry(connect=7, backoff_factor=0.5)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
    
        basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
        totalPages = 1525  # put your number of pages here
        t0 = time.time()
        for i in range(1, totalPages + 1):  # +1 so the last page is included
            url = basic_url + str(i)
            r = session.get(url, headers=headers)  # use the session so the retry adapter applies
            pageName = "./soupCategory/" + str(i) + ".html"
            with open(pageName, mode='w', encoding='UTF-8') as f:
                f.write(r.text)
            print(pageName, end=" ")
        t1 = time.time()
        print("Total time for getting", totalPages, "category pages is", round(t1 - t0), "seconds")
    

    Later on, you can create the BeautifulSoup object as @merlin2011 mentioned:

    with open("/soupCategory/1.html") as f:
      soup = BeautifulSoup(f)
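
    If you have saved many pages this way, here is a minimal sketch for loading them all back as BeautifulSoup objects. It assumes the ./soupCategory layout created above; the helper name loadCategorySoups is illustrative, not from the original answer:

    from bs4 import BeautifulSoup
    import glob
    
    def loadCategorySoups():
        soups = []
        # sorted() here is lexicographic ("1.html", "10.html", "2.html", ...);
        # use a numeric sort key if page order matters
        for pageName in sorted(glob.glob("./soupCategory/*.html")):
            with open(pageName, encoding='UTF-8') as f:
                soups.append(BeautifulSoup(f, 'html.parser'))
        return soups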
    
  • 2021-01-18 02:47

    Since name.content is just HTML, you can dump it to a file and read it back later.

    Usually the bottleneck is not the parsing, but instead the network latency of making requests.

    from bs4 import BeautifulSoup
    import requests
    
    url = 'https://google.com'
    name = requests.get(url)
    
    # name.content is bytes, so write the file in binary mode
    with open("/tmp/A.html", "wb") as f:
        f.write(name.content)
    
    # read it back in
    with open("/tmp/A.html", "rb") as f:
        soup = BeautifulSoup(f, 'html.parser')
        # do something with soup
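
    A related pattern, beyond what this answer shows, is a small fetch-or-load helper so repeated runs only hit the network once per URL. This is a sketch; the cache directory and the get_soup name are illustrative, not from the original answer:

    from bs4 import BeautifulSoup
    import hashlib
    import os
    import requests
    
    CACHE_DIR = "/tmp/page_cache"  # illustrative cache location
    
    def get_soup(url):
        os.makedirs(CACHE_DIR, exist_ok=True)
        # derive a stable file name from the URL
        path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest() + ".html")
        if os.path.exists(path):
            with open(path, "rb") as f:
                html = f.read()
        else:
            html = requests.get(url).content
            with open(path, "wb") as f:
                f.write(html)
        return BeautifulSoup(html, "html.parser")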
    

    Here is some anecdotal evidence that the bottleneck is in the network.

    from bs4 import BeautifulSoup
    import requests
    import time
    
    url = 'https://google.com'
    
    t1 = time.perf_counter()
    name = requests.get(url)
    t2 = time.perf_counter()
    soup = BeautifulSoup(name.content, 'html.parser')
    t3 = time.perf_counter()
    
    print(t2 - t1, t3 - t2)

    Output, from running on a ThinkPad X1 Carbon with a fast campus network:

    0.11 0.02
    