Python web crawler sometimes returns half of the source code, sometimes all of it… From the same website

Submitted by 爱⌒轻易说出口 on 2020-01-17 13:45:55

Question


I have a spreadsheet of patent numbers that I'm getting extra data for by scraping Google Patents, the USPTO website, and a few others. I mostly have it running, but there's one thing I've been stuck on all day. When I request the USPTO site and grab the source code, it will sometimes give me the whole thing and work wonderfully, but other times it only gives me about the second half (and what I'm looking for is in the first).

I've searched around here quite a bit, and I haven't seen anyone with this exact issue. Here's the relevant piece of code (it's got some redundancies since I've been trying to figure this out for a while now, but I'm sure that's the least of its problems):

from bs4 import BeautifulSoup
import html5lib  # parser backend passed to BeautifulSoup below
import re
import csv
import urllib.request  # needed for urllib.request.urlopen below
import requests

# This is the base URL for Google Patents
gpatbase = "https://www.google.com/patents/US"
ptobase = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/"

# Bring in the patent numbers; read them up front so the file can be closed.
# Note: the file is opened read-only, so a csv.writer on this handle can't add
# the new info -- write results to a separate output file instead.
with open(r'C:\Users\Filepathblahblahblah\Patent Data\scrapeThese.csv', newline='') as csvfile:
    patreader = list(csv.reader(csvfile))

for row in patreader:
    patnum = row[0]
    #print(row)

    print(patnum)
    # Take each patent and append it to the base URL to get the actual one
    gpaturl = gpatbase + patnum
    ptourl = ptobase + patnum


    gpatreq = requests.get(gpaturl)
    gpatsource = gpatreq.text
    soup = BeautifulSoup(gpatsource, "html5lib")

    # Find the number of academic citations on that patent

    # From the Google Patents page, find the link labeled USPTO and extract
    # its URL; fall back to the constructed URL so uspto_link is always defined
    uspto_link = ptourl
    for tag in soup.find_all("a"):
        if tag.next_element == "USPTO":
            uspto_link = tag.get('href')

    requested = urllib.request.urlopen(uspto_link)
    source = requested.read()

    pto_soup = BeautifulSoup(source, "html5lib")

    print(uspto_link)
    # From the USPTO page, find the examiner's name and save it.
    # Only set prim when the label is found -- the original else branch reset
    # it back to "Not found" whenever a later <i> tag didn't match.
    prim = "Not found"
    for italics in pto_soup.find_all("i"):
        if italics.next_element == "Primary Examiner:":
            prim = italics.next_element
            break

    if prim != "Not found":
        examiner = prim.next_element
    else:
        examiner = "Not found"

    print(examiner)

As of now, it's about 50-50 on whether I'll get the examiner name or "Not found," and I don't see anything that the members of either group have in common with each other, so I'm all out of ideas.
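For what it's worth, a common mitigation for intermittently truncated responses is simply to retry the request until the content you need actually appears. This is only a sketch (not from the original post); `fetch_full_page` and its parameters are illustrative names, and it assumes the truncation is transient rather than deterministic:

```python
import time
import urllib.request

def fetch_full_page(url, marker, retries=3, delay=2):
    """Fetch url, retrying when the expected marker text is missing
    (i.e. the response appears to have come back truncated)."""
    text = ""
    for attempt in range(retries):
        with urllib.request.urlopen(url) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        if marker in text:
            return text
        time.sleep(delay)
    return text  # whatever we got on the last attempt
```

Calling `fetch_full_page(uspto_link, "Primary Examiner")` in place of the bare `urlopen` would at least distinguish a persistent server-side problem from a flaky one.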


Answer 1:


I still don't know what's causing the issue, but if someone has a similar problem I was able to figure out a workaround. If you send the source code to a text file instead of trying to work with it directly, it won't be cut off. I guess the issue comes after the data is downloaded, but before it's imported to the 'workspace'. Here's the piece of code I wrote into the scraper:

    # Requires "import sys" and "console_out = sys.stdout" earlier in the script
    if examiner == "Examiner not found":
        filename = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
        sys.stdout = open(filename, 'w')
        print(patnum)
        print(pto_soup.prettify())
        sys.stdout = console_out

        # Take that logged code and find the examiner name
        sec = "Not found"
        prim = "Not found"
        # Read back the same .html file written above (the extensions must match)
        scraped_code = open(filename)

        scrapedsoup = BeautifulSoup(scraped_code.read(), 'html5lib')
        # Find all italics (<i>) tags
        for italics in scrapedsoup.find_all("i"):
            for desc in italics.descendants:
                # Check whether any of them contain the words "Primary Examiner"
                if "Primary Examiner:" in desc:
                    prim = desc.next_element.strip()
                    #print("Primary found: ", prim)
                # Same for "Assistant Examiner"
                if "Assistant Examiner:" in desc:
                    sec = desc.next_element.strip()
                    #print("Assistant found: ", sec)

        # If an assistant examiner is in there, set 'examiner' to that name;
        # if there is no assistant examiner, use the primary examiner
        if sec != "Not found":
            examiner = sec
        elif prim != "Not found":
            examiner = prim
        else:
            examiner = "Examiner not found"
        # Show new results in the console
        print(examiner)
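As an aside, since the label text is fairly stable, the examiner lookup can also be done with a plain regex on the raw HTML, sidestepping the tag traversal entirely. This is a hedged sketch, not part of the original answer; it assumes the USPTO markup puts the name directly after the italicized label (e.g. `<i>Primary Examiner:</i> Name`), which may not hold for every record:

```python
import re

def find_examiner(html):
    """Prefer the assistant examiner, fall back to the primary,
    mirroring the preference order in the workaround above."""
    def grab(label):
        # Capture the text between the closing </i> of the label and the next tag
        m = re.search(label + r":</i>\s*([^<]+)", html, re.IGNORECASE)
        return m.group(1).strip() if m else None
    return grab("Assistant Examiner") or grab("Primary Examiner") or "Examiner not found"
```

Because it never builds a parse tree, this also behaves gracefully on a truncated page: a missing label just yields the "Examiner not found" fallback.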


Source: https://stackoverflow.com/questions/31820059/python-web-crawler-sometimes-returns-half-of-the-source-code-sometimes-all-of-i
