How to iterate through multiple results pages when web scraping with Beautiful Soup

Backend · open · 1 answer · 1082 views

余生分开走
Asked 2021-01-25 21:17

I have a script I wrote that uses Beautiful Soup to scrape a website for search results. I have managed to isolate the data I want via its class name.

1 Answer
  • Answered 2021-01-25 22:00

    Try this:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    
    # all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    all_letters = ['x']
    
    def get_url(letter, page_number):
        return ("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx"
                "?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number))
    
    def list_names(soup):
        # Each result row holds the name in a <td class="party-name"> cell.
        name_list = soup.find_all("td", {"class": "party-name"})
        for name in name_list:
            print(name.get_text())
    
    def get_soup(letter, page):
        url = get_url(letter, page)
        html = urlopen(url)
        return BeautifulSoup(html, "html.parser")
    
    def main():
        for letter in all_letters:
            bsObj = get_soup(letter, 1)
    
            # The pager is a <select> whose <option> values are the page
            # numbers; skip the currently selected option (page 1, which
            # we have already fetched).
            sel = bsObj.find('select', {"name": "ctl00$ctl00$InternetApplication_Body$WebApplication_Body$SearchResultPageList1"})
            pages = [opt.string for opt in sel.find_all("option")
                     if not opt.has_attr("selected")]
    
            list_names(bsObj)
    
            for page in pages:
                bsObj = get_soup(letter, page)
                list_names(bsObj)
    
    main()
    

    In the main() function, from the first page (get_soup(letter, 1)) we find the pager's select element and store its option values in a list; those values contain all the page numbers.

    Next, we loop over those page numbers to extract the data from the remaining pages.
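    The option-filtering step can be tried offline against a small static snippet, so you can see what it extracts before hitting the live site. This is a minimal sketch with a hypothetical sample of the pager markup (the real select element on the site has a much longer name attribute):

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical sample of the pager markup: one <option> per results
    # page, with the currently displayed page marked as selected.
    html = """
    <select name="SearchResultPageList1">
      <option selected="selected">1</option>
      <option>2</option>
      <option>3</option>
    </select>
    """

    soup = BeautifulSoup(html, "html.parser")
    sel = soup.find("select", {"name": "SearchResultPageList1"})

    # Keep only the pages that are NOT currently selected -- those are
    # the ones still left to fetch.
    pages = [opt.get_text() for opt in sel.find_all("option")
             if not opt.has_attr("selected")]
    print(pages)  # ['2', '3']
    ```

    Checking has_attr("selected") is a slightly more explicit way to express the same filter as the selected=lambda form used in the answer above.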
