Scrape Multiple URLs using Beautiful Soup

南笙 2021-02-02 02:56

I'm trying to extract specific classes from multiple URLs. The tags and classes stay the same, but I need my Python program to scrape all of them as I input each link.

Here's what I have so far:

2 Answers
  • 2021-02-02 03:17

    Put the URLs in a list and iterate through it.

    from bs4 import BeautifulSoup
    import requests
    
    # requests needs full URLs, including the scheme
    urls = ['https://www.website1.com', 'https://www.website2.com', 'https://www.website3.com']  # add more URLs here
    
    # scrape elements
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
    
        # print titles only
        h1 = soup.find("h1", class_="class-headline")
        print(h1.get_text())
    

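    A hedged variant of the same loop (the timeout, status check, and None guard are additions, not part of the original answer) skips URLs that fail to load or that lack the headline tag:

    import requests
    from bs4 import BeautifulSoup
    
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx responses
        except requests.RequestException as exc:
            print(f"Skipping {url}: {exc}")
            continue
    
        soup = BeautifulSoup(response.content, "html.parser")
        h1 = soup.find("h1", class_="class-headline")
        if h1 is not None:  # find() returns None when no tag matches
            print(h1.get_text())
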
    If you are going to prompt the user for each URL, it can be done this way:

    from bs4 import BeautifulSoup
    import requests
    
    msg = 'Enter URL, or type q and hit enter to exit: '
    url = input(msg)
    while url != 'q':
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
    
        # print titles only
        h1 = soup.find("h1", class_="class-headline")
        print(h1.get_text())
        url = input(msg)  # reassign, so entering q actually exits the loop
    
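    Since users often paste bare domains like www.example.com, a small helper (hypothetical, not from the answer) can prepend a scheme before calling requests.get:

    def normalize(url):
        # requests needs an explicit scheme; assume https when none is given
        if not url.startswith(("http://", "https://")):
            url = "https://" + url
        return url

    Call it as requests.get(normalize(url)) inside the loop.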
  • 2021-02-02 03:18

    If you want to scrape the links in batches, specify a batch size and iterate over the resulting chunks.

    from bs4 import BeautifulSoup
    import requests
    
    batch_size = 5
    urllist = ["url1", "url2", "url3"]  # add more URLs here
    # split the URL list into chunks of batch_size (range replaces Python 2's xrange)
    url_chunks = [urllist[x:x + batch_size] for x in range(0, len(urllist), batch_size)]
    
    def scrape_url(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        h1 = soup.find("h1", class_="class-headline")
        return h1.get_text()
    
    def scrape_batch(url_chunk):
        chunk_resp = []
        for url in url_chunk:
            chunk_resp.append(scrape_url(url))
        return chunk_resp
    
    for url_chunk in url_chunks:
        print(scrape_batch(url_chunk))  # print() is a function in Python 3
    
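    If the batching is meant to speed things up, a minimal Python 3 sketch with concurrent.futures (an addition, not part of the original answer) fetches the URLs in parallel threads instead, reusing scrape_url from above:

    from concurrent.futures import ThreadPoolExecutor
    
    # hypothetical parallel variant: each worker thread runs scrape_url on one URL
    with ThreadPoolExecutor(max_workers=batch_size) as executor:
        for title in executor.map(scrape_url, urllist):
            print(title)
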