How to scrape newspaper articles from a website using Selenium and BeautifulSoup in Python?

南笙 2021-01-16 23:44

I am trying to collect the date, title, and content from the newspaper (The New York Times).

I managed to get the date and title, but I couldn't get the full article content.

3 Answers
  •  走了就别回头了
    2021-01-17 00:16

    You are looking at only the first page of the search results, which contains the list of articles. To get the content of an article, you have to send a request to the article's URL and fetch the content from there.

    Here I am fetching the Title, Author, Publish Date, and Content and storing them in a list. From that list you can create a DataFrame later if required, as sketched after the code below.
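    The code below assumes that search_results already holds the parsed search page. If you loaded that page with Selenium (the results are rendered by JavaScript), it could have been obtained roughly like this; note that the data-testid selector and the query string are only assumptions and may need adjusting to the current NYT markup:

    from selenium import webdriver
    from bs4 import BeautifulSoup

    # Load the search page in a real browser so the JavaScript-rendered
    # results are present in the page source. The query string is an example.
    driver = webdriver.Chrome()
    driver.get("https://www.nytimes.com/search?query=python")

    # Hand the rendered HTML to BeautifulSoup and grab the container that
    # holds the result links. The data-testid value is an assumption.
    soup = BeautifulSoup(driver.page_source, 'lxml')
    search_results = soup.find("ol", {"data-testid": "search-results"})

    driver.quit()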

    import requests
    from bs4 import BeautifulSoup

    newyork_times_list = []

    # Every <a href> inside the search results points to an article page.
    for a in search_results.find_all('a', href=True):

        newyork_times = {}
        page_url = "https://www.nytimes.com" + a['href']

        try:
            # URL
            newyork_times['URL'] = page_url

            # Request the article page and parse it
            page = requests.get(page_url)
            page_soup = BeautifulSoup(page.content, 'lxml')

            # Title
            newyork_times['Title'] = page_soup.find('title').text

            # Content: the article body is split across several
            # "StoryBodyCompanionColumn" divs, so concatenate their text
            page_content = ''
            page_soup_div = page_soup.find_all("div", {"class": "StoryBodyCompanionColumn"})
            for p_content in page_soup_div:
                page_content = page_content + p_content.text
            newyork_times['Content'] = page_content

            # Publish date
            page_soup_span = page_soup.find_all("time")
            newyork_times['Publish Date'] = page_soup_span[0].text

            # Author
            page_soup_span = page_soup.find_all("span", {"itemprop": "name"})
            newyork_times['Author'] = page_soup_span[0].text

            newyork_times_list.append(newyork_times)

            print('Processed', page_url)
        except Exception:
            print('ERROR!', page_url)

    print('Done')
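
    If you do want the DataFrame mentioned above, the list of dictionaries can be passed straight to pandas (assuming pandas is installed; the CSV file name is just an example):

    import pandas as pd

    # Each dictionary in newyork_times_list becomes one row;
    # missing keys simply become NaN.
    df = pd.DataFrame(newyork_times_list)

    # Save the scraped articles; the file name is arbitrary.
    df.to_csv('nytimes_articles.csv', index=False)
    print(df.head())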
    
