How to scrape newspaper articles from a website using Selenium and BeautifulSoup in Python?

南笙 2021-01-16 23:44

I am trying to collect the date, title, and content from the newspaper (The New York Times).

I managed to get the date and title, but I couldn't get the full article content.

3 Answers
  •  走了就别回头了
    2021-01-17 00:16

    You are looking at only the first page of the search results, which contains the list of articles. To get the content of an article, you have to send a request to the article's URL and fetch the content from there.

    Here I am fetching the Title, Author, Publish Date, and Content and storing them in a list. From that list you can create a DataFrame later if required, as sketched after the code below.
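    The code below assumes that search_results already holds the parsed search page. If you loaded that page with Selenium (the results are rendered by JavaScript), it could have been obtained roughly like this; note that the data-testid selector and the query string are only assumptions and may need adjusting to the current NYT markup:

    from selenium import webdriver
    from bs4 import BeautifulSoup

    # Load the search page in a real browser so the JavaScript-rendered
    # results are present in the page source. The query string is an example.
    driver = webdriver.Chrome()
    driver.get("https://www.nytimes.com/search?query=python")

    # Hand the rendered HTML to BeautifulSoup and grab the container that
    # holds the result links. The data-testid value is an assumption.
    soup = BeautifulSoup(driver.page_source, 'lxml')
    search_results = soup.find("ol", {"data-testid": "search-results"})

    driver.quit()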

    import requests
    from bs4 import BeautifulSoup

    newyork_times_list = []

    # Every <a href> inside the search results points to an article page.
    for a in search_results.find_all('a', href=True):

        newyork_times = {}
        page_url = "https://www.nytimes.com" + a['href']

        try:
            # URL
            newyork_times['URL'] = page_url

            # Request the article page and parse it
            page = requests.get(page_url)
            page_soup = BeautifulSoup(page.content, 'lxml')

            # Title
            newyork_times['Title'] = page_soup.find('title').text

            # Content: the article body is split across several
            # "StoryBodyCompanionColumn" divs, so concatenate their text
            page_content = ''
            page_soup_div = page_soup.find_all("div", {"class": "StoryBodyCompanionColumn"})
            for p_content in page_soup_div:
                page_content = page_content + p_content.text
            newyork_times['Content'] = page_content

            # Publish date
            page_soup_span = page_soup.find_all("time")
            newyork_times['Publish Date'] = page_soup_span[0].text

            # Author
            page_soup_span = page_soup.find_all("span", {"itemprop": "name"})
            newyork_times['Author'] = page_soup_span[0].text

            newyork_times_list.append(newyork_times)

            print('Processed', page_url)
        except Exception:
            print('ERROR!', page_url)

    print('Done')
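
    If you do want the DataFrame mentioned above, the list of dictionaries can be passed straight to pandas (assuming pandas is installed; the CSV file name is just an example):

    import pandas as pd

    # Each dictionary in newyork_times_list becomes one row;
    # missing keys simply become NaN.
    df = pd.DataFrame(newyork_times_list)

    # Save the scraped articles; the file name is arbitrary.
    df.to_csv('nytimes_articles.csv', index=False)
    print(df.head())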
    
