Python: Newspaper Module - Any way to pool getting articles straight from URLs?

后端未结

关注

 4  1053

广开言路 2021-01-01 04:06

I\'m using the Newspaper module for python found here.

In the tutorials, it describes how you can pool the building of different newspapers s.t. it generates them at

4条回答

小蘑菇 (楼主)

2021-01-01 04:35

To build upon Joseph's Valls answer. I'm assuming the original poster wanted to use multithreading to extract a bunch of data and store it somewhere properly. After much trying, I think I have found a solution, it may not be the most efficient but it works, I've tried to make it better however, I think the newspaper3k plugin could be a bit buggy. However, this works in extracting the desired elements to a DataFrame.

import newspaper
from newspaper import Article
from newspaper import Source
import pandas as pd

gamespot_paper = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
bbc_paper = newspaper.build("https://www.bbc.com/news", memoize_articles=False)
papers = [gamespot_paper, bbc_paper]
news_pool.set(papers, threads_per_source=4) 
news_pool.join()

#Create our final dataframe
df_articles = pd.DataFrame()

#Create a download limit per sources
limit = 100

for source in papers:
    #tempoary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source =[]

    count = 0

    for article_extract in source.articles:
        article_extract.parse()

        if count > limit:
            break

        #appending the elements we want to extract
        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        #Update count
        count +=1


    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #Append to the final DataFrame
    df_articles = df_articles.append(df_temp, ignore_index = True)
    print('source extracted')

Please do suggest any improvements!

0 讨论(0)

查看其它4个回答