I'm using the Newspaper module for Python found here.
In the tutorials, it describes how you can pool the building of different newspapers s.t. it generates them at the same time.
To build upon Joseph Valls's answer: I'm assuming the original poster wanted to use multithreading to extract a bunch of data and store it somewhere properly. After much trial and error, I think I have found a solution. It may not be the most efficient, but it works; I've tried to improve it further, but I suspect the newspaper3k plugin may be a bit buggy. In any case, the following works for extracting the desired elements into a DataFrame.
import newspaper
from newspaper import news_pool
import pandas as pd

gamespot_paper = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
bbc_paper = newspaper.build('https://www.bbc.com/news', memoize_articles=False)

papers = [gamespot_paper, bbc_paper]
# Download all articles from both sources concurrently, 4 threads per source
news_pool.set(papers, threads_per_source=4)
news_pool.join()
# Create our final DataFrame
df_articles = pd.DataFrame()
# Set a download limit per source
limit = 100

for source in papers:
    # Temporary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source = []
    count = 0
    for article_extract in source.articles:
        # Check the limit before parsing, so we stop at exactly `limit` articles
        if count >= limit:
            break
        article_extract.parse()
        # Append the elements we want to extract
        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)
        # Update count
        count += 1
    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    # Append to the final DataFrame (DataFrame.append was removed in pandas 2.0, so use concat)
    df_articles = pd.concat([df_articles, df_temp], ignore_index=True)
    print('source extracted')
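One refinement I'd suggest on top of this: parse() raises an ArticleException when an article's download failed (which happens now and then with the threaded downloader), so a try/except keeps one bad article from stopping the whole loop. Collecting rows as plain dicts and building the DataFrame once at the end also avoids the repeated concatenation. Here is a rough sketch of that variant, reusing the papers list and limit from above (the extract_articles helper is just my own naming):

from newspaper.article import ArticleException

def extract_articles(sources, limit=100):
    # Parse up to `limit` articles per source, skipping any whose download failed
    rows = []
    for source in sources:
        parsed = 0
        for article in source.articles:
            if parsed >= limit:
                break
            try:
                article.parse()
            except ArticleException:
                # Download failed or never started for this article; skip it
                continue
            rows.append({'Title': article.title,
                         'Text': article.text,
                         'Source': article.source_url})
            parsed += 1
    # Build the DataFrame once instead of concatenating per source
    return pd.DataFrame(rows)

df_articles = extract_articles(papers, limit=limit)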
Please do suggest any improvements!