Python: Newspaper Module - Any way to pool getting articles straight from URLs?

广开言路 2021-01-01 04:06

I'm using the Newspaper module for Python found here.

The tutorial describes how you can pool the building of different newspapers so that it generates them at

4 Answers
  •  小蘑菇 (original poster)
    2021-01-01 04:35

    To build upon Joseph Valls's answer: I'm assuming the original poster wanted to use multithreading to extract a batch of articles and store the results in a structured form. After much trial and error I found a solution. It may not be the most efficient, and my attempts to improve it ran into what look like bugs in newspaper3k, but it does extract the desired fields into a DataFrame.

    import newspaper
    from newspaper import news_pool
    from newspaper.article import ArticleException
    import pandas as pd

    gamespot_paper = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
    bbc_paper = newspaper.build('https://www.bbc.com/news', memoize_articles=False)
    papers = [gamespot_paper, bbc_paper]

    # Download the articles of all sources, using 4 threads per source
    news_pool.set(papers, threads_per_source=4)
    news_pool.join()

    # Maximum number of articles to keep per source
    limit = 100

    # One DataFrame per source; they are concatenated at the end
    frames = []

    for source in papers:
        # Temporary lists to store each element we want to extract
        list_title = []
        list_text = []
        list_source = []

        count = 0

        for article_extract in source.articles:
            # Stop before parsing once the per-source limit is reached
            if count >= limit:
                break

            try:
                article_extract.parse()
            except ArticleException:
                # The download failed for this article, so skip it
                continue

            # Append the elements we want to extract
            list_title.append(article_extract.title)
            list_text.append(article_extract.text)
            list_source.append(article_extract.source_url)

            # Update count
            count += 1

        frames.append(pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source}))
        print('source extracted')

    # Create our final DataFrame (DataFrame.append was removed in pandas 2.0)
    df_articles = pd.concat(frames, ignore_index=True)
    

    Please do suggest any improvements!
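For the case in the question title, where you already have article URLs and do not want to build whole sources, one possible approach is to skip `news_pool` and use the standard library's `concurrent.futures` instead. This is only a sketch under assumptions: `fetch_article` and `fetch_all` are hypothetical helper names, not part of newspaper3k, and the only newspaper calls used are `Article(url)`, `download()` and `parse()`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_article(url):
    """Download and parse one article; return the fields we want to keep."""
    # Imported lazily so fetch_all() can be reused with any other worker function.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return {'Title': article.title, 'Text': article.text, 'Source': article.source_url}


def fetch_all(urls, fetch=fetch_article, max_workers=4):
    """Run `fetch` over `urls` in a thread pool, skipping URLs that raise."""
    urls = list(urls)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap
        futures = [pool.submit(fetch, url) for url in urls]
        # Collect in submission order, so results follow the order of `urls`
        for url, future in zip(urls, futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # A failed download or parse should not abort the whole batch
                print(f'skipped {url}: {exc}')
    return results
```

The list of dicts returned by `fetch_all(urls)` can be passed straight to `pd.DataFrame(...)` to get the same Title/Text/Source columns as above.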
