Get Feeds from FeedParser and Import to Pandas DataFrame

孤街浪徒 提交于 2021-02-05 20:36:45

问题


I'm learning python. As practice I'm building a rss scraper with feedparser putting the output into a pandas dataframe and trying to mine with NLTK...but I'm first getting a list of articles from multiple RSS feeds.

I used this post on how to pass multiple feeds and combined it with an answer I got previously to another question on how to get it into a Pandas dataframe.

What the problem is, I want to be able to see the data from all the feeds in my dataframe. Currently I'm only able to access the first item in the list of feeds.

FeedParser seems to be doing it's job but when putting it into the Pandas df it only seems to grab the first RSS in the list.

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = []
for url in rawrss:
    feeds.append(feedparser.parse(url))

for feed in feeds:
    for post in feed.entries:
        print(post.title, post.link, post.summary)

df = pd.DataFrame(columns=['title', 'link', 'summary'])

for i, post in enumerate(feed.entries):
    df.loc[i] =  post.title, post.link, post.summary

df.shape

df

回答1:


Your code will loop through each post and print its data. The part of your code that adds the post data to the dataframe is not part of the loop (in python indentation is meaningful!), so you only see the data from one feed in your dataframe.

You can build a list of posts as you loop through the feeds, and then create a dataframe at the end:

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = [] # list of feed objects
for url in rawrss:
    feeds.append(feedparser.parse(url))

posts = [] # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary']) # pass data to init

You could optimize this a little bit by combining the two for loops:

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))


来源:https://stackoverflow.com/questions/45701053/get-feeds-from-feedparser-and-import-to-pandas-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!