Question
I'm learning Python. I've set myself a wee goal of building an RSS scraper. I'm trying to gather the Author, Link and Title. From there I want to write to a CSV.
I'm encountering some problems. I've searched for the answer since last night but can't seem to find a solution. I have a feeling that there is a bit of knowledge I'm missing between what feedparser is parsing and moving it to a CSV, but I don't have the vocabulary yet to know what to Google.
- How do I remove special characters such as '[' and '''?
- How do I write author, link and title to a new row when I'm creating the new file?
1) Special Characters
import feedparser
rssurls = 'http://feeds.feedburner.com/TechCrunch/'
techart = feedparser.parse(rssurls)
# feeds = []
# for url in rssurls:
# feedparser.parse(url)
# for feed in feeds:
# for post in feed.entries:
# print(post.title)
# print(feed.entires)
techdeets = [post.author + " , " + post.title + " , " + post.link for post in techart.entries]
techdeets = [y.strip() for y in techdeets]
techdeets
Output: I get the information I need, but .strip() doesn't remove the brackets and quotes.
['Darrell Etherington , Spin launches first city-sanctioned dockless bike sharing in Bay Area , http://feedproxy.google.com/~r/Techcrunch/~3/BF74UZWBinI/', 'Ryan Lawler , With $5.3 million in funding, CarDash wants to change how you get your car serviced , http://feedproxy.google.com/~r/Techcrunch/~3/pkamfdPAhhY/', 'Ron Miller , AlienVault plug-in searches for stolen passwords on Dark Web , http://feedproxy.google.com/~r/Techcrunch/~3/VbmdS0ODoSo/', 'Lucas Matney , Firefox for Windows gets native WebVR support, performance bumps in latest update , http://feedproxy.google.com/~r/Techcrunch/~3/j91jQJm-f2E/',...]
2) Writing to CSV
import csv
savedfile = open('/test1.txt', 'w')
savedfile.write(str(techdeets) + "/n")
savedfile.close()
import pandas as pd
df = pd.read_csv('/test1.txt', encoding='cp1252')
df
Output: The output was a dataframe with only 1 row and multiple columns.
Answer 1:
You are almost there :-)
How about using pandas to create a dataframe first and then saving it? Something like this, continuing from your code:
df = pd.DataFrame(columns=['author', 'title', 'link'])
for i, post in enumerate(techart.entries):
df.loc[i] = post.author, post.title, post.link
Then you can save it:
df.to_csv('myfilename.csv', index=False)
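On the special characters: the '[' and the quote marks are not in the feed data. They appear because str() on a list writes out Python's list representation, and .strip() only trims whitespace at the two ends of each string. If you would rather not use pandas, here is a minimal sketch with the standard csv module. The tuples are two entries copied from your output, standing in for techart.entries, and the file name rows.csv is just a placeholder.

```python
import csv

# Stand-in for (post.author, post.title, post.link) taken from techart.entries.
rows = [
    ("Darrell Etherington",
     "Spin launches first city-sanctioned dockless bike sharing in Bay Area",
     "http://feedproxy.google.com/~r/Techcrunch/~3/BF74UZWBinI/"),
    ("Ryan Lawler",
     "With $5.3 million in funding, CarDash wants to change how you get your car serviced",
     "http://feedproxy.google.com/~r/Techcrunch/~3/pkamfdPAhhY/"),
]

# str() on a list gives its repr: this is where '[' and the quotes come from.
as_text = str([" , ".join(r) for r in rows])

# csv.writer writes one line per row and quotes any field that contains a comma.
with open("rows.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["author", "title", "link"])  # header row
    writer.writerows(rows)

# Read the file back to confirm each entry landed on its own row.
with open("rows.csv", newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))
```

Note that the second title contains a comma, so joining the fields with " , " by hand would split it into two columns when the file is read back; letting the csv writer handle the quoting avoids that.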
Alternatively, you can fill the dataframe straight from the feedparser entries:
>>> import feedparser
>>> import pandas as pd
>>>
>>> rssurls = 'http://feeds.feedburner.com/TechCrunch/'
>>> techart = feedparser.parse(rssurls)
>>>
>>> df = pd.DataFrame()
>>>
>>> df['author'] = [post.author for post in techart.entries]
>>> df['title'] = [post.title for post in techart.entries]
>>> df['link'] = [post.link for post in techart.entries]
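One thing to watch for: as far as I know, feedparser raises AttributeError when an entry has no author, so post.author can fail on some feeds. A hedged variant, assuming you would rather leave that cell empty, is to build the rows with getattr and a default. The SimpleNamespace objects below are stand-ins for feedparser entries so the sketch runs without a network call (the second one has its author dropped on purpose), and feed.csv is a placeholder name.

```python
from types import SimpleNamespace

import pandas as pd

# Stand-ins for techart.entries, based on two entries from the question's output.
entries = [
    SimpleNamespace(
        author="Ron Miller",
        title="AlienVault plug-in searches for stolen passwords on Dark Web",
        link="http://feedproxy.google.com/~r/Techcrunch/~3/VbmdS0ODoSo/",
    ),
    # Hypothetical case: author removed here to show the fallback.
    SimpleNamespace(
        title="Firefox for Windows gets native WebVR support, performance bumps in latest update",
        link="http://feedproxy.google.com/~r/Techcrunch/~3/j91jQJm-f2E/",
    ),
]

# One dict per entry; a missing author becomes an empty string.
records = [
    {"author": getattr(post, "author", ""), "title": post.title, "link": post.link}
    for post in entries
]

df = pd.DataFrame(records, columns=["author", "title", "link"])
df.to_csv("feed.csv", index=False, encoding="utf-8")

# Read it back, keeping empty cells as empty strings rather than NaN.
check = pd.read_csv("feed.csv", keep_default_na=False)
```

Building a list of dicts first also means the entries are walked once instead of once per column.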
Source: https://stackoverflow.com/questions/45569701/feedparser-removing-special-characters-and-writing-to-csv