Question
I'm trying to parse the HTML below into a DataFrame, but I keep getting an error even though I can clearly see a table defined in the HTML. I'd appreciate your help.
<table><tr><td><a
Error
ValueError: No tables found
My code
import pandas as pd
url='http://rssfeeds.s3.amazonaws.com/goldbox?'
#dfs = pd.read_html(requests.get(url).text)
dfs = pd.read_html(url)
dfs[0].head()
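A likely reason for the error, assuming the URL returns an RSS feed like the excerpt shown in the answer: the table markup sits inside each item's description as escaped text, so the document contains no actual table element for pd.read_html to find. The sketch below illustrates this with the standard library only; SAMPLE is a hypothetical, shortened stand-in for the feed, not real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modelled on the feed; not real data.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Deal A</title>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE)

# Collect every element in the tree whose tag is "table".
table_elements = [el for el in root.iter() if el.tag == "table"]

# The table markup only exists as the (unescaped) text of <description>.
description = root.find("./channel/item/description").text

print(len(table_elements))
print(description.startswith("<table>"))
```

If this is what happens with the real feed, the description text would have to be extracted first and parsed as HTML in a second step.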
I also tried feedparser, with no luck: I don't get any data.
import feedparser
import pandas as pd
import time
rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.dealUrl, post.discountPercentage))
df = pd.DataFrame(posts, columns=['title', 'dealUrl', 'discountPercentage'])
df.tail()
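One probable cause of the empty result in the feedparser attempt: parentheses around a single string do not create a tuple, so the loop visits the URL one character at a time and never requests the feed itself. The small sketch below shows the difference; FEED_URL and feed_urls are names introduced here for illustration.

```python
FEED_URL = 'http://rssfeeds.s3.amazonaws.com/goldbox'

rawrss = (FEED_URL)      # parentheses only group: this is still a str
feed_urls = (FEED_URL,)  # the trailing comma makes a one-element tuple

# Looping over the str yields single characters, not the URL.
chars_seen = [url for url in rawrss]
# Looping over the tuple yields the whole URL once.
urls_seen = [url for url in feed_urls]

print(type(rawrss).__name__, chars_seen[:4])
print(type(feed_urls).__name__, urls_seen)
```

Separately, the fields dealUrl and discountPercentage do not appear in the feed excerpt shown in the answer, so they may not be available on the parsed entries even once the loop is fixed.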
Answer 1:
This page contains so much data that the request can time out, so I set a long timeout. In addition, the content I received seems to be different from yours.
import pandas as pd
from simplified_scrapy import SimplifiedDoc, utils, req
html = req.get('http://rssfeeds.s3.amazonaws.com/goldbox', timeout=600)
posts = {'title': [], 'link': [], 'description': []}
doc = SimplifiedDoc(html)
items = doc.selects('item')
for item in items:
    posts['title'].append(item.title.text)
    posts['link'].append(item.link.text)
    posts['description'].append(item.description.text)
df = pd.DataFrame(posts)
df.tail()
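For comparison, the same title/link/description extraction can be sketched with the standard library's xml.etree.ElementTree instead of simplified_scrapy. This is an alternative under assumptions, not the approach used above: SAMPLE is a hypothetical two-item feed and items_to_columns is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; not real data.
SAMPLE = """<rss version="2.0">
  <channel>
    <item>
      <title>Deal A</title>
      <link>https://example.com/a?x=1&amp;tag=demo</link>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;Deal A&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
    </item>
    <item>
      <title>Deal B</title>
      <link>https://example.com/b</link>
    </item>
  </channel>
</rss>"""


def items_to_columns(xml_text):
    """Collect the child text of every <item> into a dict of column lists."""
    posts = {'title': [], 'link': [], 'description': []}
    root = ET.fromstring(xml_text)
    for item in root.iter('item'):
        for field in posts:
            # findtext returns the default when the child element is missing.
            posts[field].append(item.findtext(field, default=''))
    return posts


posts = items_to_columns(SAMPLE)
print(posts['title'])
```

The returned dict has the same shape as posts in the answer, so it can be passed to pd.DataFrame in the same way.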
To get data from the description:
posts = {'listPrice': [], 'dealPrice': [], 'expires': []}
doc = SimplifiedDoc(html)
descriptions = doc.selects('item').description # Get all descriptions
for table in descriptions:
    d = SimplifiedDoc(table.unescape())  # Using description to build a doc object
    img = d.img.src  # Get the image src
    listPrice = d.getElementByText('List Price:')
    if listPrice:
        listPrice = listPrice.strike.text
    else:
        listPrice = ''
    dealPrice = d.getElementByText('Deal Price: ')
    if dealPrice:
        dealPrice = dealPrice.text[len('Deal Price: '):]
    else:
        dealPrice = ''
    expires = d.getElementByText('Expires ')
    if expires:
        expires = expires.text[len('Expires '):]
    else:
        expires = ''
    posts['listPrice'].append(listPrice)
    posts['dealPrice'].append(dealPrice)
    posts['expires'].append(expires)
df = pd.DataFrame(posts)
df.tail()
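The description step can also be sketched with the standard library's html.parser, as a rough equivalent of the loop above rather than a replacement for it. The sample markup is hypothetical; in particular, the nested strike element for the list price is an assumption inferred from the .strike.text access in the answer, and DescriptionParser and parse_description are names invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical description markup; the <strike> layout is an assumption.
SAMPLE_DESCRIPTION = (
    '<table><tr><td><a href="https://example.com/a">'
    '<img src="https://example.com/a.jpg" alt="Product Image"/></a></td>'
    '<td><tr><td>Deal A</td></tr>'
    '<tr><td>List Price: <strike>$20.00</strike></td></tr>'
    '<tr><td>Deal Price: $15.00</td></tr>'
    '<tr><td>Expires Jun 29, 2018</td></tr></td></tr></table>'
)


class DescriptionParser(HTMLParser):
    """Record the first image src and every non-empty text chunk."""

    def __init__(self):
        super().__init__()
        self.img_src = ''
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img' and not self.img_src:
            self.img_src = dict(attrs).get('src') or ''

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def parse_description(html_text):
    parser = DescriptionParser()
    parser.feed(html_text)
    parser.close()
    result = {'img': parser.img_src, 'listPrice': '', 'dealPrice': '', 'expires': ''}
    texts = parser.texts
    for i, text in enumerate(texts):
        if text.startswith('List Price:'):
            rest = text[len('List Price:'):].strip()
            # If the price sits in a nested tag, it arrives as the next chunk.
            following = texts[i + 1] if i + 1 < len(texts) else ''
            result['listPrice'] = rest or following
        elif text.startswith('Deal Price:'):
            result['dealPrice'] = text[len('Deal Price:'):].strip()
        elif text.startswith('Expires '):
            result['expires'] = text[len('Expires '):]
    return result


row = parse_description(SAMPLE_DESCRIPTION)
print(row)
```

Missing fields stay as empty strings, which matches the fallback behaviour of the loop above.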
The page data I get is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
<channel>
<title>Amazon.com Gold Box Deals</title>
<link>http://www.amazon.com/gp/goldbox</link>
<description>Amazon.com Gold Box Deals</description>
<pubDate>Thu, 28 Jun 2018 08:50:16 GMT</pubDate>
<dc:date>2018-06-28T08:50:16Z</dc:date>
<image>
<title>Amazon.com Gold Box Deals</title>
<url>http://images.amazon.com/images/G/01/rcm/logo2.gif</url>
<link>http://www.amazon.com/gp/goldbox</link>
</image>
<item>
<title>Deal of the Day: Withings Activit? Steel - Activity and Sleep Tracking Watch</title>
<link>https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&tag=rssfeeds-20</link>
<description><table><tr><td><a href="https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&tag=rssfeeds-20" target="_blank"><img src="https://images-na.ssl-images-amazon.com/images/I/41O4Qc3FCBL._SL160_.jpg" alt="Product Image" style='border:0'/></a></td><td><tr><td>Withings Activit? Steel - Activity and Sleep Tracking Watch</td></tr><tr><td>Expires Jun 29, 2018</td></tr></td></tr></table></description>
<pubDate>Thu, 28 Jun 2018 07:00:10 GMT</pubDate>
<guid isPermaLink="false">http://promotions.amazon.com/gp/goldbox/</guid>
<dc:date>2018-06-28T07:00:10Z</dc:date>
</item>
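If the pubDate values are also wanted as a column, the date format shown in the excerpt can be converted with the standard library. This is only a sketch that assumes every pubDate follows the same format as the one above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Value copied from the <pubDate> of the item in the excerpt.
pub_date_text = 'Thu, 28 Jun 2018 07:00:10 GMT'

# parsedate_to_datetime understands this RFC 2822 style date string.
pub_date = parsedate_to_datetime(pub_date_text)

print(pub_date.isoformat())
```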
Source: https://stackoverflow.com/questions/63122779/python-parsing-html-from-url-into-pd-valueerror-no-tables-found