Python Parsing HTML from url into PD ValueError: No tables found

Question

I'm trying to parse the HTML below into a DataFrame, but I keep getting an error even though I can clearly see a table defined in the HTML. I'd appreciate your help.

<table><tr><td><a

Error

ValueError: No tables found

My code

import pandas as pd 
url='http://rssfeeds.s3.amazonaws.com/goldbox?'
#dfs = pd.read_html(requests.get(url).text)
dfs = pd.read_html(url)
dfs[0].head()

I also tried feedparser, with no luck: I don't get any data.

import feedparser
import pandas as pd
import time

rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
    
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.dealUrl, post.discountPercentage))
df = pd.DataFrame(posts, columns=['title', 'dealUrl', 'discountPercentage'])
df.tail()

Answer 1:

The page is very large, so the request can time out unless you allow a generous timeout. Also, the content I received looks different from what you posted: it appears to be an RSS (XML) feed rather than an HTML page.

import pandas as pd
from simplified_scrapy import SimplifiedDoc, utils, req
html = req.get('http://rssfeeds.s3.amazonaws.com/goldbox',
               timeout=600)

posts = {'title': [], 'link': [], 'description': []}
doc = SimplifiedDoc(html)
items = doc.selects('item')
for item in items:
    posts['title'].append(item.title.text)
    posts['link'].append(item.link.text)
    posts['description'].append(item.description.text)

df = pd.DataFrame(posts)
df.tail()

To extract data from each description:

posts = {'listPrice': [], 'dealPrice': [], 'expires': []}
doc = SimplifiedDoc(html)
descriptions = doc.selects('item').description # Get all descriptions
for table in descriptions:
    d = SimplifiedDoc(table.unescape())  # Build a doc object from the unescaped description
    img = d.img.src  # Image src (not stored in posts below)

    # Each lookup falls back to an empty string when the text is not found
    listPrice = d.getElementByText('List Price:')
    listPrice = listPrice.strike.text if listPrice else ''

    dealPrice = d.getElementByText('Deal Price: ')
    dealPrice = dealPrice.text[len('Deal Price: '):] if dealPrice else ''

    expires = d.getElementByText('Expires ')
    expires = expires.text[len('Expires '):] if expires else ''

    posts['listPrice'].append(listPrice)
    posts['dealPrice'].append(dealPrice)
    posts['expires'].append(expires)
df = pd.DataFrame(posts)
df.tail()
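
If you would rather not depend on simplified_scrapy, roughly the same extraction can be sketched with the standard library only. This is a swapped-in technique, not the code above: `DescriptionParser` and `field_after` are hypothetical helper names, and the sample string is trimmed from one item's description in this feed. It only matches plain text nodes by prefix, so it does not handle the `<strike>` lookup used for the list price.

```python
from html import unescape
from html.parser import HTMLParser

# Trimmed from one <description> in the feed; in the raw feed the
# markup is entity-escaped like this.
ESCAPED_DESCRIPTION = (
    '&lt;table&gt;&lt;tr&gt;&lt;td&gt;'
    '&lt;img src="https://images-na.ssl-images-amazon.com/images/I/41O4Qc3FCBL._SL160_.jpg"'
    ' alt="Product Image"/&gt;'
    '&lt;/td&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;'
)


class DescriptionParser(HTMLParser):
    """Collect non-empty text nodes and the src of the first <img>."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.img_src = None

    def handle_starttag(self, tag, attrs):
        if tag == 'img' and self.img_src is None:
            self.img_src = dict(attrs).get('src')

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def field_after(texts, prefix):
    """Return the text following `prefix`, or '' if no text node starts with it."""
    for text in texts:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ''


parser = DescriptionParser()
parser.feed(unescape(ESCAPED_DESCRIPTION))  # unescape first, then parse as HTML
parser.close()

row = {
    'img': parser.img_src,
    'dealPrice': field_after(parser.texts, 'Deal Price: '),  # absent in this sample
    'expires': field_after(parser.texts, 'Expires '),
}
print(row)
```

A list of such `row` dicts, one per item, could then be handed to `pd.DataFrame` in the same way as `posts` above.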

The page data I get is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Amazon.com Gold Box Deals</title>
    <link>http://www.amazon.com/gp/goldbox</link>
    <description>Amazon.com Gold Box Deals</description>
    <pubDate>Thu, 28 Jun 2018 08:50:16 GMT</pubDate>
    <dc:date>2018-06-28T08:50:16Z</dc:date>
    <image>
      <title>Amazon.com Gold Box Deals</title>
      <url>http://images.amazon.com/images/G/01/rcm/logo2.gif</url>
      <link>http://www.amazon.com/gp/goldbox</link>
    </image>
    <item>
      <title>Deal of the Day: Withings Activit? Steel - Activity and Sleep Tracking Watch</title>
      <link>https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20</link>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;a href="https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20" target="_blank"&gt;&lt;img src="https://images-na.ssl-images-amazon.com/images/I/41O4Qc3FCBL._SL160_.jpg" alt="Product Image" style='border:0'/&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;tr&gt;&lt;td&gt;Withings Activit? Steel - Activity and Sleep Tracking Watch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
      <pubDate>Thu, 28 Jun 2018 07:00:10 GMT</pubDate>
      <guid isPermaLink="false">http://promotions.amazon.com/gp/goldbox/</guid>
      <dc:date>2018-06-28T07:00:10Z</dc:date>
    </item>
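
This excerpt probably explains both failures in the question. The table markup only exists in escaped form (`&lt;table&gt;`) inside `<description>`, so an HTML table parser has no `<table>` element to find. Separately, `rawrss = ('...')` in the feedparser attempt is a plain string, not a tuple, so the `for` loop iterates over single characters. The excerpt also shows no `dealUrl` or `discountPercentage` elements, so those attributes are unlikely to be available. Here is a minimal standard-library sketch of both points, assuming a feed trimmed down from the excerpt above (`RAW_FEED` is that trimmed sample, not the full response):

```python
import xml.etree.ElementTree as ET

# Trimmed from the feed excerpt above; only one item, with a shortened description.
RAW_FEED = """<rss version="2.0">
  <channel>
    <title>Amazon.com Gold Box Deals</title>
    <item>
      <title>Deal of the Day: Withings Activit? Steel - Activity and Sleep Tracking Watch</title>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(RAW_FEED)
items = root.findall('./channel/item')
description = items[0].findtext('description')

# 1) The raw document contains no literal <table> tag, only the escaped form.
print('<table' in RAW_FEED)      # False
print('<table' in description)   # True: the XML parser decoded the entities

# 2) Parentheses alone do not create a tuple, so this is still a str ...
rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
print(type(rawrss).__name__)     # str
print(next(iter(rawrss)))        # h  (the loop would "parse" one character at a time)

# ... whereas a list (or a tuple with a trailing comma) yields the whole URL.
feeds = ['http://rssfeeds.s3.amazonaws.com/goldbox']
print(next(iter(feeds)))

# One dict per item; a list like this can be passed to pd.DataFrame.
rows = [
    {'title': item.findtext('title'), 'description': item.findtext('description')}
    for item in items
]
print(len(rows))
```

Once the description has been decoded like this, its text is ordinary HTML, which is why the answer above unescapes each description before searching it.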

Source: https://stackoverflow.com/questions/63122779/python-parsing-html-from-url-into-pd-valueerror-no-tables-found

Tags

python

dataframe

rss-reader