Question
I am using Pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014
To get the data, I am writing:
dfs = pd.read_html(url)
The data looks great and is parsed correctly, except that only the first 40 rows are returned. The problem seems to be the way the tables are separated, which keeps pandas from picking up all the information.
How do you get pandas to get all the data from all the tables on that webpage?
Answer 1:
The HTML of the page you posted contains multiple <thead> and <tbody> tags, which confuses pandas.read_html.
Following this SO thread, you can manually unwrap those tags:
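from io import StringIO
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

url = "http://kenpom.com/index.php?y=2014"
html_table = urllib.request.urlopen(url).read()

# Fix the HTML
soup = BeautifulSoup(html_table, "html.parser")
# Note: the id "ratings-table" is specific to this page
for table in soup.find_all(attrs={"id": "ratings-table"}):
    # Copy the children into a list first, because unwrap() modifies the tree
    for c in list(table.children):
        if c.name in ["tbody", "thead"]:
            c.unwrap()

# Wrap the HTML in StringIO: newer pandas versions deprecate raw HTML strings
df = pd.read_html(StringIO(str(soup)), flavor="bs4")
len(df[0])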
which returns 369.
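To see what the unwrapping step does without fetching the page, here is a minimal sketch on a made-up table. The HTML string and the helper name flatten_sections are hypothetical and only for illustration. Only the id "ratings-table" comes from the answer above. The sketch assumes the rows are split across repeated <thead> and <tbody> sections, as described for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one table whose rows are split
# across repeated <thead>/<tbody> sections.
HTML = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>A</td></tr><tr><td>2</td><td>B</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>3</td><td>C</td></tr></tbody>
</table>
"""


def flatten_sections(html, table_id):
    """Return a soup in which the table's <thead>/<tbody> wrappers are removed."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    # find_all returns a list, so unwrapping inside the loop does not
    # disturb the iteration.
    for section in table.find_all(["thead", "tbody"], recursive=False):
        section.unwrap()
    return soup


soup = flatten_sections(HTML, "ratings-table")
table = soup.find("table", id="ratings-table")

# After unwrapping, every <tr> is a direct child of <table>.
direct_rows = table.find_all("tr", recursive=False)
cells = [td.get_text() for td in table.find_all("td")]
print(len(direct_rows), cells)
```

The cleaned markup in str(soup) can then be passed to pd.read_html as in the answer. The repeated header rows are still present as ordinary rows, so they may need to be filtered out of the resulting DataFrame.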
Source: https://stackoverflow.com/questions/42225204/use-pandas-to-get-multiple-tables-from-webpage