Question
I am trying to scrape the CDC website for the COVID-19 cases reported in the last 7 days (https://covid.cdc.gov/covid-data-tracker/#cases_casesinlast7days). I've tried to find the table by name, id, and class, and the lookup always returns None. When I print the scraped HTML, I can't locate the table in it manually either. I'm not sure what I'm doing wrong here. Once the data is imported, I need to populate a pandas DataFrame to use later for graphing, and export the table as a CSV.
Answer 1:
You might as well request the data from the API directly (check the Network tab in your browser while refreshing the page):
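import requests
import pandas as pd
# JSON endpoint that the page itself calls (visible in the browser's Network tab)
endpoint = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"
# The "id" query parameter selects which dataset the endpoint returns
data = requests.get(endpoint, params={"id": "US_MAP_DATA"}).json()
# The records sit under the same key as the requested id
df = pd.DataFrame(data["US_MAP_DATA"])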
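The question also asks for a CSV export, so here is a minimal sketch of the DataFrame and CSV steps. It assumes the payload holds a list of records under the requested key, as in the snippet above. The helper name `payload_to_frame` and the sample field names are made up for illustration and are not the real CDC column names.

```python
import pandas as pd


def payload_to_frame(payload, key="US_MAP_DATA"):
    """Build a DataFrame from the list of records stored under `key`."""
    records = payload.get(key)
    if not records:
        raise ValueError(f"no records found under {key!r}")
    return pd.DataFrame(records)


# Made-up sample shaped like a list-of-records payload;
# the field names are placeholders, not actual CDC columns.
sample_payload = {
    "US_MAP_DATA": [
        {"region": "A", "cases_last_7_days": 120},
        {"region": "B", "cases_last_7_days": 45},
    ]
}

df = payload_to_frame(sample_payload)

# With no path argument, to_csv returns the CSV as a string;
# pass a file name such as "cases.csv" to write a file instead.
csv_text = df.to_csv(index=False)
```

With the real response, `payload_to_frame(data)` would take the place of the sample call, and `df.to_csv("cases.csv", index=False)` would write the file to disk.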
EDIT: Trying to make this answer more general and useful.
How did you discern that this was how to parse the data?
First, inspect the page (Ctrl + Shift + I) and navigate to the Network tab.
Second, refresh the page to record network activity.
Where to look?
Filter by XHR to limit the number of records;
Look through the records by clicking on them, and check each one's Preview response to find out whether it contains the data you need.
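Once a response has been saved or fetched, the same "is this the data I need?" check can be roughed out in Python. The sketch below assumes the response is a JSON object and treats any top-level key that holds a non-empty list of dicts as a candidate table. The helper `find_record_lists` and the sample payload are hypothetical.

```python
def find_record_lists(payload):
    """Return top-level keys whose value is a non-empty list of dicts."""
    if not isinstance(payload, dict):
        return []
    return [
        key
        for key, value in payload.items()
        if isinstance(value, list)
        and value
        and all(isinstance(item, dict) for item in value)
    ]


# Made-up payload: one table-like entry plus unrelated metadata.
sample = {
    "runid": 1,
    "US_MAP_DATA": [{"region": "A", "cases": 1}],
    "notes": ["text only"],
}

candidates = find_record_lists(sample)
```

Each key in `candidates` can then be passed to `pd.DataFrame(payload[key])` for a closer look.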
This doesn't always work, but when it does, pulling the data from the API directly is much easier than writing a scraper with requests / bs4 / selenium, and it should be the first thing you try.
Source: https://stackoverflow.com/questions/64406533/python-how-do-i-use-beautifulsoup-to-parse-a-table-into-a-pandas-dataframe