Question
I am trying to extract HTML tables from the URL in the code below.
For example, the 2019 Director Compensation Table on page 44. I believe the table doesn't have a specific id, such as 'Compensation Table'. The only approach I can think of is to match column names or keywords such as "Stock Awards" or "All Other Compensation" and then grab the associated table.
Is there an easy way to extract these tables based on column names? Or maybe an easier way?
Thanks!
I am relatively new to scraping HTML tables. My code is as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
rows = soup.find_all('tr')
Answer 1:
Sure, you can do that with the pandas read_html function, using its match and attrs parameters as described in the documentation.
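import pandas as pd
# read_html returns a list of DataFrames: match keeps tables containing the text,
# attrs keeps tables whose <table> tag has these attribute values.
df = pd.read_html(
    "https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm",
    match="Non-Employee Directors",
    attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'},
)
print(df)
df[0].to_csv("data.csv", index=False, header=False)  # save the first matching table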
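If you would rather select tables by their column headers, as the question asks, one option is to filter the list that read_html returns. The following is only a minimal sketch: the helper name tables_with_columns is hypothetical, the two small DataFrames are made-up stand-ins for parsed tables, and it assumes the header row was actually parsed into the column labels (in some filings the header text may end up in the first data rows instead, in which case the check would have to look at those rows).

```python
import pandas as pd


def tables_with_columns(tables, required):
    """Return the DataFrames whose column labels contain every name in `required`.

    Hypothetical helper: matching is case-insensitive and by substring.
    """
    wanted = [name.lower() for name in required]
    matches = []
    for table in tables:
        # Flatten each column label (a plain value or a MultiIndex tuple)
        # into a single lowercase string.
        labels = [
            " ".join(str(part) for part in (col if isinstance(col, tuple) else (col,))).lower()
            for col in table.columns
        ]
        # Keep the table only if every wanted name appears in some label.
        if all(any(name in label for label in labels) for name in wanted):
            matches.append(table)
    return matches


# Made-up stand-ins for the list that pd.read_html(...) would return.
directors = pd.DataFrame(
    {"Name": ["A. Director"], "Stock Awards ($)": [100], "All Other Compensation ($)": [5]}
)
other = pd.DataFrame({"Item": ["x"], "Value": [1]})

found = tables_with_columns([directors, other], ["Stock Awards", "All Other Compensation"])
print(len(found))
```

With real data you would pass the list returned by pd.read_html in place of the two toy DataFrames.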
Source: https://stackoverflow.com/questions/60979366/extract-html-table-based-on-specific-column-headers-python