Question
I'm trying to scrape data from income statements on Yahoo Finance using Python. Specifically, let's say I want the most recent figure of Net Income of Apple.
The data is structured in a bunch of nested HTML-tables. I am using the requests module to access it and retrieve the HTML.
I am using BeautifulSoup 4 to sift through the HTML-structure, but I can't figure out how to get the figure.
Here is a screenshot of the analysis with Firefox.
My code so far:
from bs4 import BeautifulSoup
import requests
myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
I tried using
all_strong = soup.find_all("strong")
And then get the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this:
all_strong[16].parent.next_sibling
...
Of course, the goal is to use BeautifulSoup to search for the Name of the figure I need (in this case "Net Income") and then grab the figures themselves in the same row of the HTML-table.
I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.
SOLUTION / EXPANSION:
The solution by @wilbur below worked, and I expanded upon it to get the values for any figure available on any of the financials pages (i.e. Income Statement, Balance Sheet, and Cash Flow Statement) for any listed company. My function is as follows:
import re
import sys

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)
    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")
    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":    # a lone dash means "no value"
                str_value = "0"
            value = int(str_value) * 1000    # the page reports figures in thousands
            values.append(value)
    return values
The yahoo_figure variable is a string. It has to be exactly the same figure name as is used on Yahoo Finance.
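The string clean-up inside the loop can be looked at on its own. Below is a minimal sketch of that step as a separate helper; the name parse_thousands is hypothetical (it is not part of the function above), and the "(1,234)" input is a made-up example of a negative figure written in parentheses.

```python
def parse_thousands(cell_text):
    """Convert a table cell such as '19,121,000', '(1,234)' or '-' to an int.

    Assumes the page reports figures in thousands, as the function above does.
    """
    cleaned = cell_text.strip().replace(",", "").replace("(", "-").replace(")", "")
    if cleaned == "-":  # a lone dash is treated as zero
        return 0
    return int(cleaned) * 1000


print(parse_thousands("19,121,000"))  # 19121000000
print(parse_thousands("(1,234)"))     # -1234000
print(parse_thousands("-"))           # 0
```

Keeping the conversion in one small function makes it easy to check the edge cases (parentheses, lone dash) without fetching a page.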
To pass the soup variable, I use the following function first:
def financials_soup(ticker_symbol, statement="is", quarterly=False):
    if statement == "is" or statement == "bs" or statement == "cf":
        url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
        if not quarterly:
            url += "&annual"
        return BeautifulSoup(requests.get(url).text, "html.parser")
    return sys.exit("Invalid financial statement code '" + statement + "' passed.")
Sample usage -- I want to get the income tax expenses of Apple Inc. from the last available income statements:
print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))
Output: [19121000000, 13973000000, 13118000000]
You could also get the date of the end of the period from the soup and create a dictionary where the dates are the keys and the figures are the values, but this would make this post too long.
So far this seems to work for me, but I am always thankful for constructive criticism.
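The date-keyed dictionary mentioned above could be sketched roughly like this. It assumes the period end dates have already been extracted from the soup as a list of strings in the same order as the figures; how to locate them on the page is not shown here. The function name figures_by_period is hypothetical and the dates are illustrative placeholders, not values read from Yahoo Finance.

```python
def figures_by_period(period_end_dates, values):
    """Pair each period end date with the figure for that period."""
    if len(period_end_dates) != len(values):
        raise ValueError("Expected one value per period end date.")
    return dict(zip(period_end_dates, values))


# Placeholder dates (assumed, not scraped); values are the sample output above.
dates = ["2015-09-26", "2014-09-27", "2013-09-28"]
values = [19121000000, 13973000000, 13118000000]

by_period = figures_by_period(dates, values)
print(by_period["2015-09-26"])  # 19121000000
```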
Answer 1:
This is made a little more difficult because the "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:
import re, requests
from bs4 import BeautifulSoup
url = 'https://finance.yahoo.com/q/is?s=AAPL&annual'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pattern = re.compile('Net Income')
title = soup.find('strong', text=pattern)
row = title.parent.parent # yes, yes, I know it's not the prettiest
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income'
values = [ c.text.strip() for c in cells ]
values, in this case, will contain the three table cells in that "Net Income" row (and, I might add, can easily be converted to ints - I just liked that they kept the ',' as strings).
In [10]: values
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']
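For completeness, a small sketch of the conversion to ints, using the three strings from the output above as input. Note that the results stay in the units shown on the page (the question's own function multiplies by 1000 because the page reports thousands).

```python
values = ['53,394,000', '39,510,000', '37,037,000']

# Drop the thousands separators, then convert each string to an int.
int_values = [int(v.replace(',', '')) for v in values]

print(int_values)  # [53394000, 39510000, 37037000]
```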
When I tested it on Alphabet (GOOG), it didn't work, I believe because Yahoo doesn't display an Income Statement for them (https://finance.yahoo.com/q/is?s=GOOG&annual), but when I checked Facebook (FB), the values were returned correctly (https://finance.yahoo.com/q/is?s=FB&annual).
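When the row is missing, soup.find returns None and title.parent raises an AttributeError. Here is a minimal sketch of a lookup that returns None instead. It runs against a small hand-written HTML sample, which is only an assumption about the table layout (label in the first <td>, bold rows wrapped in <strong>), not a copy of the real page; row_values is a hypothetical helper name. It also uses find_parent("tr") rather than chaining .parent, so the same code handles bold and non-bold rows.

```python
import re
from bs4 import BeautifulSoup

# Hand-written sample (assumed layout); figures taken from the outputs in this thread.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Income Tax Expense</td>
    <td>19,121,000</td><td>13,973,000</td><td>13,118,000</td>
  </tr>
  <tr>
    <td><strong>Net Income</strong></td>
    <td><strong>53,394,000</strong></td>
    <td><strong>39,510,000</strong></td>
    <td><strong>37,037,000</strong></td>
  </tr>
</table>
"""


def row_values(soup, label):
    """Return the cell texts of the row labelled `label`, or None if absent."""
    pattern = re.compile(re.escape(label))
    title = soup.find(["strong", "td"], string=pattern)
    if title is None:  # e.g. the page shows no such statement or figure
        return None
    row = title.find_parent("tr")
    return [td.get_text(strip=True) for td in row.find_all("td")[1:]]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
net_income = row_values(soup, "Net Income")
tax = row_values(soup, "Income Tax Expense")
missing = row_values(soup, "Gross Profit")
print(net_income, tax, missing)
```

Because the pattern is searched rather than matched exactly, a label like "Net Income" would also hit a longer label that contains it, so on a real page the first match may need checking.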
If you wanted to create a more dynamic script, you could use string formatting to format the url with whatever stock symbol you want, like this:
ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)
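Wrapped in a function, the same formatting can be reused for several symbols. This is only a sketch: income_statement_url is a hypothetical helper name, and the URL pattern is the one used throughout this thread, which may not stay valid if Yahoo changes its site.

```python
URL_TEMPLATE = 'https://finance.yahoo.com/q/is?s={}&annual'


def income_statement_url(ticker_symbol):
    """Build the annual income statement URL for one ticker symbol."""
    return URL_TEMPLATE.format(ticker_symbol)


# Build one URL per symbol; fetching and parsing would follow as shown earlier.
urls = {symbol: income_statement_url(symbol) for symbol in ['AAPL', 'FB']}
for symbol, url in urls.items():
    print(symbol, url)
```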
Source: https://stackoverflow.com/questions/35439105/scrape-yahoo-finance-income-statement-with-python