How to use R to scrape financials from Yahoo Finance

Submitted by 删除回忆录丶 on 2020-01-23 02:44:38

Question


I am interested in analyzing the balance sheet, income statement, and cash flow statement from Yahoo Finance for multiple tickers using R.

I have seen that there are R packages that pull information from Yahoo Finance, but all the examples I have seen concern historical stock price information. Is there a way I can pull historical information from these statements using R?

For example, for Apple (AAPL) the retrievable links are as follows:

  • https://finance.yahoo.com/quote/AAPL/financials?p=AAPL
  • https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL
  • https://finance.yahoo.com/quote/AAPL/cash-flow?p=AAPL

In essence, the goal is to create three data frames (AAPL_cashflow, AAPL_income & AAPL_balance) that follow the same layout as the website: each row is identified by the financial line item, and the columns are the dates.

Does anybody have any experience with parsing and scraping tables? I think that rvest can help with this, right?

Thanks in advance!


Answer 1:


With a handful of packages from the tidyverse, this should get you started:

library(tidyverse)
library(rvest)

"https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>% 
  read_html() %>% 
  html_table() %>% 
  map_df(bind_cols) %>% 
  as_tibble()
# A tibble: 28 x 5
   X1                                 X2                 X3                 X4                 X5      
   <chr>                              <chr>              <chr>              <chr>              <chr>   
 1 Revenue                            9/30/2017          9/24/2016          9/26/2015          9/27/20…
 2 Total Revenue                      229,234,000        215,639,000        233,715,000        182,795…
 3 Cost of Revenue                    141,048,000        131,376,000        140,089,000        112,258…
 4 Gross Profit                       88,186,000         84,263,000         93,626,000         70,537,…
 5 Operating Expenses                 Operating Expenses Operating Expenses Operating Expenses Operati…
 6 Research Development               11,581,000         10,045,000         8,067,000          6,041,0…
 7 Selling General and Administrative 15,261,000         14,194,000         14,329,000         11,993,…
 8 Non Recurring                      -                  -                  -                  -       
 9 Others                             -                  -                  -                  -       
10 Total Operating Expenses           167,890,000        155,615,000        162,485,000        130,292…
# ... with 18 more rows

Note that if you want to use the first row as the column names, add header = TRUE to the html_table() call. This gives you the dates as column names, as in the finances data frame in the example that follows.
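To see what header = TRUE changes without hitting the network, here is a minimal sketch that parses a small inline table. The HTML string is a made-up stand-in for the Yahoo page (only the values are copied from the output above), so it says nothing about what the live page currently returns.

```r
library(rvest)

# Made-up HTML standing in for the scraped page (assumption: a plain <table>)
html <- "<table>
  <tr><td>Revenue</td><td>9/30/2017</td><td>9/24/2016</td></tr>
  <tr><td>Total Revenue</td><td>229,234,000</td><td>215,639,000</td></tr>
  <tr><td>Cost of Revenue</td><td>141,048,000</td><td>131,376,000</td></tr>
</table>"

page <- read_html(html)

# html_table() returns one data frame per <table>; take the first
without_header <- html_table(page, header = FALSE)[[1]]
with_header    <- html_table(page, header = TRUE)[[1]]

names(without_header)  # generic names; the date row stays in the data
names(with_header)     # first row promoted to column names
```

With header = TRUE the date row no longer appears as data, so the result has one row fewer.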

Additionally, the page contains multiple tables that end up stacked in this one data frame, so you will need to reshape it before working with the data. For example, columns X2 through X5 are currently character when they should be numeric.

One example might be:

finances <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>% 
  read_html() %>% 
  html_table(header = TRUE) %>% 
  map_df(bind_cols) %>% 
  as_tibble()

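# Strip thousands separators, blank out "-" placeholders, then convert to numeric.
finances %>% 
  mutate_all(~ str_replace_all(.x, ",", "")) %>% 
  # an NA replacement turns the whole cell into NA; this matches any cell containing "-"
  mutate_all(~ str_replace(.x, "-", NA_character_)) %>%
  # section-header rows hold only text, so they become "" and then NA
  mutate_at(vars(-Revenue), ~ str_remove_all(.x, "[a-zA-Z]")) %>% 
  mutate_at(vars(-Revenue), as.numeric) %>% 
  drop_na()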
# A tibble: 14 x 5
   Revenue                                `9/30/2017` `9/24/2016` `9/26/2015` `9/27/2014`
   <chr>                                        <dbl>       <dbl>       <dbl>       <dbl>
 1 Total Revenue                           229234000.  215639000.  233715000.  182795000.
 2 Cost of Revenue                         141048000.  131376000.  140089000.  112258000.
 3 Gross Profit                             88186000.   84263000.   93626000.   70537000.
 4 Research Development                     11581000.   10045000.    8067000.    6041000.
 5 Selling General and Administrative       15261000.   14194000.   14329000.   11993000.
 6 Total Operating Expenses                167890000.  155615000.  162485000.  130292000.
 7 Operating Income or Loss                 61344000.   60024000.   71230000.   52503000.
 8 Total Other Income/Expenses Net           2745000.    1348000.    1285000.     980000.
 9 Earnings Before Interest and Taxes       61344000.   60024000.   71230000.   52503000.
10 Income Before Tax                        64089000.   61372000.   72515000.   53483000.
11 Income Tax Expense                       15738000.   15685000.   19121000.   13973000.
12 Net Income From Continuing Ops           48351000.   45687000.   53394000.   39510000.
13 Net Income                               48351000.   45687000.   53394000.   39510000.
14 Net Income Applicable To Common Shares   48351000.   45687000.   53394000.   39510000.
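The cleaning step can also be tried offline in base R. The sketch below uses a toy data frame: three rows reuse values from the output above, and the "Hypothetical Loss" row is invented purely to show a negative number. The helper name to_number is an assumption, not part of any package. Unlike the pipeline above, it treats only a cell that is exactly "-" as missing, so negative values survive.

```r
# Toy data shaped like the scraped table; "Hypothetical Loss" is invented
raw <- data.frame(
  Revenue     = c("Total Revenue", "Operating Expenses", "Non Recurring",
                  "Gross Profit", "Hypothetical Loss"),
  `9/30/2017` = c("229,234,000", "Operating Expenses", "-", "88,186,000", "-1,500"),
  `9/24/2016` = c("215,639,000", "Operating Expenses", "-", "84,263,000", "-2,500"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: character cells -> numeric, placeholders -> NA
to_number <- function(x) {
  x <- gsub(",", "", x, fixed = TRUE)  # drop thousands separators
  x[x == "-"] <- NA                    # only a lone dash means "no value"
  suppressWarnings(as.numeric(x))      # text-only cells become NA
}

clean <- raw
clean[-1] <- lapply(raw[-1], to_number)  # convert every column but the label
clean <- clean[complete.cases(clean), ]  # drop header and placeholder rows
clean
```

The section-header row and the "-" row are dropped, while the row with negative values is kept.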

We could go a step further and make the data frame more "tidy" using gather:

# Same cleaning as before, then reshape to one row per (line item, date).
finances %>% 
  mutate_all(~ str_replace_all(.x, ",", "")) %>% 
  mutate_all(~ str_replace(.x, "-", NA_character_)) %>%
  mutate_at(vars(-Revenue), ~ str_remove_all(.x, "[a-zA-Z]")) %>% 
  mutate_at(vars(-Revenue), as.numeric) %>% 
  drop_na() %>% 
  # wide to long: the date column names become values of a "date" column
  gather(key = "date", value, -Revenue) %>% 
  mutate(date = lubridate::mdy(date)) %>% 
  rename("var" = Revenue) %>% 
  as_tibble()
# A tibble: 56 x 3
   var                                date            value
   <chr>                              <date>          <dbl>
 1 Total Revenue                      2017-09-30 229234000.
 2 Cost of Revenue                    2017-09-30 141048000.
 3 Gross Profit                       2017-09-30  88186000.
 4 Research Development               2017-09-30  11581000.
 5 Selling General and Administrative 2017-09-30  15261000.
 6 Total Operating Expenses           2017-09-30 167890000.
 7 Operating Income or Loss           2017-09-30  61344000.
 8 Total Other Income/Expenses Net    2017-09-30   2745000.
 9 Earnings Before Interest and Taxes 2017-09-30  61344000.
10 Income Before Tax                  2017-09-30  64089000.
# ... with 46 more rows
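Since the question asks about multiple tickers and all three statements, the URLs could be generated rather than typed by hand. This is a minimal sketch that only builds the addresses, following the pattern of the three AAPL links in the question. The helper name statement_urls and the second ticker MSFT are illustrative assumptions, and it makes no claim about what the pages currently return.

```r
# Hypothetical helper: one row per ticker/statement combination
statement_urls <- function(tickers) {
  # statement label -> path segment used in the question's links
  pages <- c(income = "financials",
             balance = "balance-sheet",
             cashflow = "cash-flow")
  grid <- expand.grid(ticker = tickers, statement = names(pages),
                      stringsAsFactors = FALSE)
  grid$url <- sprintf("https://finance.yahoo.com/quote/%s/%s?p=%s",
                      grid$ticker, pages[grid$statement], grid$ticker)
  grid
}

urls <- statement_urls(c("AAPL", "MSFT"))
urls
```

Each url could then be passed through the same read_html() pipeline shown above, ideally with a pause such as Sys.sleep() between requests.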


Source: https://stackoverflow.com/questions/49845173/how-to-use-r-to-scrape-financials-from-yahoo-finance
