问题
I am interested in analyzing the balance, income and cash flow statements from Yahoo Finance for multiple tickers using R.
I have seen that there are R packages that pull information from Yahoo Finance, but all the examples I have seen concern historical stock price information. Is there a way I can pull historical information from these statements using R?
For example, for Apple (AAPL) the retrievable links are as follows:
- https://finance.yahoo.com/quote/AAPL/financials?p=AAPL
- https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL
- https://finance.yahoo.com/quote/AAPL/cash-flow?p=AAPL
In essence, the goal is to create three data-frames (AAPL_cashflow
, AAPL_income
& AAPL_balance
) that consists the same pattern as on the website. Each row is identified by the type of financial and the columns are the date.
Does anybody have any experience with parsing and scraping tables? I think that rvest
can help with this, right?
Thanks in advance!
回答1:
With a handful of packages from the tidyverse
, this should get you started:
library(tidyverse)
library(rvest)
"https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>%
read_html() %>%
html_table() %>%
map_df(bind_cols) %>%
as_tibble()
# A tibble: 28 x 5 X1 X2 X3 X4 X5 <chr> <chr> <chr> <chr> <chr> 1 Revenue 9/30/2017 9/24/2016 9/26/2015 9/27/20… 2 Total Revenue 229,234,000 215,639,000 233,715,000 182,795… 3 Cost of Revenue 141,048,000 131,376,000 140,089,000 112,258… 4 Gross Profit 88,186,000 84,263,000 93,626,000 70,537,… 5 Operating Expenses Operating Expenses Operating Expenses Operating Expenses Operati… 6 Research Development 11,581,000 10,045,000 8,067,000 6,041,0… 7 Selling General and Administrative 15,261,000 14,194,000 14,329,000 11,993,… 8 Non Recurring - - - - 9 Others - - - - 10 Total Operating Expenses 167,890,000 155,615,000 162,485,000 130,292… # ... with 18 more rows
Note that if you want to take the first row and treat it as a column name, add header = TRUE
to the html_table
call. This will give you the dates as column names in the finances
data frame for example.
Additionally, there are multiple tables inside this data frame so you will need to reshape it in order to play with the data. For example, var X2
through X5
are currently character when they should be numeric type.
One example might be:
finances <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL" %>%
read_html() %>%
html_table(header = TRUE) %>%
map_df(bind_cols) %>%
as_tibble()
finances %>%
mutate_all(funs(str_replace_all(., ",", ""))) %>%
mutate_all(funs(str_replace(., "-", NA_character_))) %>%
mutate_at(vars(-Revenue), funs(str_remove_all(., "[a-zA-Z]"))) %>%
mutate_at(vars(-Revenue), funs(as.numeric)) %>%
drop_na()
# A tibble: 14 x 5 Revenue `9/30/2017` `9/24/2016` `9/26/2015` `9/27/2014` <chr> <dbl> <dbl> <dbl> <dbl> 1 Total Revenue 229234000. 215639000. 233715000. 182795000. 2 Cost of Revenue 141048000. 131376000. 140089000. 112258000. 3 Gross Profit 88186000. 84263000. 93626000. 70537000. 4 Research Development 11581000. 10045000. 8067000. 6041000. 5 Selling General and Administrative 15261000. 14194000. 14329000. 11993000. 6 Total Operating Expenses 167890000. 155615000. 162485000. 130292000. 7 Operating Income or Loss 61344000. 60024000. 71230000. 52503000. 8 Total Other Income/Expenses Net 2745000. 1348000. 1285000. 980000. 9 Earnings Before Interest and Taxes 61344000. 60024000. 71230000. 52503000. 10 Income Before Tax 64089000. 61372000. 72515000. 53483000. 11 Income Tax Expense 15738000. 15685000. 19121000. 13973000. 12 Net Income From Continuing Ops 48351000. 45687000. 53394000. 39510000. 13 Net Income 48351000. 45687000. 53394000. 39510000. 14 Net Income Applicable To Common Shares 48351000. 45687000. 53394000. 39510000.
We could go a step further and make the data frame more "tidy" using gather
:
finances %>%
mutate_all(funs(str_replace_all(., ",", ""))) %>%
mutate_all(funs(str_replace(., "-", NA_character_))) %>%
mutate_at(vars(-Revenue), funs(str_remove_all(., "[a-zA-Z]"))) %>%
mutate_at(vars(-Revenue), funs(as.numeric)) %>%
drop_na() %>%
gather(key = "date", value, -Revenue) %>%
mutate(date = lubridate::mdy(date)) %>%
rename("var" = Revenue) %>%
as_tibble()
# A tibble: 56 x 3 var date value <chr> <date> <dbl> 1 Total Revenue 2017-09-30 229234000. 2 Cost of Revenue 2017-09-30 141048000. 3 Gross Profit 2017-09-30 88186000. 4 Research Development 2017-09-30 11581000. 5 Selling General and Administrative 2017-09-30 15261000. 6 Total Operating Expenses 2017-09-30 167890000. 7 Operating Income or Loss 2017-09-30 61344000. 8 Total Other Income/Expenses Net 2017-09-30 2745000. 9 Earnings Before Interest and Taxes 2017-09-30 61344000. 10 Income Before Tax 2017-09-30 64089000. # ... with 46 more rows
来源:https://stackoverflow.com/questions/49845173/how-to-use-r-to-scrape-financials-from-yahoo-finance