R: web scraping yahoo.finance after 2019 change

暗喜 2020-12-17 05:48

I have been happily web scraping yahoo.finance pages for a long time using code largely borrowed from other Stack Overflow answers, and it has worked great; however, in the last few days it has stopped working.

2 Answers
  • 2020-12-17 06:02

    This may seem a little roundabout, but I wanted to avoid much of what I suspect is dynamic on the page (e.g. many of the classNames) and provide something with a slightly longer shelf-life.

    Your code is failing, in part, because there is no longer a table element housing that data. Instead, you can gather the "rows" of the desired output table using the more stable-looking fi-row class attribute. Within each row you can then gather the columns by matching on elements that have either a title attribute or data-test='fin-col', scoped to the parent row node.

    I use regex to match the dates (as these change over time) and combine them with the two static headers to provide the final dataframe headers for the output. I limit the regex to a single node's text which I know should contain only the required dates as matches.
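    For illustration, here is that date-matching step in isolation (a minimal sketch; the sample text below is made up, as the real node text is much longer):

    library(stringr)
    
    # Hypothetical fragment of the '#Col1-3-Financials-Proxy' node text
    sample_text <- "Breakdown TTM 9/30/2019 9/30/2018 9/30/2017 9/30/2016"
    
    # Full-match column of every d/m/yyyy-style hit, in page order
    str_match_all(sample_text, '\\d{1,2}/\\d{1,2}/\\d{4}')[[1]][, 1]
    #> [1] "9/30/2019" "9/30/2018" "9/30/2017" "9/30/2016"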


    R:

    library(rvest)
    library(stringr)
    library(magrittr)
    
    page <- read_html('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
    
    # Each output row lives in an element with the more stable fi-row class
    nodes <- page %>% html_nodes(".fi-row")
    df <- NULL
    
    for(i in nodes){
      # Within a row, cells carry either a title attribute or data-test='fin-col'
      r <- list(i %>% html_nodes("[title],[data-test='fin-col']") %>% html_text())
      df <- rbind(df, as.data.frame(matrix(r[[1]], ncol = length(r[[1]]), byrow = TRUE), stringsAsFactors = FALSE))
    }
    
    # The dates appear in this container's text, so the regex only picks up the column dates
    matches <- str_match_all(page %>% html_node('#Col1-3-Financials-Proxy') %>% html_text(), '\\d{1,2}/\\d{1,2}/\\d{4}')
    headers <- c('Breakdown', 'TTM', matches[[1]][,1])
    names(df) <- headers
    View(df)
    

    Sample output (screenshot not reproduced here).

    Py:

    import requests, re
    import pandas as pd
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
    soup = bs(r.content, 'lxml')
    results = []
    
    # Same idea as the R version: rows via .fi-row, cells via title/data-test
    for row in soup.select('.fi-row'):
        results.append([i.text for i in row.select('[title],[data-test="fin-col"]')])
    
    # The scraped dates supply the variable part of the column headers
    p = re.compile(r'\d{1,2}/\d{1,2}/\d{4}')
    headers = ['Breakdown', 'TTM']
    headers.extend(p.findall(soup.select_one('#Col1-3-Financials-Proxy').text))
    df = pd.DataFrame(results, columns=headers)
    print(df)
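
    With either version, the cell values come back as display strings (e.g. "265,595", or "-" for missing figures). If you want numbers, a possible post-processing step in R is below (a sketch; it assumes the first column holds the Breakdown labels and that a comma is the only thousands separator):

    # Convert every column except the first from "123,456"-style strings to numeric;
    # unparseable cells such as "-" become NA (with a warning)
    df[-1] <- lapply(df[-1], function(x) as.numeric(gsub(",", "", x)))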
    
  • 2020-12-17 06:13

    As mentioned in the comment above, here is an alternative that tries to deal with the different table sizes Yahoo publishes. I worked on this with help from a friend.

    library(rvest)
    library(tidyverse)
    
    url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"
    
    # Download the data
    raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")
    
    number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()
    
    if(number_of_columns > 1){
      # Create an empty data frame with the required dimensions
      df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                       stringsAsFactors = FALSE)
    
      # Fill the table, looping through rows
      for (i in 1:length(raw_table)) {
        # Find the row name and set it
        df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
        # Now grab the values
        row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
        for (j in 1:(number_of_columns - 1)) {
          df[i, j + 1] <- row_values[j] %>% html_text()
        }
      }
    }
    
    View(df)
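
    The columns end up with the default X1..Xn names; one way to label them is to borrow the date-regex idea from the answer above (a sketch; it assumes the first two columns are Breakdown and TTM, which holds for the income statement but not necessarily for every statement, and it re-reads the page for simplicity):

    library(stringr)
    
    # Pull the period-end dates from the financials container, as in the first answer
    dates <- str_match_all(read_html(url) %>% html_node('#Col1-3-Financials-Proxy') %>% html_text(),
                           "\\d{1,2}/\\d{1,2}/\\d{4}")[[1]][, 1]
    # Trim the header vector in case this statement has fewer columns
    names(df) <- c("Breakdown", "TTM", dates)[seq_len(ncol(df))]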
    