R: web scraping yahoo.finance after 2019 change

暗喜 2020-12-17 05:48

I have been happily web scraping yahoo.finance pages for a long time using code largely borrowed from other Stack Overflow answers, and it has worked great; however, in the last few days it has stopped working.

2 Answers
  • 2020-12-17 06:02

    This may seem a little roundabout, but I wanted to avoid much of what I suspect is dynamic on the page (e.g. many of the classNames) and provide something with a slightly longer shelf-life.

    Your code is failing, in part, because there is no longer a table element housing that data. Instead, you can gather the "rows" of the desired output table using the more stable-looking fi-row class attribute. Within each row you can then gather the columns by matching on elements that have either a title attribute or data-test='fin-col', scoped to the parent row node.

    I use regex to match the dates (as these change over time) and combine them with the two static headers to provide the final dataframe headers for the output. I limit the regex to a single node's text which I know should contain only the required dates as matches.
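    For illustration, here is that date-matching step in isolation (a minimal sketch; the sample text below is made up, as the real node text is much longer):

    library(stringr)
    
    # Hypothetical fragment of the '#Col1-3-Financials-Proxy' node text
    sample_text <- "Breakdown TTM 9/30/2019 9/30/2018 9/30/2017 9/30/2016"
    
    # Full-match column of every d/m/yyyy-style hit, in page order
    str_match_all(sample_text, '\\d{1,2}/\\d{1,2}/\\d{4}')[[1]][, 1]
    #> [1] "9/30/2019" "9/30/2018" "9/30/2017" "9/30/2016"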


    R:

    library(rvest)
    library(stringr)
    library(magrittr)
    
    page <- read_html('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
    
    # Each output row lives in an element with the more stable fi-row class
    nodes <- page %>% html_nodes(".fi-row")
    df <- NULL
    
    for(i in nodes){
      # Within a row, cells carry either a title attribute or data-test='fin-col'
      r <- list(i %>% html_nodes("[title],[data-test='fin-col']") %>% html_text())
      df <- rbind(df, as.data.frame(matrix(r[[1]], ncol = length(r[[1]]), byrow = TRUE), stringsAsFactors = FALSE))
    }
    
    # The dates appear in this container's text, so the regex only picks up the column dates
    matches <- str_match_all(page %>% html_node('#Col1-3-Financials-Proxy') %>% html_text(), '\\d{1,2}/\\d{1,2}/\\d{4}')
    headers <- c('Breakdown', 'TTM', matches[[1]][,1])
    names(df) <- headers
    View(df)
    

    Sample output (screenshot not reproduced here).

    Py:

    import requests, re
    import pandas as pd
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
    soup = bs(r.content, 'lxml')
    results = []
    
    # Same idea as the R version: rows via .fi-row, cells via title/data-test
    for row in soup.select('.fi-row'):
        results.append([i.text for i in row.select('[title],[data-test="fin-col"]')])
    
    # The scraped dates supply the variable part of the column headers
    p = re.compile(r'\d{1,2}/\d{1,2}/\d{4}')
    headers = ['Breakdown', 'TTM']
    headers.extend(p.findall(soup.select_one('#Col1-3-Financials-Proxy').text))
    df = pd.DataFrame(results, columns=headers)
    print(df)
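
    With either version, the cell values come back as display strings (e.g. "265,595", or "-" for missing figures). If you want numbers, a possible post-processing step in R is below (a sketch; it assumes the first column holds the Breakdown labels and that a comma is the only thousands separator):

    # Convert every column except the first from "123,456"-style strings to numeric;
    # unparseable cells such as "-" become NA (with a warning)
    df[-1] <- lapply(df[-1], function(x) as.numeric(gsub(",", "", x)))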
    
  • 2020-12-17 06:13

    As mentioned in the comment above, here is an alternative that tries to deal with the different table sizes Yahoo publishes. I worked on this with help from a friend.

    library(rvest)
    library(tidyverse)
    
    url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"
    
    # Download the data
    raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")
    
    number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()
    
    if(number_of_columns > 1){
      # Create an empty data frame with the required dimensions
      df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                       stringsAsFactors = FALSE)
    
      # Fill the table, looping through rows
      for (i in 1:length(raw_table)) {
        # Find the row name and set it
        df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
        # Now grab the values
        row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
        for (j in 1:(number_of_columns - 1)) {
          df[i, j + 1] <- row_values[j] %>% html_text()
        }
      }
    }
    
    View(df)
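
    The columns end up with the default X1..Xn names; one way to label them is to borrow the date-regex idea from the answer above (a sketch; it assumes the first two columns are Breakdown and TTM, which holds for the income statement but not necessarily for every statement, and it re-reads the page for simplicity):

    library(stringr)
    
    # Pull the period-end dates from the financials container, as in the first answer
    dates <- str_match_all(read_html(url) %>% html_node('#Col1-3-Financials-Proxy') %>% html_text(),
                           "\\d{1,2}/\\d{1,2}/\\d{4}")[[1]][, 1]
    # Trim the header vector in case this statement has fewer columns
    names(df) <- c("Breakdown", "TTM", dates)[seq_len(ncol(df))]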
    