Improving a function to get stock news data from Google in R

广开言路 2021-02-01 11:18

I've written a function to grab and parse news data from Google for a given stock symbol, but I'm sure there are ways it could be improved. For starters, my function returns

1 answer
  •  傲寒 (original poster)
    2021-02-01 11:27

    Here is a shorter (and probably more efficient) version of your getNews function:

      getNews2 <- function(symbol, number){

        # load libraries
        library(XML); library(plyr); library(stringr); library(lubridate)

        # construct url to news feed rss and encode it correctly
        url.b1 = 'http://www.google.com/finance/company_news?q='
        url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
                   "&num=", number, sep = '')
        url    = URLencode(url)

        # parse xml tree, get item nodes, extract data and return data frame
        # (as.data.frame(xmlToList) relies on plyr's as.data.frame method for
        # functions, which is why the columns come back with a "value." prefix)
        doc   = xmlTreeParse(url, useInternalNodes = TRUE)
        nodes = getNodeSet(doc, "//item")
        mydf  = ldply(nodes, as.data.frame(xmlToList))

        # clean up names of data frame
        names(mydf) = str_replace_all(names(mydf), "value\\.", "")

        # convert pubDate to a POSIXct date-time and convert time zone
        mydf$pubDate = as.POSIXct(strptime(mydf$pubDate,
                         format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT'))
        mydf$pubDate = with_tz(mydf$pubDate, 'America/New_York')

        # drop guid.text and guid..attrs
        mydf$guid.text = mydf$guid..attrs = NULL

        return(mydf)
      }
    

    There may also be a bug in your code: when I tried it with symbol = 'WMT', it returned an error. I think getNews2 works fine for WMT too. Check it out and let me know if it works for you.

    PS. The description column still contains HTML markup, but it should be easy to extract the text from it. I will post an update when I find time.
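
One way the HTML could be stripped from the description column is sketched below. It is only a sketch, assuming the XML package that getNews2 already loads. strip_html is a hypothetical helper name, not something from the original code, and the sample strings are made up for illustration rather than real feed output.

```r
library(XML)

# Hypothetical helper (not part of the original answer): parse one HTML
# fragment and return its visible text as a single string.
strip_html <- function(html) {
  doc   <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
  # collect every text node under <body>, then drop whitespace-only ones
  parts <- trimws(unlist(xpathSApply(doc, "//body//text()", xmlValue)))
  parts <- parts[nzchar(parts)]
  paste(parts, collapse = " ")
}

# Made-up sample descriptions, for illustration only
descriptions <- c(
  "<div><b>Wal-Mart</b> raises outlook</div>",
  "<p>Shares &amp; bonds <i>rally</i></p>"
)
clean <- vapply(descriptions, strip_html, character(1), USE.NAMES = FALSE)
```

Inside getNews2 the same helper could be applied to mydf$description before returning. Wrapping the column in as.character() first is probably wise, since older R versions may store it as a factor. Joining the text nodes with a space keeps words from adjacent tags apart, at the cost of an occasional extra space around inline tags.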
