I've written a function to grab and parse news data from Google for a given stock symbol, but I'm sure there are ways it could be improved. For starters, my function returns
Here is a shorter (and probably more efficient) version of your getNews function:
getNews2 <- function(symbol, number){
    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', '&start=', 1,
                   '&num=', number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, convert each item to a one-row
    # data frame and stack them into a single data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, function(node){
        as.data.frame(xmlToList(node), stringsAsFactors = FALSE)
    })

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    mydf$pubDate = strptime(mydf$pubDate,
                            format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_York')

    # drop guid.text and guid..attrs
    mydf$guid.text = mydf$guid..attrs = NULL
    return(mydf)
}
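
You would call it with a ticker symbol and the number of items you want; for example (assuming the Google Finance RSS feed responds, and with WMT and 20 as arbitrary choices):

# fetch the 20 most recent news items for Walmart and take a quick look
wmt.news = getNews2('WMT', 20)
str(wmt.news)
head(wmt.news$pubDate)  # should now be in the America/New_York time zone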
Moreover, there might be a bug in your code: when I tried it with symbol = 'WMT' it returned an error, whereas getNews2 handles WMT fine. Check it out and let me know if it works for you.
PS. The description column still contains HTML markup, but it should be easy to extract the text from it. I will post an update when I find the time.
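
In the meantime, a minimal sketch of one way to do it, assuming every description entry is a parseable HTML fragment, would be to add something like this just before the return(mydf) line:

# strip html tags from the description column by parsing each entry
# with htmlParse and keeping only its text content
# (assumes every entry is a parseable html fragment)
mydf$description = sapply(mydf$description, function(d){
    xmlValue(xmlRoot(htmlParse(d, asText = TRUE)))
}, USE.NAMES = FALSE)

This only uses functions from the XML package, which getNews2 already loads.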