I've written a function to grab and parse news data from Google for a given stock symbol, but I'm sure there are ways it could be improved. For starters, my function returns
Here is a shorter (and probably more efficient) version of your getNews function:
getNews2 <- function(symbol, number){
    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', '&start=', 1,
                   '&num=', number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, convert each item to a one-row
    # data frame and stack them into a single data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, function(node){
        as.data.frame(xmlToList(node), stringsAsFactors = FALSE)
    })

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    mydf$pubDate = strptime(mydf$pubDate,
                            format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_York')

    # drop guid.text and guid..attrs
    mydf$guid.text = mydf$guid..attrs = NULL
    return(mydf)
}
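
You would call it with a ticker symbol and the number of items you want; for example (assuming the Google Finance RSS feed responds, and with WMT and 20 as arbitrary choices):

# fetch the 20 most recent news items for Walmart and take a quick look
wmt.news = getNews2('WMT', 20)
str(wmt.news)
head(wmt.news$pubDate)  # should now be in the America/New_York time zone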
Moreover, there might be a bug in your code: when I tried it with symbol = 'WMT' it returned an error, whereas getNews2 handles WMT fine. Check it out and let me know if it works for you.
PS. The description column still contains HTML markup, but it should be easy to extract the text from it. I will post an update when I find the time.
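
In the meantime, a minimal sketch of one way to do it, assuming every description entry is a parseable HTML fragment, would be to add something like this just before the return(mydf) line:

# strip html tags from the description column by parsing each entry
# with htmlParse and keeping only its text content
# (assumes every entry is a parseable html fragment)
mydf$description = sapply(mydf$description, function(d){
    xmlValue(xmlRoot(htmlParse(d, asText = TRUE)))
}, USE.NAMES = FALSE)

This only uses functions from the XML package, which getNews2 already loads.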