Trying to download Google Trends data but date parameter is ignored?

Posted by 纵然是瞬间 on 2020-01-13 04:29:25

Question


I am trying to download Google Trends data in csv format. For basic queries I have been successful (following a blog post by Christoph Riedl).

Problem: By default, trends are returned starting from January 2004. I would prefer trends starting from January 2011. However, when I add a date parameter to the URL request, it is completely ignored, and I'm not sure how to overcome this.

The following code will reproduce the issue.

# Just copy/paste this stuff - these are helper functions
require(RCurl)

# This gets the GALX cookie which we need to pass back with the login form
getGALX <- function(curl) {
  txt = basicTextGatherer()
  curlPerform( url=loginURL, curl=curl, writefunction=txt$update, header=TRUE, ssl.verifypeer=FALSE )

  tmp <- txt$value()

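  # keep only the Set-Cookie header line that carries the GALX token
  val <- grep("Cookie: GALX", strsplit(tmp, "\n")[[1]], value = TRUE)

  # the token is the third field when splitting on ":", "=" or ";"
  return( strsplit( val, "[:=;]")[[1]][3]) 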
}

# Function to perform Google login and get cookies ready
gLogin <- function(username, password) {
  ch <- getCurlHandle()

  ans <- (curlSetOpt(curl = ch,
                     ssl.verifypeer = FALSE,
                     useragent = getOption('HTTPUserAgent', "R"),
                     timeout = 60,         
                     followlocation = TRUE,
                     cookiejar = "./cookies",
                     cookiefile = ""))

  galx <- getGALX(ch)
  authenticatePage <- postForm(authenticateURL, .params=list(Email=username, Passwd=password, GALX=galx, PersistentCookie="yes", continue="http://www.google.com/trends"), curl=ch)

  authenticatePage2 <- getURL("http://www.google.com", curl=ch)

  if(getCurlInfo(ch)$response.code == 200) {
    print("Google login successful!")
  } else {
    print("Google login failed!")
  }
  return(ch)
}

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

get_interest_over_time <- function(res, clean.col.names = TRUE) {
  # remove all text before "Interest over time" data block begins
  data <- gsub(".*Interest over time", "", res)

  # remove all text after "Interest over time" data block ends
  data <- gsub("\n\n.*", "", data)

  # convert "interest over time" data block into data.frame
  data.df <- read.table(text = data, sep =",", header=TRUE)

  # Keep only the end-of-week date from each "start - end" date range
  data.df$Week <- gsub(".*\\s-\\s", "", data.df$Week)
  data.df$Week <- as.Date(data.df$Week)

  # clean column names
  if(clean.col.names == TRUE) colnames(data.df) <- gsub("\\.\\..*", "", colnames(data.df))

  # return "interest over time" data.frame
  return(data.df)
}
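
To make the parsing steps in get_interest_over_time concrete, here is a minimal sketch that runs the same steps on a made-up export string. The layout of sample_res (a title line, an "Interest over time" marker, weekly rows, then a blank line before the next block) is an assumption inferred from what the function expects, not a captured Google response.

```r
# Hypothetical export text; the real layout returned by Google may differ.
sample_res <- paste(
  "Web Search interest: ggplot2, ggplot",
  "",
  "Interest over time",
  "Week,ggplot2,ggplot",
  "2004-01-04 - 2004-01-10,0,0",
  "2004-01-11 - 2004-01-17,0,0",
  "",
  "Top regions for ggplot2",
  "Region,ggplot2",
  sep = "\n")

# 1. Drop everything up to and including the "Interest over time" marker.
#    (With R's default regex engine, "." also matches newlines.)
block <- gsub(".*Interest over time", "", sample_res)

# 2. Drop everything from the first blank line onwards.
block <- gsub("\n\n.*", "", block)

# 3. Read the remaining CSV block; the leading empty line is skipped.
df <- read.table(text = block, sep = ",", header = TRUE,
                 stringsAsFactors = FALSE)

# 4. Keep only the end-of-week date and convert it to a Date.
df$Week <- as.Date(gsub(".*\\s-\\s", "", df$Week))

print(df)
```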

In your browser, please log into Google (e.g. log into Gmail). Then, in R, run the following:

# Username and password
username <- "email@address"
password <- "password"

# Login and Authentication URLs
loginURL     <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"
trendsURL       <- "http://www.google.com/trends/TrendsRepport?"

# Google authentication
ch <- gLogin( username, password )
authenticatePage2 <- getURL("http://www.google.com", curl=ch)

The following successfully returns Google Trends data since January 2004 (i.e. with no date parameter):

res <- getForm(trendsURL, q="ggplot2, ggplot", content=1, export=1, graph="all_csv", curl=ch)
df <- get_interest_over_time(res)
head(df)

        Week ggplot2 ggplot
1 2004-01-10       0      0
2 2004-01-17       0      0
3 2004-01-24       0      0
4 2004-01-31       0      0
5 2004-02-07       0      0
6 2004-02-14       0      0

However, adding a date parameter to return trends starting in January 2013 is ignored:

res <- getForm(trendsURL, q="ggplot2, ggplot", date = "1/2013 11m", content=1, export=1, graph="all_csv", curl=ch)
df <- get_interest_over_time(res)
head(df)

        Week ggplot2 ggplot
1 2004-01-10       0      0
2 2004-01-17       0      0
3 2004-01-24       0      0
4 2004-01-31       0      0
5 2004-02-07       0      0
6 2004-02-14       0      0

NOTE 1: Same thing happens with the cat=category parameter. The above is just easier to show with date.

NOTE 2: As Google rescales the data depending on the start date, this is not a case of simply filtering the data.frame. I'm interested in why the date parameter is ignored.
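
To illustrate NOTE 2, here is a small sketch with made-up numbers. It assumes the index is scaled so that the peak inside the requested window equals 100; the exact procedure Google uses is not described here, so treat rescale() as a hypothetical stand-in.

```r
# Hypothetical raw search volumes for six consecutive periods (made-up data).
raw <- c(8, 16, 80, 40, 32, 20)

# Assumed normalisation: the largest value in the window becomes 100.
rescale <- function(x) 100 * x / max(x)

full_window <- rescale(raw)       # index computed over all six periods
filtered    <- full_window[4:6]   # full-range index, filtered afterwards
late_window <- rescale(raw[4:6])  # index requested for periods 4 to 6 only

print(filtered)
print(late_window)
```

Under this assumption the filtered values and the values for the shorter window differ, because the peak that defines 100 falls outside the shorter window. That is why subsetting the data.frame after the download is not equivalent to requesting a later start date.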

Thank you kindly for your time.


Answer 1:


It works if you write only a year:

res <- getForm(trendsURL, q="ggplot2, ggplot", date = "2013", content=1, export=1, graph="all_csv", curl=ch)

But I don't know how to add a month and day to the date. This is probably because on the Google Trends web page you select the time range from a list:

"Past 7 days", "Past 30 days",..., "2013", "2012",...

But if I try date="Past 90 days" it still doesn't work.




Answer 2:


I have had success getting monthly data by using the date specification date="2011-1" (January 2011). I viewed the source behind the page; maybe you can find answers there.

Please post again if you figure out the date specification.
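
As a rough sketch, the two date formats reported to work in this thread (a bare year such as "2013", and year-month such as "2011-1") could be built with a small helper. trends_date() is a hypothetical name introduced only for illustration, and nothing is assumed about other formats the endpoint might accept.

```r
# Hypothetical helper: builds a date string in one of the two formats
# reported to work in this thread ("2013" or "2011-1").
trends_date <- function(year, month = NULL) {
  year <- as.integer(year)
  if (is.null(month)) {
    return(as.character(year))       # whole year, e.g. "2013"
  }
  month <- as.integer(month)
  stopifnot(month >= 1L, month <= 12L)
  paste0(year, "-", month)           # year-month without zero padding
}

year_only  <- trends_date(2013)
year_month <- trends_date(2011, 1)

# Usage with the question's objects (needs the logged-in handle `ch`,
# so it is not run here):
# res <- getForm(trendsURL, q = "ggplot2, ggplot", date = year_month,
#                content = 1, export = 1, graph = "all_csv", curl = ch)
```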



Source: https://stackoverflow.com/questions/20332243/trying-to-download-google-trends-data-but-date-parameter-is-ignored
