Question
Actual
I have been using the RSiteCatalyst package for a while now. For those who do not know it, it makes it easier to obtain data from Adobe Analytics over the API.
Until now, the workflow was as follows:
- Make a request, for instance:
key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
metrics = c("pageviews"), date.granularity = "month",
max.attempts = 500, interval.seconds = 20)
- Wait for the response, which is saved as a data.frame (example structure):
> View(head(key_metrics,1))
    datetime      name year month day pageviews
1 2015-07-01 July 2015 2015     7   1     45825
- Do some data transformation, for example:
key_metrics$datetime <- as.Date(key_metrics$datetime)
The problem with this workflow is that sometimes, because of request complexity, we can wait a long time until the response finally comes. If the R script contains 40-50 API requests of similar complexity, we wait 40-50 times, once per request, before the next request can be made. This is clearly generating a bottleneck in my ETL process.
Target
There is, however, a parameter enqueueOnly in most of the package's functions that tells Adobe to process the request while returning a report ID as the response:
key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
metrics = c("pageviews"), date.granularity = "month",
max.attempts = 500, interval.seconds = 20,
enqueueOnly = TRUE)
> key_metrics
[1] 1154642436
I can obtain the "real" response (the one with data) at any time by using the following function:
key_metrics <- GetReport(key_metrics)
In each request I add the parameter enqueueOnly = TRUE while building a list of report IDs and report names:
queueFromIds <- c(queueFromIds, key_metrics)
queueFromNames <- c(queueFromNames, "key_metrics")
The most important difference with this approach is that all my requests are processed by Adobe at the same time, so the waiting time decreases considerably.
Problem
I am, however, having problems obtaining the data efficiently. I am trying a while loop that removes the report ID and report name from the previous vectors once the data is obtained:
while (length(queueFromNames)>0)
{
assign(queueFromNames[1], GetReport(queueFromIds[1],
max.attempts = 3,
interval.seconds = 5))
queueFromNames <- queueFromNames[-1]
queueFromIds <- queueFromIds[-1]
}
However, this only works as long as the requests are simple enough to be processed in seconds. When a request is too complex to be processed within 3 attempts at an interval of 5 seconds, the loop stops with the following error:
Error in ApiRequest(body = toJSON(request.body), func.name = "Report.Get", : ERROR: max attempts exceeded for https://api3.omniture.com/admin/1.4/rest/?method=Report.Get
Which functions can help me check that all the API requests are processed correctly and, in the best scenario, skip the requests that need extra time (the ones that generate an error) until the end of the loop, when they are requested again?
Answer 1:
I use a couple of functions to generate/retrieve the report IDs independently. This way, it doesn't matter how long it takes the reports to be processed. I usually come back for them 12 hours after the report IDs were generated. I think they expire after 48 hours or so. These functions rely on RSiteCatalyst of course. Here are the functions:
#' Generate report IDs to be retrieved later
#'
#' @description This function works in tandem with other functions to programmatically extract big datasets from Adobe Analytics.
#' @param suite Report suite ID.
#' @param dateBegin Start date in the following format: YYYY-MM-DD.
#' @param dateFinish End date in the following format: YYYY-MM-DD.
#' @param metrics Vector containing up to 30 required metrics IDs.
#' @param elements Vector containing element IDs.
#' @param classification Vector containing classification IDs.
#' @param valueStart Integer value pointing to row to start report with.
#' @return A data frame containing all the report IDs per day. They are required to obtain all trended reports during the specified time frame.
#' @examples
#' \dontrun{
#' ReportsIDs <- reportsGenerator(suite,dateBegin,dateFinish,metrics, elements,classification)
#'}
#' @export
reportsGenerator <- function(suite,
dateBegin,
dateFinish,
metrics,
elements,
classification,
valueStart) {
#Convert dates to date format.
#Deduct one from dateBegin to
#neutralize the initial +1 in the loop.
dateBegin <- as.Date(dateBegin, "%Y-%m-%d") - 1
dateFinish <- as.Date(dateFinish, "%Y-%m-%d")
#Number of days in the period, as a plain integer rather than a difftime.
timeRange <- as.integer(dateFinish - dateBegin)
#Create data frame to store dates and report IDs
VisitorActivityReports <-
data.frame(matrix(NA, nrow = timeRange, ncol = 2))
names(VisitorActivityReports) <- c("Date", "ReportID")
#Run a loop to retrieve one ReportID for each day in the time period.
for (i in seq_len(timeRange)) {
dailyDate <- as.character(dateBegin + i)
print(i) #Visibility to end user
print(dailyDate) #Visibility to end user
VisitorActivityReports[i, 1] <- dailyDate
VisitorActivityReports[i, 2] <-
RSiteCatalyst::QueueTrended(
reportsuite.id = suite,
date.from = dailyDate,
date.to = dailyDate,
metrics = metrics,
elements = elements,
classification = classification,
top = 50000,
max.attempts = 500,
start = valueStart,
enqueueOnly = TRUE
)
}
return(VisitorActivityReports)
}
Assign the output of the previous function to a variable, then use that variable as the input of the following function, reportsRetriever, and assign its result to a variable as well. The output will be a data frame. The function will rbind all the reports together as long as they all share the same structure. Don't try to concatenate reports with different structures.
#' Retrieve all reports stored as output of reportsGenerator function and consolidate them.
#'
#' @param dataFrameReports This is the output from reportsGenerator function. It MUST contain a column titled: ReportID
#' @details It is recommended to break the input data frame in chunks of 50 rows in order to prevent memory issues if the reports are too large. Otherwise the server or local computer might run out of memory.
#' @return A data frame containing all the consolidated reports defined by the reportsGenerator function.
#' @examples
#' \dontrun{
#' visitorActivity <- reportsRetriever(dataFrameReports)
#'}
#'
#' @export
reportsRetriever <- function(dataFrameReports) {
#Fetch each report; a failed request yields NULL instead of aborting the loop.
visitor.activity.list <- lapply(dataFrameReports$ReportID, function(id) {
tryCatch(RSiteCatalyst::GetReport(id), error = function(e) NULL)
})
#rbind skips the NULL entries left by failed requests.
visitor.activity.df <- as.data.frame(do.call(rbind, visitor.activity.list))
#Validate report integrity
retrievedDates <- as.character(unique(visitor.activity.df$datetime))
if (identical(retrievedDates, dataFrameReports$Date)) {
print("Ok. All reports available")
} else {
print("Some reports may have been missed.")
#List the requested dates that are absent from the consolidated data.
print(setdiff(dataFrameReports$Date, retrievedDates))
}
return(visitor.activity.df)
}
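For the loop in the question, where reports that are not ready yet should be skipped and requested again later, something along these lines might work. It is only a sketch: retrieve_queued, max.rounds and wait.seconds are names made up for this example and are not part of RSiteCatalyst. The function that fetches a report is passed in as an argument, so the logic can be tried with a stand-in before pointing it at GetReport.

```r
# Sketch: retrieve enqueued reports, deferring the ones that are not ready.
# retrieve_queued, max.rounds and wait.seconds are hypothetical names.
# `fetch` is any function that takes one report ID and returns the report.
retrieve_queued <- function(ids, report.names, fetch,
                            max.rounds = 10, wait.seconds = 30) {
  stopifnot(length(ids) == length(report.names))
  results <- list()
  rounds <- 0
  while (length(ids) > 0 && rounds < max.rounds) {
    rounds <- rounds + 1
    still.pending <- logical(length(ids))
    for (i in seq_along(ids)) {
      # An error (or a NULL result) marks the report as not ready yet.
      report <- tryCatch(fetch(ids[i]), error = function(e) NULL)
      if (is.null(report)) {
        still.pending[i] <- TRUE
      } else {
        results[[report.names[i]]] <- report
      }
    }
    # Keep only the deferred reports for the next round.
    ids <- ids[still.pending]
    report.names <- report.names[still.pending]
    if (length(ids) > 0 && rounds < max.rounds) Sys.sleep(wait.seconds)
  }
  list(results = results, pending.ids = ids, pending.names = report.names)
}

# Stand-in for GetReport: report 2 fails on its first call only.
calls <- 0
fake_fetch <- function(id) {
  if (id == 2) {
    calls <<- calls + 1
    if (calls == 1) stop("max attempts exceeded")
  }
  data.frame(id = id)
}

out <- retrieve_queued(c(1, 2, 3), c("a", "b", "c"), fake_fetch,
                       wait.seconds = 0)
print(names(out$results))  # "a" "c" "b": report 2 was fetched in round two
```

With the vectors from the question, the call would presumably be retrieve_queued(queueFromIds, queueFromNames, function(id) GetReport(id, max.attempts = 3, interval.seconds = 5)). The results come back as a named list instead of being created with assign; list2env(out$results, envir = globalenv()) would turn them into variables if that is needed. Anything left in pending.ids after max.rounds can be requested again later, as long as the report IDs have not expired.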
Source: https://stackoverflow.com/questions/46276766/r-3-4-1-intelligent-use-of-while-loop-for-rsitecatalyst-enqueued-reports