Handling paginated SQL query results

Submitted by 让人想犯罪 __ on 2019-12-25 06:23:08

Question


For my dissertation data collection, one of the sources is an externally managed system that accepts SQL queries through a Web form. Using R and RCurl, I have implemented an automated data collection framework that simulates submitting that form. Everything worked well while I was limiting the size of the resulting dataset. But when I tried to go over 100000 records (RQ_SIZE in the code below), the combination of my code and their system became unresponsive (it "hangs").

So I decided to use SQL pagination (LIMIT ... OFFSET ...) to submit a series of requests, hoping to then combine the paginated results into a target data frame. However, after changing my code accordingly, the only output I see is a single pagination progress character (*) and then nothing more. I'd appreciate it if you could help me identify the probable cause of this unexpected behavior. I cannot provide a reproducible example, as it is very difficult to extract the functionality, not to mention the data, but I hope the following code snippet is enough to reveal the issue (or at least point toward the problem).

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1

# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {

  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))

  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData()) # retrieve result
  data <- get(dataName)

  # some code

  # add current data frame to the list
  dfList <- c(dfList, data)
  if (DEBUG) message("*", appendLF = FALSE)
}

# merge all the result pages' data frames
data <- do.call("rbind", dfList)

# save current data frame to RDS file
saveRDS(data, rdataFile)

Answer 1:


This is probably a case of a large LIMIT ... OFFSET slowing the query down, as discussed for MySQL in: Why does MYSQL higher LIMIT offset slow the query down?

Overall, fetching large data sets over HTTP repeatedly is not very reliable.




Answer 2:


Since this is for your dissertation, here is a hand:

## Folder where the results are saved to disk.
##  Ideally, use a new, empty folder. It is then easier to load from disk.
folder.out <- "~/mydissertation/sql_data_scrape/"
## Create the folder if it does not exist.
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)


## The larger this number, the more memory you will require. 
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10

## The limit size per request
RQ_SIZE <- 1000

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)


## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())

TotalNumberOfRequests <- TotalRows %/% RQ_SIZE

TotalNumberOfGroups <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1

## FYI: Total number of rows being requested is
##  (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups) 


for (g in seq(TotalNumberOfGroups)) {

  ret <- 
    lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {

      ## function(i) is the same code you have
      ##    inside your for loop, but cleaned up.

      # setup SQL pagination
      if (rq$where == '') 
          rq$where <- '1=1'

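      # offset = pages in all earlier groups plus earlier pages in this group
      rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET',
                        RQ_SIZE * ((g - 1) * NumberOfOffsetsBetweenSaves + (i - 1)))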

      # Submit data request
      srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                      DATA_SEP, ADD_SQL)

       # retrieve result
      data <- srdaGetData()

      # some code

      if (DEBUG) message("*", appendLF = FALSE)    


      ### DONT ASSIGN TO dfList, JUST RETURN `data`
      # xxxxxx DONT DO: xxxxx dfList <- c(dfList, data)
      ### INSTEAD:

      ## return
      data
  })

  ## save each iteration
  file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
  saveRDS(do.call(rbind, ret), file=file.out)

  ## OPTIONAL (this will be slower, but will keep your rams and goats in line)
  #    rm(ret)
  #    gc()
}
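
As a sanity check on the pagination arithmetic in the loop above, here is a small sketch of how a contiguous zero-based offset can be derived from a group index g and a page index i. The helper name pageOffset and the example sizes are hypothetical, chosen only for illustration; they are not part of the original code.

```r
# Hypothetical helper: zero-based row offset for page i of group g,
# where each group holds pagesPerGroup pages of pageSize rows.
pageOffset <- function(g, i, pagesPerGroup, pageSize) {
  pageSize * ((g - 1) * pagesPerGroup + (i - 1))
}

# Offsets for 2 groups of 3 pages with 1000 rows per page.
offsets <- unlist(lapply(1:2, function(g) {
  sapply(1:3, function(i) {
    pageOffset(g, i, pagesPerGroup = 3, pageSize = 1000)
  })
}))

# Consecutive offsets should differ by exactly one page size,
# with no gaps or repeats across group boundaries.
print(offsets)
print(all(diff(offsets) == 1000))
```

If the differences are not all equal to the page size, some rows are being requested twice and others skipped.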

Then, once you are done scraping:

library(data.table)

folder.out <- "~/mydissertation/sql_data_scrape/"

files <- dir(folder.out, full.names=TRUE, pattern="\\.RDS$") 

## Create an empty list
myData <- vector("list", length=length(files))


## Option 1, using data.frame
    for (i in seq(myData))
      myData[[i]] <- readRDS(files[[i]])

    DT <- do.call(rbind, myData)

## Option 2, using data.table
    for (i in seq(myData))
      myData[[i]] <- as.data.table(readRDS(files[[i]]))

    DT <- rbindlist(myData)
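
The advice above not to grow dfList with c() can be illustrated with a small sketch. The two toy data frames are made up for the example. It shows that c() on a list and a data frame appends the data frame's columns rather than the data frame itself, whereas storing each page in its own list slot keeps the pages intact for rbind.

```r
# Two made-up "pages" standing in for paginated query results.
page1 <- data.frame(id = 1:2, value = c("a", "b"))
page2 <- data.frame(id = 3:4, value = c("c", "d"))

# A data frame is itself a list of columns, so c() splices the
# columns into the list instead of adding one data frame element.
flattened <- c(list(), page1)
print(length(flattened))   # counts columns, not data frames
print(names(flattened))

# Storing each page in its own slot preserves the data frames.
pages <- vector("list", 2)
pages[[1]] <- page1
pages[[2]] <- page2

# Row-bind the intact pages into one data frame.
combined <- do.call(rbind, pages)
print(combined)
```

Inside a loop, wrapping the page as list(data) before appending would also keep it as a single element.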



Answer 3:


I'm answering my own question, as I have finally figured out the real source of the problem. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.

The reason for that is pretty simple: I used rq$where both outside and inside the for loop, which made paste() append the current iteration's LIMIT ... OFFSET to the WHERE clause already extended in the previous iteration. I fixed the code by processing the contents of the WHERE clause and saving it before the loop, and then using the saved value in each iteration, so that each page's clause no longer depends on what earlier iterations did to rq$where.
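
A minimal sketch of that kind of fix, assuming the base clause is saved once and never modified inside the loop. The helper name buildPagedWhere and the example clause are hypothetical and are not taken from the actual code.

```r
# Hypothetical helper: build the WHERE clause for one page from an
# unmodified base clause, so LIMIT/OFFSET can never accumulate.
buildPagedWhere <- function(baseWhere, pageSize, page) {
  if (!nzchar(baseWhere)) baseWhere <- "1=1"
  # sprintf("%.0f") avoids scientific notation: paste(100000)
  # would otherwise produce "1e+05" in the SQL text.
  paste(baseWhere,
        "LIMIT", sprintf("%.0f", pageSize),
        "OFFSET", sprintf("%.0f", pageSize * (page - 1)))
}

baseWhere <- "year = 2010"   # assumed example clause, saved before the loop
RQ_SIZE <- 100000

for (i in 1:3) {
  # Each iteration starts from baseWhere, not from the previous page.
  pagedWhere <- buildPagedWhere(baseWhere, RQ_SIZE, i)
  message(pagedWhere)
}
```

Because the loop only reads baseWhere, every generated clause contains exactly one LIMIT and one OFFSET.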

This investigation also helped me fix some other deficiencies in my code and make improvements (such as using sub-selects to correctly count the number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. A big thank you to the nice people who helped with this question.



Source: https://stackoverflow.com/questions/24264817/handling-paginated-sql-query-results
