Import newest csv file in directory

Question

Goal:
- Import the newest file (.csv) from a local directory into R

Goal Details:
- A csv file is uploaded to a folder daily on my Mac. I would like to be able to incorporate a function in my R script that automatically imports the newest file into my workspace for further analysis. The file is uploaded daily around 4:30AM
- I would like this function to be run in the morning (no earlier than 6AM so there's plenty of time for leeway here)

Input Details:
- file type: .csv
- naming convention: example file name: "28 Jul 2014 04:37:47 -0400.csv"
- frequency: daily import @ ~ 04:30

What I've Tried:
- I know this may seem like a weak attempt, but I'm really at a loss as to how to amend the function below.
- My thought on paper is to 'grab' the name of the newest file, then paste() it together with the directory name, then voilà! (but alas, my programming skills are too lacking to code this)
- The code below is what I tried to run, but it just 'hangs' and doesn't finish. I got this code from an R forum.

Code:

lastChange = file.info(directory)$mtime 
while(TRUE){ 
  currentM = file.info(directory)$mtime 
  if(currentM != lastChange){ 
    lastChange = currentM 
    read.csv(directory) 
  } 
  # try again in 10 minutes 
  Sys.sleep(600) 
}

My Environment:
- R 3.1
- Mac OS X 10.9.4 (Mavericks)

Thank you so much in advance for any help! :-)

Answer 1:

The following function uses a timestamp file to keep track of which files have already been processed. It can be run either continually in one R instance (as you first suggested) or as single-run instances, which lends itself to @andrew's suggestion of a cron job. Note that the stamp file is created on the first run, so files already in the directory at that point are treated as processed. (The cat() command is included primarily for testing; feel free to remove it.)

processDir <- function(directory = '.', pattern = '\\.csv$', loop = FALSE, delay = 600,
                       stampFile = file.path(directory, '.csvProcessor')) {
    if (! file.exists(stampFile))
        file.create(stampFile)
    firstRun <- TRUE
    while (firstRun || loop) {
        firstRun <- FALSE
        stampTime <- file.info(stampFile)$mtime
        allFilesDF <- file.info(list.files(path = directory, pattern = pattern,
                                           full.names = TRUE, no.. = TRUE))
        unprocessedFiles <- allFilesDF[(! allFilesDF$isdir) &
                                       (allFilesDF$mtime > stampTime), ]
        if (nrow(unprocessedFiles)) {
            ## We need to update the timestamp on stampFile quickly so
            ## that files added while this is running will be found in the
            ## next loop.
            ## WARNING: this blindly truncates the stampFile.
            file.create(stampFile, showWarnings = FALSE)
            for (fn in rownames(unprocessedFiles)) {
                cat('Processing ', fn, '\n')
                ## read.csv(fn)
                ## ...
            }
        }
        if (loop) Sys.sleep(delay)
    }
}

As you initially suggested, running it in a continually-running R instance would simply be:

processDir(loop = TRUE)

To use @andrew's suggestion of a cron job, append the following line after the function definition:

processDir()

... and use a crontab file similar to the following:

# crontab
0 8 * * * path/to/Rscript path/to/processDir.R
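
The `## read.csv(fn)` placeholder inside the loop is where each new file actually gets imported. Here is a minimal sketch of one way to fill it in, assuming you want the files collected in a named list. `readCsvFiles` is a hypothetical helper, not part of the function above, and the temporary directory is only there to make the example runnable:

```r
## Hypothetical helper: read every path into a data frame and return
## the results as a list named by file name (directory part dropped).
readCsvFiles <- function(paths, ...) {
    out <- lapply(paths, read.csv, ...)
    names(out) <- basename(paths)
    out
}

## Example with two throwaway files in a temporary directory
tmp <- file.path(tempdir(), "csvDemo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:5), file.path(tmp, "b.csv"), row.names = FALSE)
dat <- readCsvFiles(list.files(tmp, pattern = "\\.csv$", full.names = TRUE))
```

Inside processDir, the for loop could then become something like `dat <- readCsvFiles(rownames(unprocessedFiles))`, with `dat` returned at the end when `loop = FALSE`.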

Hope this helps.

Answer 2:

-- readfile.R --

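## directory: path to the folder holding the daily csv files
files <- file.info(list.files(directory, pattern = "\\.csv$", full.names = TRUE))
newest <- read.csv(rownames(files)[which.max(files$mtime)])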

I'd put the above script in a cron job that runs every morning at a time when the file for the day will have been written. The crontab entry below runs it every morning at 8am.

-- in crontab --

0 8 * * *  Rscript readfile.R
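
Since the file names in the question carry their own timestamp ("28 Jul 2014 04:37:47 -0400.csv"), an alternative to mtime is to parse the name itself. This is a rough sketch under two assumptions: every file follows that naming convention exactly, and R runs in an English locale (`%b` is locale-dependent). `newestByName` is a hypothetical helper name:

```r
## Hypothetical helper: pick the newest file by the timestamp in its name.
## Assumes names like "28 Jul 2014 04:37:47 -0400.csv" and an English locale.
newestByName <- function(paths) {
    stamps <- as.POSIXct(sub("\\.csv$", "", basename(paths)),
                         format = "%d %b %Y %H:%M:%S %z", tz = "UTC")
    ## names that fail to parse become NA and are ignored by which.max()
    paths[which.max(as.numeric(stamps))]
}

## Example: sorting the names alphabetically would put "30 Jun" last,
## but the parsed dates show that "01 Jul" is the newer file.
paths <- c("some/dir/30 Jun 2014 04:30:11 -0400.csv",
           "some/dir/01 Jul 2014 04:31:02 -0400.csv")
newest <- newestByName(paths)
```

This avoids surprises when a file's modification time changes for unrelated reasons (for example after copying it), at the cost of depending on the naming convention.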