Question
I'm trying to speed up some code in R. I think my looping methods can be replaced (maybe with some form of lapply or using sqldf) but I can't seem to figure out how.
The basic premise is that I have a parent directory with ~50 subdirectories, and each of those subdirectories contains ~200 CSV files (a total of 10,000 CSVs). Each of those CSV files contains ~86,400 lines (data is daily by the second).
The goal of the script is to calculate the mean and stdev for two intervals of time from each file, and then make one summary plot for each subdirectory as follows:
library(timeSeries)
library(ggplot2)

# list subdirectories in parent directory
dir <- list.dirs(path = "/ParentDirectory", full.names = TRUE, recursive = FALSE)
num <- length(dir)

# iterate through all subdirectories
for (idx in 1:num){
  # declare empty vectors to fill for each subdirectory
  DayVal <- c()
  DayStd <- c()
  NightVal <- c()
  NightStd <- c()
  date <- as.Date(character())

  setwd(dir[idx])
  filenames <- list.files(path = getwd())
  numfiles <- length(filenames)

  # for each file in the subdirectory
  for (i in c(1:numfiles)){
    day <- read.csv(filenames[i], sep = ',')
    today <- as.Date(day$time[1], "%Y-%m-%d")

    # setting interval for times of day we care about <- SQL seems like it may be
    # useful here but I couldn't get read.csv.sql to recognize hourly intervals
    nightThreshold <- as.POSIXct(paste(today, "03:00:00"))
    dayThreshold <- as.POSIXct(paste(today, "15:00:00"))
    nightInt <- day[(as.POSIXct(day$time) >= nightThreshold &
                     as.POSIXct(day$time) <= (nightThreshold + 3600)), ]
    dayInt <- day[(as.POSIXct(day$time) >= dayThreshold &
                   as.POSIXct(day$time) <= (dayThreshold + 3600)), ]

    # check some thresholds in the data for that time period
    if (sum(nightInt$val, na.rm = TRUE) < 5){
      NightMean <- mean(nightInt$val, na.rm = TRUE)
      NightSD <- sd(nightInt$val, na.rm = TRUE)
    } else {
      NightMean <- NA
      NightSD <- NA
    }
    if (sum(dayInt$val, na.rm = TRUE) > 5){
      DayMean <- mean(dayInt$val, na.rm = TRUE)
      DaySD <- sd(dayInt$val, na.rm = TRUE)
    } else {
      DayMean <- NA
      DaySD <- NA
    }
    NightVal <- c(NightVal, NightMean)
    NightStd <- c(NightStd, NightSD)
    DayVal <- c(DayVal, DayMean)
    DayStd <- c(DayStd, DaySD)
    date <- c(date, as.Date(today))
  }

  df <- data.frame(date, DayVal, DayStd, NightVal, NightStd)

  # plot for the subdirectory
  p1 <- ggplot() +
    geom_point(data = df, aes(x = date, y = DayVal, color = "Day Average")) +
    geom_point(data = df, aes(x = date, y = DayStd, color = "Day Standard Dev")) +
    geom_point(data = df, aes(x = date, y = NightVal, color = "Night Average")) +
    geom_point(data = df, aes(x = date, y = NightStd, color = "Night Standard Dev")) +
    scale_colour_manual(values = c("steelblue", "turquoise3", "purple3", "violet"))
  print(p1)  # inside a loop, plots must be printed explicitly
}
Thanks very much for any advice you can offer!
Answer 1:
Consider an SQL database solution, since you are managing quite a bit of data in flat files. A relational database management system (RDBMS) can easily handle millions of records, and can even aggregate as needed using its scalable engine rather than processing everything in memory in R. Beyond speed and efficiency, a database also provides security, robustness, and organization as a central repository. You could even script the import of each new daily csv directly into the database going forward.
Fortunately, practically all RDBMSs have CSV handlers and can load multiple files in bulk. Below are open-source solutions: SQLite (a file-level database), and MySQL and PostgreSQL (both server-level databases), all of which have corresponding libraries in R. Each example recursively imports the csv files from the directory list into a database table named timeseriesdata
(with the same field names/data types as the csv files). At the end is one SQL call that retrieves an aggregation of the Night and Day interval mean and standard deviation (adjust as needed). The only challenge is designating a file and subdirectory indicator (which may or may not exist in the actual data) and populating it as csv files are appended (possibly, after each iteration, run an update query against a FileID
column).
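That per-file FileID tagging could be sketched as follows. This snippet only builds the SQL string; the timeseriesdata table and FileID column are the assumed schema from above, and in the real import loop the string would be passed to dbSendQuery right after each file's rows are loaded:

```r
# Build an UPDATE that stamps the rows just imported with a file identifier.
# 'timeseriesdata' and 'FileID' follow the assumed schema described above.
make_fileid_update <- function(file_id) {
  paste0("UPDATE timeseriesdata SET FileID = '", file_id,
         "' WHERE FileID IS NULL;")
}

sql <- make_fileid_update("subdir01/2015-12-01.csv")  # hypothetical file name
# in the import loop: dbSendQuery(conn, sql)
```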
dir <- list.dirs(path = "/ParentDirectory",
                 full.names = TRUE, recursive = FALSE)

# SQLITE DATABASE
library(RSQLite)

sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")
# (CONNECTION NOT NEEDED DUE TO CMD LINE LOAD BELOW)

for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT VIA COMMAND LINE OR BASH (ASSUMES SQLITE3 IS ON THE PATH)
    cmd <- paste0("sqlite3 /path/to/database.db ",
                  "'.mode csv' '.import ", csvfile, " timeseriesdata'")
    system(cmd)
  }
}

# CLOSE CONNECTION
dbDisconnect(sqconn)
# MYSQL DATABASE
library(RMySQL)

myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
                    username="username", password="***")

for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT USING LOAD DATA INFILE COMMAND
    sql <- paste0("LOAD DATA INFILE '", csvfile, "'
                   INTO TABLE timeseriesdata
                   FIELDS TERMINATED BY ','
                   ENCLOSED BY '\"'
                   ESCAPED BY '\"'
                   LINES TERMINATED BY '\\n'
                   IGNORE 1 LINES
                   (col1, col2, col3, col4, col5);")
    dbSendQuery(myconn, sql)
    dbCommit(myconn)
  }
}

# CLOSE CONNECTION
dbDisconnect(myconn)
# POSTGRESQL DATABASE
library(RPostgreSQL)

pgconn <- dbConnect(PostgreSQL(), dbname="databasename", host="myhost",
                    user="postgres", password="***")

for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT USING COPY COMMAND (paste0 avoids stray spaces inside the quoted path)
    sql <- paste0("COPY timeseriesdata(col1, col2, col3, col4, col5)
                   FROM '", csvfile, "' DELIMITER ',' CSV;")
    dbSendQuery(pgconn, sql)
  }
}

# CLOSE CONNECTION
dbDisconnect(pgconn)
# CREATE PLOT DATA FRAME (MYSQL EXAMPLE)
# (ADD INSIDE SUBDIRECTORY LOOP OR INCLUDE SUBDIR COLUMN IN GROUP BY)
library(RMySQL)

myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
                    username="username", password="***")

# AGGREGATE QUERY USING TWO DERIVED TABLE SUBQUERIES
# (NIGHT 03:00-04:00 AND DAY 15:00-16:00 PER THE QUESTION; ADJUST FILTERS PER NEEDS)
strSQL <- "SELECT ng.FileID, NightMean, NightSTD, DayMean, DaySTD
           FROM
               (SELECT nt.FileID, AVG(nt.val) AS NightMean, STDDEV(nt.val) AS NightSTD
                FROM timeseriesdata nt
                WHERE nt.time >= '03:00:00' AND nt.time <= '04:00:00'
                GROUP BY nt.FileID
                HAVING SUM(nt.val) < 5) AS ng
           INNER JOIN
               (SELECT dt.FileID, AVG(dt.val) AS DayMean, STDDEV(dt.val) AS DaySTD
                FROM timeseriesdata dt
                WHERE dt.time >= '15:00:00' AND dt.time <= '16:00:00'
                GROUP BY dt.FileID
                HAVING SUM(dt.val) > 5) AS dy
           ON ng.FileID = dy.FileID;"

df <- dbGetQuery(myconn, strSQL)
dbDisconnect(myconn)
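Back in R, the aggregate result can feed the summary plot from the question once it is reshaped to long format, so one geom_point() call covers all four series. A minimal sketch with hypothetical values standing in for the query result (the ggplot2 part only runs if the package is installed):

```r
# Hypothetical aggregate result, shaped like the query output above
df <- data.frame(
  date     = as.Date(c("2015-12-01", "2015-12-02")),
  DayVal   = c(3.1, 2.8), DayStd   = c(0.4, 0.5),
  NightVal = c(1.2, 1.5), NightStd = c(0.2, 0.3)
)

# Reshape wide -> long: one row per (date, series) pair
long <- reshape(df, direction = "long",
                varying = c("DayVal", "DayStd", "NightVal", "NightStd"),
                v.names = "value", timevar = "series",
                times = c("Day Average", "Day Standard Dev",
                          "Night Average", "Night Standard Dev"))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p1 <- ggplot(long, aes(x = date, y = value, color = series)) +
    geom_point() +
    scale_colour_manual(values = c("steelblue", "turquoise3",
                                   "purple3", "violet"))
}
```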
Answer 2:
One thing would be to do the conversion of day$time once per file instead of all the times you are doing it now. Also consider the lubridate package: if you have a large number of times to convert, it is much faster than as.POSIXct.
Also pre-size the variables you are storing results in, e.g. DayVal and DayStd, to the appropriate length (DayVal <- numeric(numfiles)) and then assign each result into the appropriate index, rather than growing the vectors with c().
If the CSV files are large, consider using the fread function in the data.table package instead of read.csv.
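A minimal sketch of that preallocation pattern, with rnorm values standing in for each file's dayInt$val (in the real script the one-time conversion would be something like lubridate::ymd_hms(day$time), reused for both interval filters):

```r
# Sketch: preallocate result vectors and assign by index instead of c()
numfiles <- 200                  # stand-in for length(filenames) per subdirectory
DayVal <- numeric(numfiles)      # sized once up front ...
DayStd <- numeric(numfiles)

for (i in seq_len(numfiles)) {
  vals <- rnorm(100)             # stand-in for file i's dayInt$val
  # (convert day$time ONCE here, then reuse the result for both intervals)
  DayVal[i] <- mean(vals, na.rm = TRUE)  # ... then filled by index,
  DayStd[i] <- sd(vals, na.rm = TRUE)    # never DayVal <- c(DayVal, x)
}
```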
Source: https://stackoverflow.com/questions/34191127/speed-up-r-script-looping-through-files-folders-to-check-thresholds-calculate-a