Question
I'm trying to speed up some code in R. I think my looping methods can be replaced (maybe with some form of lapply or using sqldf) but I can't seem to figure out how.
The basic premise is that I have a parent directory with ~50 subdirectories, and each of those subdirectories contains ~200 CSV files (a total of 10,000 CSVs). Each of those CSV files contains ~86,400 lines (data is daily by the second).
The goal of the script is to calculate the mean and stdev for two intervals of time from each file, and then make one summary plot for each subdirectory as follows:
library(timeSeries)
library(ggplot2)

# list subdirectories in parent directory
dir <- list.dirs(path = "/ParentDirectory", full.names = TRUE, recursive = FALSE)
num <- length(dir)

# iterate through all subdirectories
for (idx in 1:num){
  # declare empty vectors to fill for each subdirectory
  DayVal <- c()
  DayStd <- c()
  NightVal <- c()
  NightStd <- c()
  date <- as.Date(character())

  setwd(dir[idx])
  filenames <- list.files(path = getwd())
  numfiles <- length(filenames)

  # for each file in the subdirectory
  for (i in c(1:numfiles)){
    day <- read.csv(filenames[i], sep = ',')
    today <- as.Date(day$time[1], "%Y-%m-%d")

    # setting interval for times of day we care about <- SQL seems like it may be
    # useful here but I couldn't get read.csv.sql to recognize hourly intervals
    nightThreshold <- as.POSIXct(paste(today, "03:00:00"))
    dayThreshold <- as.POSIXct(paste(today, "15:00:00"))
    nightInt <- day[(as.POSIXct(day$time) >= nightThreshold &
                     as.POSIXct(day$time) <= (nightThreshold + 3600)), ]
    dayInt <- day[(as.POSIXct(day$time) >= dayThreshold &
                   as.POSIXct(day$time) <= (dayThreshold + 3600)), ]

    # check some thresholds in the data for that time period
    if (sum(nightInt$val, na.rm = TRUE) < 5){
      NightMean <- mean(nightInt$val, na.rm = TRUE)
      NightSD <- sd(nightInt$val, na.rm = TRUE)
    } else {
      NightMean <- NA
      NightSD <- NA
    }
    if (sum(dayInt$val, na.rm = TRUE) > 5){
      DayMean <- mean(dayInt$val, na.rm = TRUE)
      DaySD <- sd(dayInt$val, na.rm = TRUE)
    } else {
      DayMean <- NA
      DaySD <- NA
    }
    NightVal <- c(NightVal, NightMean)
    NightStd <- c(NightStd, NightSD)
    DayVal <- c(DayVal, DayMean)
    DayStd <- c(DayStd, DaySD)
    date <- c(date, as.Date(today))
  }

  df <- data.frame(date, DayVal, DayStd, NightVal, NightStd)

  # plot for the subdirectory
  p1 <- ggplot() +
    geom_point(data = df, aes(x = date, y = DayVal, color = "Day Average")) +
    geom_point(data = df, aes(x = date, y = DayStd, color = "Day Standard Dev")) +
    geom_point(data = df, aes(x = date, y = NightVal, color = "Night Average")) +
    geom_point(data = df, aes(x = date, y = NightStd, color = "Night Standard Dev")) +
    scale_colour_manual(values = c("steelblue", "turquoise3", "purple3", "violet"))
  print(p1)  # inside a loop, plots must be printed explicitly
}
Thanks very much for any advice you can offer!
Answer 1:
Consider an SQL database solution, since you are managing quite a bit of data in flat files. A relational database management system (RDBMS) can easily handle millions of records, and can even aggregate as needed using its scalable engine rather than processing everything in memory in R. Beyond speed and efficiency, a database also provides security, robustness, and organization as a central repository. You could even script the import of each new daily csv directly into the database going forward.
Fortunately, practically all RDBMSs have CSV handlers and can load multiple files in bulk. Below are open-source solutions: SQLite (a file-level database), and MySQL and PostgreSQL (both server-level databases), all of which have corresponding libraries in R. Each example recursively imports the csv files from the directory list into a database table named timeseriesdata
(with the same field names/data types as the csv files). At the end is one SQL call that retrieves an aggregation of the Night and Day interval mean and standard deviation (adjust as needed). The only challenge is designating a file and subdirectory indicator (which may or may not exist in the actual data) and populating it as csv files are appended (possibly, after each iteration, run an update query against a FileID
column).
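That per-file FileID tagging could be sketched as follows. This snippet only builds the SQL string; the timeseriesdata table and FileID column are the assumed schema from above, and in the real import loop the string would be passed to dbSendQuery right after each file's rows are loaded:

```r
# Build an UPDATE that stamps the rows just imported with a file identifier.
# 'timeseriesdata' and 'FileID' follow the assumed schema described above.
make_fileid_update <- function(file_id) {
  paste0("UPDATE timeseriesdata SET FileID = '", file_id,
         "' WHERE FileID IS NULL;")
}

sql <- make_fileid_update("subdir01/2015-12-01.csv")  # hypothetical file name
# in the import loop: dbSendQuery(conn, sql)
```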
dir <- list.dirs(path = "/ParentDirectory",
                 full.names = TRUE, recursive = FALSE)

# SQLITE DATABASE
library(RSQLite)

sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")
# (CONNECTION NOT NEEDED DUE TO CMD LINE LOAD BELOW)

for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT VIA COMMAND LINE OR BASH (ASSUMES SQLITE3 IS ON THE PATH)
    cmd <- paste0("sqlite3 /path/to/database.db ",
                  "'.mode csv' '.import ", csvfile, " timeseriesdata'")
    system(cmd)
  }
}

# CLOSE CONNECTION
dbDisconnect(sqconn)
# MYSQL DATABASE
library(RMySQL)

myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
                    username="username", password="***")

for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT USING LOAD DATA INFILE COMMAND
    sql <- paste0("LOAD DATA INFILE '", csvfile, "'
                   INTO TABLE timeseriesdata
                   FIELDS TERMINATED BY ','
                   ENCLOSED BY '\"'
                   ESCAPED BY '\"'
                   LINES TERMINATED BY '\\n'
                   IGNORE 1 LINES
                   (col1, col2, col3, col4, col5);")
    dbSendQuery(myconn, sql)
    dbCommit(myconn)
  }
}

# CLOSE CONNECTION
dbDisconnect(myconn)
# POSTGRESQL DATABASE
library(RPostgreSQL)

pgconn <- dbConnect(PostgreSQL(), dbname="databasename", host="myhost",
                    user="postgres", password="***")

for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT USING COPY COMMAND (paste0 avoids stray spaces inside the quoted path)
    sql <- paste0("COPY timeseriesdata(col1, col2, col3, col4, col5)
                   FROM '", csvfile, "' DELIMITER ',' CSV;")
    dbSendQuery(pgconn, sql)
  }
}

# CLOSE CONNECTION
dbDisconnect(pgconn)
# CREATE PLOT DATA FRAME (MYSQL EXAMPLE)
# (ADD INSIDE SUBDIRECTORY LOOP OR INCLUDE SUBDIR COLUMN IN GROUP BY)
library(RMySQL)

myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
                    username="username", password="***")

# AGGREGATE QUERY USING TWO DERIVED TABLE SUBQUERIES
# (NIGHT 03:00-04:00 AND DAY 15:00-16:00 PER THE QUESTION; ADJUST FILTERS PER NEEDS)
strSQL <- "SELECT ng.FileID, NightMean, NightSTD, DayMean, DaySTD
           FROM
               (SELECT nt.FileID, AVG(nt.val) AS NightMean, STDDEV(nt.val) AS NightSTD
                FROM timeseriesdata nt
                WHERE nt.time >= '03:00:00' AND nt.time <= '04:00:00'
                GROUP BY nt.FileID
                HAVING SUM(nt.val) < 5) AS ng
           INNER JOIN
               (SELECT dt.FileID, AVG(dt.val) AS DayMean, STDDEV(dt.val) AS DaySTD
                FROM timeseriesdata dt
                WHERE dt.time >= '15:00:00' AND dt.time <= '16:00:00'
                GROUP BY dt.FileID
                HAVING SUM(dt.val) > 5) AS dy
           ON ng.FileID = dy.FileID;"

df <- dbGetQuery(myconn, strSQL)
dbDisconnect(myconn)
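Back in R, the aggregate result can feed the summary plot from the question once it is reshaped to long format, so one geom_point() call covers all four series. A minimal sketch with hypothetical values standing in for the query result (the ggplot2 part only runs if the package is installed):

```r
# Hypothetical aggregate result, shaped like the query output above
df <- data.frame(
  date     = as.Date(c("2015-12-01", "2015-12-02")),
  DayVal   = c(3.1, 2.8), DayStd   = c(0.4, 0.5),
  NightVal = c(1.2, 1.5), NightStd = c(0.2, 0.3)
)

# Reshape wide -> long: one row per (date, series) pair
long <- reshape(df, direction = "long",
                varying = c("DayVal", "DayStd", "NightVal", "NightStd"),
                v.names = "value", timevar = "series",
                times = c("Day Average", "Day Standard Dev",
                          "Night Average", "Night Standard Dev"))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p1 <- ggplot(long, aes(x = date, y = value, color = series)) +
    geom_point() +
    scale_colour_manual(values = c("steelblue", "turquoise3",
                                   "purple3", "violet"))
}
```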
Answer 2:
One thing would be to do the conversion of day$time once per file instead of all the times you are doing it now. Also consider the lubridate package: if you have a large number of times to convert, it is much faster than as.POSIXct.
Also pre-size the variables you are storing results in, e.g. DayVal and DayStd, to the appropriate length (DayVal <- numeric(numfiles)) and then assign each result into the appropriate index, rather than growing the vectors with c().
If the CSV files are large, consider using the fread function in the data.table package instead of read.csv.
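A minimal sketch of that preallocation pattern, with rnorm values standing in for each file's dayInt$val (in the real script the one-time conversion would be something like lubridate::ymd_hms(day$time), reused for both interval filters):

```r
# Sketch: preallocate result vectors and assign by index instead of c()
numfiles <- 200                  # stand-in for length(filenames) per subdirectory
DayVal <- numeric(numfiles)      # sized once up front ...
DayStd <- numeric(numfiles)

for (i in seq_len(numfiles)) {
  vals <- rnorm(100)             # stand-in for file i's dayInt$val
  # (convert day$time ONCE here, then reuse the result for both intervals)
  DayVal[i] <- mean(vals, na.rm = TRUE)  # ... then filled by index,
  DayStd[i] <- sd(vals, na.rm = TRUE)    # never DayVal <- c(DayVal, x)
}
```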
Source: https://stackoverflow.com/questions/34191127/speed-up-r-script-looping-through-files-folders-to-check-thresholds-calculate-a