I am trying to write a function in R that takes three inputs:
- directory
- pollutant
- id
I have a directory on my computer containing over 300 CSV files. What the function should do is shown in the prototype below:
pollutantmean <- function(directory, pollutant, id = 1:332) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'pollutant' is a character vector of length 1 indicating
## the name of the pollutant for which we will calculate the
## mean; either "sulfate" or "nitrate".
## 'id' is an integer vector indicating the monitor ID numbers
## to be used
## Return the mean of the pollutant across all monitors list
## in the 'id' vector (ignoring NA values)
}
An example output of this function is shown here:
source("pollutantmean.R")
pollutantmean("specdata", "sulfate", 1:10)
## [1] 4.064
pollutantmean("specdata", "nitrate", 70:72)
## [1] 1.706
pollutantmean("specdata", "nitrate", 23)
## [1] 1.281
I can read all the files in one go with:
path = "C:/Users/Sean/Documents/R Projects/Data/specdata"
fileList = list.files(path=path,pattern="\\.csv$",full.names=T)
all.files.data = lapply(fileList,read.csv,header=TRUE)
DATA = do.call("rbind",all.files.data)
My issues are:
- The user enters id either as a single value or as a range. For example, the user enters 1 but the file name is 001.csv, or the user enters the range 1:10 but the file names are 001.csv ... 010.csv.
- The user enters the column ("sulfate" or "nitrate") whose mean he/she is interested in. There are a lot of missing values in these columns, which I need to omit before calculating the mean.
The combined data from all the files looks like this:
summary(DATA)
         Date           sulfate           nitrate              ID
 2004-01-01:   250   Min.   : 0.0      Min.   : 0.0      Min.   :  1.0
 2004-01-02:   250   1st Qu.: 1.3      1st Qu.: 0.4      1st Qu.: 79.0
 2004-01-03:   250   Median : 2.4      Median : 0.8      Median :168.0
 2004-01-04:   250   Mean   : 3.2      Mean   : 1.7      Mean   :164.5
 2004-01-05:   250   3rd Qu.: 4.0      3rd Qu.: 2.0      3rd Qu.:247.0
 2004-01-06:   250   Max.   :35.9      Max.   :53.9      Max.   :332.0
 (Other)   :770587   NA's   :653304    NA's   :657738
Any ideas on how to put this together would be highly appreciated.
Cheers
You can simulate your situation like this:
# Simulate some data:
# Create 332 data frames
set.seed(1)
df.list<-replicate(332,data.frame(sulfate=rnorm(100),nitrate=rnorm(100)),simplify=FALSE)
# Generate names like 001.csv and 010.csv
file.names<-paste0('specdata/',sprintf('%03d',1:332),'.csv')
# Write them to disk (the target directory must exist first)
dir.create('specdata', showWarnings=FALSE)
invisible(mapply(write.csv,df.list,file.names))
And here is a function that would read those files:
pollutantmean <- function(directory, pollutant, id = 1:332) {
file.names <- list.files(directory)
file.numbers <- as.numeric(sub('\\.csv$','', file.names))
selected.files <- na.omit(file.names[match(id, file.numbers)])
selected.dfs <- lapply(file.path(directory,selected.files), read.csv)
# unlist() pools the values even when the files have different row counts
mean(unlist(lapply(selected.dfs, function(x) x[ ,pollutant])), na.rm=TRUE)
}
pollutantmean('specdata','nitrate',c(1:100,141))
# [1] -0.005450574
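To see how the match() step maps user IDs onto zero-padded file names, here is a small sketch on a made-up listing. The file names and the id vector below are assumptions for illustration only; nothing is read from disk.

```r
# Hypothetical listing, as list.files() might return it
file.names <- c("001.csv", "002.csv", "010.csv", "141.csv")

# "001.csv" -> "001" -> 1, so leading zeros disappear in the conversion
file.numbers <- as.numeric(sub("\\.csv$", "", file.names))

id <- c(1, 10, 5)                   # 5 has no matching file
idx <- match(id, file.numbers)      # positions 1 and 3, then NA for the missing ID

# na.omit() drops the unmatched ID; as.character() strips the na.omit attributes
selected <- as.character(na.omit(file.names[idx]))
selected
# [1] "001.csv" "010.csv"
```

IDs without a file are dropped silently here; whether that or an error is preferable depends on how strict you want the function to be.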
> User enters id either atomic or in a range, e.g. suppose user enters 1 but the file name is 001.csv, or what if user enters a range 1:10 then file names are 001.csv ... 010.csv
You could use a regular expression and the gsub function to remove leading zeros from the file names, then make a dictionary (in R, a named vector) that maps the modified file names to the actual file names.
For example, if your file names are in a character vector fnames:
fnames = c("001.csv","002.csv")
names(fnames) <- gsub(pattern="^[0]*", replacement="", x=fnames)
With this, the vector fnames works as a dictionary, letting you look up the file named 001.csv with something along the lines of fnames["1.csv"]. You can also use gsub() to remove the .csv part of the file name.
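Putting both substitutions together, a minimal sketch of the lookup might look like the following. The file names and the id vector are made up for illustration, and sub() is used instead of gsub() because each pattern can match only once per name.

```r
fnames <- c("001.csv", "002.csv", "010.csv")   # hypothetical file names

# Strip the extension first, then the leading zeros: keys become "1", "2", "10"
keys <- sub("^0+", "", sub("\\.csv$", "", fnames))
names(fnames) <- keys

# Index by character so the lookup goes through the names, not the positions
id <- c(1, 10)
picked <- fnames[as.character(id)]
unname(picked)
# [1] "001.csv" "010.csv"
```

Note the as.character(id): indexing with a plain number would pick by position, so fnames[10] would look for a tenth element rather than the file named 010.csv.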
> Column is entered by user, i.e. "sulfate" or "nitrate", which he/she is interested in getting the mean of. There are a lot of missing values in these columns (which I need to omit from the column before calculating the mean).
Many R functions have an option for ignoring NA, the special value that marks a missing observation. Try entering help(mean) at the R command prompt to find information on this functionality.
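For mean() the relevant argument is na.rm. A tiny sketch on a made-up vector shows the difference:

```r
x <- c(2, NA, 4, NA, 6)     # hypothetical readings with missing values

mean(x)                     # NA: a single missing value poisons the result
mean(x, na.rm = TRUE)       # 4: the NAs are dropped before averaging
mean(x[!is.na(x)])          # 4: the same thing done by hand
```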
Here is a solution that even your grandmother could understand:
pollutantmean <- function(directory, pollutant, id = 1:332) {
# Break this function up into a series of smaller functions
# that do exactly what you expect them to. Your friends
# will love you for it.
csvFiles = getFilesById(id, directory)
dataFrames = readMultipleCsvFiles(csvFiles)
dataFrame = bindMultipleDataFrames(dataFrames)
getColumnMean(dataFrame, column = pollutant)
}
getFilesById <- function(id, directory = getwd()) {
allFiles = list.files(directory)
file.path(directory, allFiles[id])
}
readMultipleCsvFiles <- function(csvFiles) {
lapply(csvFiles, read.csv)
}
bindMultipleDataFrames <- function(dataFrames) {
Reduce(function(x, y) rbind(x, y), dataFrames)
}
getColumnMean <- function(dataFrame, column, ignoreNA = TRUE) {
mean(dataFrame[ , column], na.rm = ignoreNA)
}
This is how I fixed it:
pollutantmean <- function(directory, pollutant, id = 1:332) {
#set the path
path = directory
#get the file List in that directory
fileList = list.files(path)
#extract the file names and store as numeric for comparison
file.names = as.numeric(sub("\\.csv$","",fileList))
#select files to be imported based on the user input or default
selected.files = fileList[match(id,file.names)]
#import data
Data = lapply(file.path(path,selected.files),read.csv)
#convert into data frame
Data = do.call(rbind.data.frame,Data)
#calculate mean
mean(Data[,pollutant],na.rm=TRUE)
}
One last question: my function should take "specdata" (the name of the directory where all the CSVs are located) as the directory. Is there a directory-type object in R?
Suppose I call the function as:
pollutantmean(specdata, "nitrate", 1:10)
It should get the path of the specdata directory, which is in my working directory. How can I do that?
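On that follow-up: as far as base R goes there is no dedicated directory type. A path is an ordinary character string, and a relative one such as "specdata" is resolved against the working directory. An unquoted specdata would instead be looked up as a variable name. A minimal sketch, assuming a throwaway "specdata" folder created inside a temporary directory just for the demonstration:

```r
# Hypothetical setup: an empty "specdata" folder inside a temporary directory
old.wd <- setwd(tempdir())
dir.create("specdata", showWarnings = FALSE)

directory <- "specdata"                  # just a character string
is.string <- is.character(directory)     # TRUE
found <- dir.exists(directory)           # relative path, resolved against getwd()
abs.path <- normalizePath(directory)     # absolute form of the same folder

setwd(old.wd)                            # restore the original working directory
```

So calling pollutantmean("specdata", "nitrate", 1:10) with the name in quotes should be enough, provided the working directory is the one that contains the folder.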
Here's a somewhat general function for calculating the mean for a specific column over a list of files. Not sure how id should be set up, but right now it acts as an indexing vector (i.e. id = 1:3 calculates the mean for the first three files in the file list).
multifile.means <- function(directory = getwd(), pollutant, id = NULL)
{
d <- match.arg(directory, list.files())
cn <- match.arg(pollutant, c('sulfate', 'nitrate'))
## get a vector of complete file paths in the given 'directory'
p <- dir(d, full.names = TRUE)
## subset 'p' based on 'id' values
if(!is.null(id)){
id <- id[!id > length(p)]
p <- p[id]
}
## read, store, and name the relevant columns
cl <- sapply(p, function(x){ read.csv(x)[,cn] }, USE.NAMES = FALSE)
colnames(cl) <- basename(p)
## return a named list of some results
list(values = cl,
mean = mean(cl, na.rm = TRUE),
colMeans = colMeans(cl, na.rm = TRUE))
}
Take it for a test-drive:
> multifile.means('testDir', 'sulfate')
# $values
# 001.csv 057.csv 146.csv 213.csv
# [1,] 5 10 NA 9
# [2,] 1 1 10 3
# [3,] 10 4 10 2
# [4,] 3 10 9 NA
# [5,] 4 1 5 5
# $mean
# [1] 5.666667
# $colMeans
# 001.csv 057.csv 146.csv 213.csv
# 4.60 5.20 8.50 4.75
The selected answer looks good but here's an alternative. This answer works well for the basics covered by the JHU course.
pollutantmean <- function(directory, pollutant, id = 1:332) {
csvfiles <- dir(directory, "\\.csv$", full.names = TRUE) # the pattern is a regex, not a glob
data <- lapply(csvfiles[id], read.csv)
numDataPoints <- 0L
total <- 0L
for (filedata in data) {
d <- filedata[[pollutant]] # relevant column data
d <- d[complete.cases(d)] # remove NA values
numDataPoints <- numDataPoints + length(d)
total <- total + sum(d)
}
total / numDataPoints
}
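The running total and count matter because the assignment asks for one pooled mean over all readings, which is generally not the same as averaging the per-file means. A small sketch with two made-up monitors of different sizes:

```r
a <- c(1, 2, 3, NA)      # hypothetical monitor with 3 valid readings
b <- c(10, NA)           # hypothetical monitor with 1 valid reading
monitors <- list(a, b)

# Pooled mean: every valid reading counts once, (1 + 2 + 3 + 10) / 4
pooled <- mean(unlist(monitors), na.rm = TRUE)

# Mean of per-monitor means: each monitor counts once, (2 + 10) / 2
of.means <- mean(sapply(monitors, mean, na.rm = TRUE))

c(pooled = pooled, of.means = of.means)
# pooled of.means 
#      4        6 
```

The loop above computes the pooled version, which is what the expected outputs in the question correspond to.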
It took me a couple of hours to work this out, but here is my (shorter) version:
pollutmean<- function(dir, pollutant, id=1:332) {
dir<- list.files(dir, full.names = T) #list files
dat<- data.frame() #make empty df
for (i in id) {
dat <- rbind(dat, read.csv(dir[i])) #rbind all files
}
mean(dat[,pollutant], na.rm = TRUE) #calculate mean of given column
}
pollutmean("assign/specdata", "sulfate", id=1:60)
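Growing a data frame with rbind() inside a loop copies the accumulated data on every iteration. A possible variant, sketched here on three made-up CSV files written to a temporary folder (the folder name and the data are assumptions), reads everything first and binds once:

```r
# Hypothetical data: three small CSV files in a temporary folder
tmp <- file.path(tempdir(), "specdata_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(sulfate = c(i, NA), ID = i),
            file.path(tmp, sprintf("%03d.csv", i)), row.names = FALSE)
}

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read the selected files into a list, then bind them in a single call
dat <- do.call(rbind, lapply(files[1:3], read.csv))

result <- mean(dat[, "sulfate"], na.rm = TRUE)   # valid readings are 1, 2, 3
result
# [1] 2
```

For 332 small files the difference is unlikely to matter much, so the loop version is fine too.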
I was taking the course as well, and came up with the following solution:
pollutantmean <- function(directory="d:/dev/r/documents/specdata", pollutant,
id) {
myfilename = paste(directory,"/",formatC(id, width=3, flag="0"),".csv",
sep="")
master = lapply(myfilename, read.table, header=TRUE, sep=",")
masterfile = do.call("rbind", master)
# Select the requested column by name rather than by position
mean(masterfile[[pollutant]], na.rm=TRUE)
}
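The key step in this version is building the zero-padded file names directly from the IDs instead of listing the directory. A short sketch of that step on its own, with made-up IDs and sprintf() shown as an equivalent alternative to formatC():

```r
id <- c(1L, 23L, 332L)                       # hypothetical monitor IDs

padded <- formatC(id, width = 3, flag = "0") # "001" "023" "332"
same <- sprintf("%03d", id)                  # the same padding via sprintf()

# file.path() joins the parts with "/" so no manual separator is needed
paths <- file.path("specdata", paste0(padded, ".csv"))
paths
# [1] "specdata/001.csv" "specdata/023.csv" "specdata/332.csv"
```

This approach assumes every requested ID has a file with exactly this naming scheme; read.csv will stop with an error for any ID that does not.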
Source: https://stackoverflow.com/questions/23640594/reading-multiple-files-and-calculating-mean-based-on-user-input