Reading multiple files and calculating mean based on user input

Question

I am trying to write a function in R which takes 3 inputs:

Directory
pollutant
id

I have a directory on my computer containing more than 300 CSV files. What the function should do is shown in the prototype below:

pollutantmean <- function(directory, pollutant, id = 1:332) {
        ## 'directory' is a character vector of length 1 indicating
        ## the location of the CSV files

        ## 'pollutant' is a character vector of length 1 indicating
        ## the name of the pollutant for which we will calculate the
        ## mean; either "sulfate" or "nitrate".

        ## 'id' is an integer vector indicating the monitor ID numbers
        ## to be used

        ## Return the mean of the pollutant across all monitors listed
        ## in the 'id' vector (ignoring NA values)
        }

An example output of this function is shown here:

source("pollutantmean.R")
pollutantmean("specdata", "sulfate", 1:10)

## [1] 4.064

pollutantmean("specdata", "nitrate", 70:72)

## [1] 1.706

pollutantmean("specdata", "nitrate", 23)

## [1] 1.281

I can read the whole thing in one go by:

path = "C:/Users/Sean/Documents/R Projects/Data/specdata"
fileList = list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
all.files.data = lapply(fileList, read.csv, header = TRUE)
DATA = do.call("rbind", all.files.data)

My issues are:

The user enters id either as a single value or as a range. For example, the user enters 1 but the file name is 001.csv, or the user enters the range 1:10 and the file names are 001.csv ... 010.csv.
The column is entered by the user, i.e. "sulfate" or "nitrate", whichever he/she wants the mean of. There are a lot of missing values in these columns, which I need to omit before calculating the mean.

The combined data from all the files looks like this:

summary(DATA)
         Date           sulfate          nitrate             ID       
 2004-01-01:   250   Min.   : 0.0     Min.   : 0.0     Min.   :  1.0  
 2004-01-02:   250   1st Qu.: 1.3     1st Qu.: 0.4     1st Qu.: 79.0  
 2004-01-03:   250   Median : 2.4     Median : 0.8     Median :168.0  
 2004-01-04:   250   Mean   : 3.2     Mean   : 1.7     Mean   :164.5  
 2004-01-05:   250   3rd Qu.: 4.0     3rd Qu.: 2.0     3rd Qu.:247.0  
 2004-01-06:   250   Max.   :35.9     Max.   :53.9     Max.   :332.0  
 (Other)   :770587   NA's   :653304   NA's   :657738

Any idea how to formulate this would be highly appreciated...

Cheers

Answer 1:

You can simulate your situation like this:

# Simulate some data:
# Create 332 data frames
set.seed(1)
df.list <- replicate(332, data.frame(sulfate = rnorm(100), nitrate = rnorm(100)), simplify = FALSE)
# Make sure the output directory exists before writing to it
dir.create('specdata', showWarnings = FALSE)
# Generate names like 001.csv and 010.csv
file.names <- paste0('specdata/', sprintf('%03d', 1:332), '.csv')
# Write them to disk
invisible(mapply(write.csv, df.list, file.names))

And here is a function that would read those files:

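pollutantmean <- function(directory, pollutant, id = 1:332) {
  file.names <- list.files(directory)
  # Numeric part of each file name, e.g. "001.csv" -> 1
  file.numbers <- as.numeric(sub('\\.csv$', '', file.names))
  # Keep only the requested ids that actually exist on disk
  selected.files <- na.omit(file.names[match(id, file.numbers)])
  selected.dfs <- lapply(file.path(directory, selected.files), read.csv)
  # unlist() also works when the files have different numbers of rows
  mean(unlist(lapply(selected.dfs, function(x) x[, pollutant])), na.rm = TRUE)
}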

pollutantmean('specdata','nitrate',c(1:100,141))
# [1] -0.005450574

Answer 2:

User enters id either as a single value or as a range, e.g. suppose the user enters 1 but the file name is 001.csv, or the user enters the range 1:10 and the file names are 001.csv ... 010.csv

You could use a regular expression and the gsub function to remove leading zeros from the file names, then make a dictionary (in R, a named vector) that maps the modified file names to the actual file names. For example, if your file names are in a character vector, fnames:

fnames = c("001.csv","002.csv")
names(fnames) <- gsub(pattern="^[0]*", replacement="", x=fnames)

With this, the vector fnames is converted to a dictionary, letting you call up the file named 001.csv with something along the lines of fnames["1.csv"]. You can also use gsub() to remove the .csv part of the file name.
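
A minimal sketch of that lookup, taken one step further so the key is the bare monitor number. It assumes the files are named with zero-padded numbers as in the question; `file_ids`, `lookup`, and `selected` are illustrative names, and `tools::file_path_sans_ext()` is used here to drop the extension instead of a second gsub() call:

```r
# Hypothetical file names, zero-padded as in the question
fnames <- c("001.csv", "002.csv", "010.csv")

# Strip the extension, then convert to integer, which drops leading zeros
file_ids <- as.integer(tools::file_path_sans_ext(fnames))

# Named vector acting as a dictionary: monitor id -> actual file name
lookup <- setNames(fnames, file_ids)

# Index by character so that names are used, not positions
selected <- unname(lookup[as.character(c(1, 10))])
print(selected)
```

Indexing with as.character() matters here: a numeric index such as lookup[10] would be read as a position, not as the name "10".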

Column is entered by the user, i.e. "sulfate" or "nitrate", whichever he/she wants the mean of. There are a lot of missing values in these columns, which I need to omit before calculating the mean.

Many R functions have an argument for ignoring NA, the special value that marks missing data; for mean() it is na.rm. Enter help(mean) at the R prompt for details.
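
As a small illustration with made-up numbers (the vector `x` is not taken from the real data):

```r
# A made-up vector with missing values
x <- c(1.5, NA, 2.5, NA, 5.0)

# By default a single NA makes the result NA
mean_with_na <- mean(x)

# na.rm = TRUE drops the missing values before averaging
mean_without_na <- mean(x, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```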

Answer 3:

This is how I fixed it:

pollutantmean <- function(directory, pollutant, id = 1:332) {
    #set the path
    path = directory

    #get the file List in that directory
    fileList = list.files(path)

    #extract the file names and store as numeric for comparison
    file.names = as.numeric(sub("\\.csv$","",fileList))

    #select files to be imported based on the user input or default
    selected.files = fileList[match(id,file.names)]

    #import data
    Data = lapply(file.path(path,selected.files),read.csv)

    #convert into data frame
    Data = do.call(rbind.data.frame,Data)

    #calculate mean
    mean(Data[,pollutant],na.rm=TRUE)

    }

My last question: my function should take "specdata" (the name of the directory where all the CSV files are located) as the directory. Is there a directory type object in R?

Suppose I call the function as:

pollutantmean(specdata, "nitrate", 1:10)

It should get the path of the specdata directory, which is in my working directory. How can I do that?
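
As far as I know, base R has no dedicated directory type: a path is an ordinary character string, so the name has to be passed quoted, as "specdata". Functions such as list.files() resolve a relative path against the working directory, so a quoted name is usually enough. A small sketch, where `resolve_dir` is a hypothetical helper and a temporary directory stands in for the real working directory:

```r
# Hypothetical helper: build the full path for a directory name and check it exists
resolve_dir <- function(directory, base = getwd()) {
  path <- file.path(base, directory)
  if (!dir.exists(path)) stop("directory not found: ", path)
  path
}

# Demonstration in a temporary location rather than the real working directory
base <- tempdir()
dir.create(file.path(base, "specdata"), showWarnings = FALSE)

spec_path <- resolve_dir("specdata", base = base)
print(basename(spec_path))
```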

Answer 4:

Here is a solution that even your grandmother could understand:

pollutantmean <- function(directory, pollutant, id = 1:332) {

  # Break this function up into a series of smaller functions
  # that do exactly what you expect them to. Your friends
  # will love you for it.

  csvFiles = getFilesById(id, directory)

  dataFrames = readMultipleCsvFiles(csvFiles)

  dataFrame = bindMultipleDataFrames(dataFrames)

  getColumnMean(dataFrame, column = pollutant)
}


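getFilesById <- function(id, directory = getwd()) {
  # Note: 'id' is used as a position in the sorted file listing, so this
  # assumes the files are numbered consecutively from 001 with no gaps
  allFiles = list.files(directory)
  file.path(directory, allFiles[id])
}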

readMultipleCsvFiles <- function(csvFiles) {
  lapply(csvFiles, read.csv)
}

bindMultipleDataFrames <- function(dataFrames) {
  do.call(rbind, dataFrames)
}

getColumnMean <- function(dataFrame, column, ignoreNA = TRUE) {
  mean(dataFrame[ , column], na.rm = ignoreNA)
}

Answer 5:

Here's a somewhat general function for calculating the mean for a specific column over a list of files. Not sure how id should be set up, but right now it acts as an indexing vector (i.e. id = 1:3 calculates the mean for the first three files in the file list).

multifile.means <- function(directory = getwd(), pollutant, id = NULL)
{
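    ## 'directory' must exist; match.arg() against list.files() would
    ## reject the full path returned by the getwd() default
    stopifnot(dir.exists(directory))
    d <- directory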
    cn <- match.arg(pollutant,  c('sulfate', 'nitrate'))
    ## get a vector of complete file paths in the given 'directory'
    p <- dir(d, full.names = TRUE)
    ## subset 'p' based on 'id' values
    if(!is.null(id)){
        id <- id[!id > length(p)]
        p <- p[id]
    }
    ## read, store, and name the relevant columns
    cl <- sapply(p, function(x){ read.csv(x)[,cn] }, USE.NAMES = FALSE)
    colnames(cl) <- basename(p)
    ## return a named list of some results
    list(values = cl, 
         mean = mean(cl, na.rm = TRUE), 
         colMeans = colMeans(cl, na.rm = TRUE))
}

Take it for a test-drive:

> multifile.means('testDir', 'sulfate')
# $values
#      001.csv 057.csv 146.csv 213.csv
# [1,]       5      10      NA       9
# [2,]       1       1      10       3
# [3,]      10       4      10       2
# [4,]       3      10       9      NA
# [5,]       4       1       5       5

# $mean
# [1] 5.666667

# $colMeans
# 001.csv 057.csv 146.csv 213.csv 
#    4.60    5.20    8.50    4.75
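
Using id as a position and using it as a monitor number give different results whenever the numbering has gaps, as in the test directory above. A short sketch of the difference, with illustrative variable names and the same four file names:

```r
# Directory listing with gaps in the numbering
files <- c("001.csv", "057.csv", "146.csv", "213.csv")
id <- c(1, 57)

# Positional indexing: there is no 57th element, so NA appears
by_position <- files[id]

# Matching on the numeric part of the name finds the intended files
file_numbers <- as.integer(sub("\\.csv$", "", files))
by_number <- files[match(id, file_numbers)]

print(by_position)
print(by_number)
```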

Answer 6:

The selected answer looks good but here's an alternative. This answer works well for the basics covered by the JHU course.

pollutantmean <- function(directory, pollutant, id = 1:332) {
    csvfiles <- dir(directory, "\\.csv$", full.names = TRUE)
    data <- lapply(csvfiles[id], read.csv)
    numDataPoints <- 0L
    total <- 0L
    for (filedata in data) {
        d <- filedata[[pollutant]] # relevant column data
        d <- d[complete.cases(d)] # remove NA values
        numDataPoints <- numDataPoints + length(d)
        total <- total + sum(d)
    }
    total / numDataPoints
}

Answer 7:

It took me a couple of hours to work this out, but here is my (shorter) version

pollutmean<- function(dir, pollutant, id=1:332) {
  dir<- list.files(dir, full.names = TRUE)  #list files
  dat<- data.frame()                        #make empty df
  for (i in id) {
    dat <- rbind(dat, read.csv(dir[i]))     #rbind all files
  }
  mean(dat[,pollutant], na.rm = TRUE)       #calculate mean of given column
}

pollutmean("assign/specdata", "sulfate", id=1:60)

Answer 8:

I was taking the course as well, and came up with the following solution:

pollutantmean <- function(directory = "d:/dev/r/documents/specdata",
                          pollutant, id) {
  # Build zero-padded file names such as 001.csv from the monitor ids
  myfilename <- paste(directory, "/", formatC(id, width = 3, flag = "0"), ".csv",
                      sep = "")
  master <- lapply(myfilename, read.table, header = TRUE, sep = ",")
  masterfile <- do.call("rbind", master)

  # Column 2 holds sulfate and column 3 holds nitrate
  if (pollutant == "sulfate") {
    result <- mean(masterfile[[2]], na.rm = TRUE)
  }
  if (pollutant == "nitrate") {
    result <- mean(masterfile[[3]], na.rm = TRUE)
  }
  result
}

Source: https://stackoverflow.com/questions/23640594/reading-multiple-files-and-calculating-mean-based-on-user-input

Tags

function

subset

mean

missing-data