When importing CSV into R how to generate column with name of the CSV?

情书的邮戳 2020-12-01 06:45

I have a large number of CSV files that I want to read into R. All the column headings in the CSVs are the same. At first I thought I would need to create a loop based on th

6 Answers
  • 2020-12-01 07:16

    Kinda messy but works:

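    filenames <- c("foo.csv", "bar.csv")
    # placeholders standing in for the imported data (4 and 6 rows)
    import.list <- list(matrix(NA, 4, 4), matrix(NA, 6, 4))
    
    # repeat each file name, minus its extension, once per row of its data
    source <- rep(sub("\\.csv$", "", filenames), times = sapply(import.list, nrow))
    
    source
    [1] "foo" "foo" "foo" "foo" "bar" "bar" "bar" "bar" "bar" "bar"
    
    # combined is assumed to be the row-bound data frame built from import.list
    combined$source <- source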
    
  • 2020-12-01 07:19

    This worked for me: it merges every CSV file in a folder and adds a new column identifying the source file.

    Using setNames():

    library(readr)       # read_csv()
    library(data.table)  # rbindlist()

    # pattern is a regular expression, not a shell glob
    file.list <- list.files(pattern = "\\.csv$")
    # name each path after itself so the list of data frames keeps the file names
    file.list <- setNames(file.list, file.list)
    
    df.list <- lapply(file.list, read_csv)
    # idcol puts the list names (the file names) into an "Issue" column
    df <- rbindlist(df.list, use.names = TRUE, fill = TRUE, idcol = "Issue")
    

    The Issue column holds the source file name for every row of the merged table.

  • 2020-12-01 07:22

    data.table solution

    Update: here is a complete data.table solution, using keep.rownames. It assumes all your CSVs are in one folder:

    library(data.table)
    my.path <- "C:/some/path/to/your/folder" # set the path
    filenames <- list.files(path = my.path, full.names = TRUE) # full paths of the files
    
    # simplify = FALSE keeps a named list of data frames for rbind;
    # keep.rownames = TRUE then creates an rn column holding the path plus a row index
    my.dt <- data.table(do.call("rbind", sapply(filenames, read.csv,
                      sep = ";", simplify = FALSE)), keep.rownames = TRUE)
    

    Basic syntax solution

    I used Grothendieck's solution and added a line to create a column from the row names. As simple as that:

    something <- do.call("rbind", sapply(filenames, read.csv, sep=";", simplify = FALSE)) 
    something$mycolumn <- row.names(something)
    

    If you only want part of the filename, replace the second line with something like this:

    something$mycolumn <- substring(row.names(something),1,3)
    

    This uses the first three characters of each row name as the value in the new column. Those are the start of the file name only when filenames holds bare file names rather than full paths.
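
    For full paths such as the ones built from my.path, fixed character positions are fragile. A minimal sketch of an alternative, using base R's basename() and tools::file_path_sans_ext(); the two paths and the name labels are made up for illustration:

    ```r
    # Hypothetical full paths, as list.files(full.names = TRUE) might return them.
    paths <- c("C:/some/path/to/your/folder/foo.csv",
               "C:/some/path/to/your/folder/bar.csv")

    # basename() drops the directory part; file_path_sans_ext() drops ".csv".
    labels <- tools::file_path_sans_ext(basename(paths))
    ```

    The resulting labels could then be repeated once per row of each file before being assigned to the new column.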

  • 2020-12-01 07:24

    Try this:

    do.call("rbind", sapply(filenames, read.csv, simplify = FALSE))
    

    The row names will indicate the source and line number.
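
    To turn those row names into a proper column, the trailing row index can be stripped off. A rough sketch, assuming two throwaway files in a temporary directory (the file names a.csv and b.csv and the column name source are illustrative only):

    ```r
    # Write two small CSV files into a temporary directory (illustrative data).
    tmp <- tempfile("csvdemo")
    dir.create(tmp)
    write.csv(data.frame(x = 1:2), file.path(tmp, "a.csv"), row.names = FALSE)
    write.csv(data.frame(x = 3:5), file.path(tmp, "b.csv"), row.names = FALSE)

    filenames <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
    # Name the vector by bare file name so the row names stay short.
    names(filenames) <- basename(filenames)

    combined <- do.call("rbind", sapply(filenames, read.csv, simplify = FALSE))

    # Row names look like "a.csv.1"; drop the trailing ".<row number>".
    combined$source <- sub("\\.[0-9]+$", "", row.names(combined))
    row.names(combined) <- NULL
    ```

    Note that rbind() only appends the row index when a file contributes more than one row, so the pattern above leaves single-row files unchanged.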

  • 2020-12-01 07:27

    Here is a solution using the import_list() function from rio, which is designed exactly for this purpose.

    # setup some example files to import
    rio::export(mtcars, "mtcars1.csv")
    rio::export(mtcars, "mtcars2.csv")
    rio::export(mtcars, "mtcars3.csv")
    

    The default behavior of import_list() is to get a list of data frames:

    str(rio::import_list(dir(pattern = "mtcars")), 1)
    ## List of 3
    ##  $ :'data.frame':       32 obs. of  11 variables:
    ##  $ :'data.frame':       32 obs. of  11 variables:
    ##  $ :'data.frame':       32 obs. of  11 variables:
    

    But you can use the rbind argument to instead construct a single data frame (note the _file column at the end):

    str(rio::import_list(dir(pattern = "mtcars"), rbind = TRUE))
    ## 'data.frame':   96 obs. of  12 variables:
    ##  $ mpg  : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ##  $ cyl  : int  6 6 4 6 8 6 8 4 4 6 ...
    ##  $ disp : num  160 160 108 258 360 ...
    ##  $ hp   : int  110 110 93 110 175 105 245 62 95 123 ...
    ##  $ drat : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    ##  $ wt   : num  2.62 2.88 2.32 3.21 3.44 ...
    ##  $ qsec : num  16.5 17 18.6 19.4 17 ...
    ##  $ vs   : int  0 0 1 1 0 1 0 1 1 1 ...
    ##  $ am   : int  1 1 1 0 0 0 0 0 0 0 ...
    ##  $ gear : int  4 4 4 3 3 3 3 4 4 4 ...
    ##  $ carb : int  4 4 1 1 2 1 4 2 2 4 ...
    ##  $ _file: chr  "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" ...
    

    Use the rbind_label argument to specify the name of the column that identifies each file:

    str(rio::import_list(dir(pattern = "mtcars"), rbind = TRUE, rbind_label = "source"))
    ## 'data.frame':   96 obs. of  12 variables:
    ##  $ mpg   : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ##  $ cyl   : int  6 6 4 6 8 6 8 4 4 6 ...
    ##  $ disp  : num  160 160 108 258 360 ...
    ##  $ hp    : int  110 110 93 110 175 105 245 62 95 123 ...
    ##  $ drat  : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    ##  $ wt    : num  2.62 2.88 2.32 3.21 3.44 ...
    ##  $ qsec  : num  16.5 17 18.6 19.4 17 ...
    ##  $ vs    : int  0 0 1 1 0 1 0 1 1 1 ...
    ##  $ am    : int  1 1 1 0 0 0 0 0 0 0 ...
    ##  $ gear  : int  4 4 4 3 3 3 3 4 4 4 ...
    ##  $ carb  : int  4 4 1 1 2 1 4 2 2 4 ...
    ##  $ source: chr  "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" "mtcars1.csv" ...
    

    For full disclosure: I am the maintainer of rio.

  • 2020-12-01 07:34

    You have already done all the hard work. With a fairly small modification this should be straightforward.

    The logic is:

    1. Create a small helper function that reads an individual csv and adds a column with the file name.
    2. Call this helper function in llply()

    The following should work:

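    library(plyr)  # provides ldply()

    # read one csv and record the file it came from
    read_csv_filename <- function(filename){
        ret <- read.csv(filename)
        ret$Source <- filename
        ret
    }
    
    # apply the helper to every file and row-bind the results into one data.frame
    import.list <- ldply(filenames, read_csv_filename)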
    

    Note that I have proposed another small improvement to your code: read.csv() returns a data.frame, which means you can use ldply() rather than llply().
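
    If plyr is not available, the same helper-function idea can be sketched in base R with lapply() and do.call(rbind, ...). This is only an illustration: the temporary files jan.csv and feb.csv, their contents, and the name import.df are assumptions, and basename() is used here so that Source holds the bare file name instead of the full path.

    ```r
    # Helper: read one CSV and record which file it came from.
    read_csv_filename <- function(filename) {
      ret <- read.csv(filename)
      ret$Source <- basename(filename)
      ret
    }

    # Illustrative input: two small files in a temporary directory.
    tmp <- tempfile("csvdemo")
    dir.create(tmp)
    write.csv(data.frame(id = 1:2, value = c(10, 20)),
              file.path(tmp, "jan.csv"), row.names = FALSE)
    write.csv(data.frame(id = 3L, value = 30),
              file.path(tmp, "feb.csv"), row.names = FALSE)

    filenames <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

    # lapply() gives a list of data frames; do.call(rbind, ...) stacks them.
    import.df <- do.call(rbind, lapply(filenames, read_csv_filename))
    ```

    list.files() returns the paths in alphabetical order, so the rows from feb.csv come before those from jan.csv in the result.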
