Add “filename” column to table as multiple files are read and bound

轮回少年 2020-11-27 20:54

I have numerous CSV files in multiple directories that I want to read into an R tibble or data.table. I use "list.files()" with the recursive argument set to TRUE to create...

6 Answers
  • 2020-11-27 20:59

    You could use purrr::map2 here, which works similarly to mapply:

    library(purrr)
    library(dplyr)
    library(readr)
    library(stringr)  # provides str_extract

    # path and fileptrn are assumed to be defined as in the question
    filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
    sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")  # same length as filenames

    stopifnot(length(filenames) == length(sites))  # stops with an error if the lengths differ
    # .x is an element of filenames, and .y is the matching element of sites
    ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y))
    

    The output of map2 is a list, just like the output of lapply.

    If you have a development version of purrr, you can use imap, a wrapper around map2 that supplies each element's name (or its position, if the input is unnamed) as the second argument.
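
    As a rough illustration of the imap idea, here is a minimal sketch. imap is part of current CRAN releases of purrr, so a development version should no longer be needed. The temporary directory, the two demo files and the names demo_dir and named_files are made up for this example and are not part of the question.

    ```r
    library(purrr)
    library(readr)
    library(dplyr)

    # Hypothetical demo data: two small CSV files in a temporary directory
    demo_dir <- file.path(tempdir(), "imap_demo")
    dir.create(demo_dir, showWarnings = FALSE)
    write_csv(tibble(x = 1:2), file.path(demo_dir, "AB-001.csv"))
    write_csv(tibble(x = 3:5), file.path(demo_dir, "CD-002.csv"))

    filenames <- list.files(demo_dir, full.names = TRUE, pattern = "\\.csv$")

    # Name each path by its file name; imap() passes the element as .x and its name as .y
    named_files <- set_names(filenames, basename(filenames))
    ans <- imap(named_files, ~ read_csv(.x) %>% mutate(id = .y))

    # Stack the list of tibbles into one table
    tbl <- bind_rows(ans)
    ```

    Naming the vector first means no separate sites vector has to be kept in step with filenames; any other label could be used in place of basename(filenames).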

  • 2020-11-27 21:02

    You just need to write your own function that reads the csv and adds the column you want, before combining them.

    library(readr)
    library(stringr)
    library(dplyr)

    # Read one csv and prepend a Site column extracted from its path
    my_read_csv <- function(x) {
      out <- read_csv(x)
      site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
      cbind(Site=site, out)
    }

    filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
    tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
    
  • 2020-11-27 21:03

    The tidyverse provides an elegant solution. I like to use the full file path as the file name (which can later be truncated, if desired).

    An example loading .csv files:

    library(tidyverse); library(fs)
    

    specify file path

    data_dir <- path("file/directory")
    data_list = fs::dir_ls(data_dir, regexp = "\\.csv$")
    

    read data

    my_data = data_list %>% 
      purrr::map_dfr(read_csv, .id = "source")
    

    tidy the source column

    my_data_renamed <- my_data %>% 
      dplyr::mutate(source = stringr::str_replace(source, "text-to-replace", "new-text"))
    # source is the column holding each row's file path
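
    To truncate the full path down to the file name, base R may be enough. The sketch below uses made-up paths (source_paths is a hypothetical stand-in for the values of the source column):

    ```r
    # Hypothetical full paths, standing in for the values of the source column
    source_paths <- c("file/directory/AB-001_2019.csv",
                      "file/directory/sub/CD-002_2020.csv")

    # Keep only the file name, dropping the directory part
    file_name <- basename(source_paths)

    # Optionally drop the extension as well
    file_stem <- tools::file_path_sans_ext(file_name)
    ```

    Inside a pipeline the same call could go into the mutate step, for example dplyr::mutate(source = basename(source)).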
    
  • 2020-11-27 21:04

    data.table approach:

    If you name the list, then you can use this name to add to the data.table when binding the list together.

    workflow

    library(data.table)

    # "whatever..." is a placeholder for your own list.files() arguments
    files <- list.files( whatever... )
    # read the files from the list
    l <- lapply( files, fread )
    # name the list using the basename of each entry in `files`;
    # this is also the step to manipulate the file names into whatever you like
    names(l) <- basename( files )
    # bind the rows from the list together, putting the file names into the column "id"
    dt <- rbindlist( l, idcol = "id" )
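
    To see the naming mechanism in isolation, here is a small sketch that skips the file reading and uses two hypothetical in-memory tables instead (the list names a.csv and b.csv are invented for the example):

    ```r
    library(data.table)

    # A named list of tables, standing in for the result of lapply(files, fread)
    l <- list(
      "a.csv" = data.table(x = 1:2),
      "b.csv" = data.table(x = 3L)
    )

    # idcol = "id" adds a first column holding the list name each row came from
    dt <- rbindlist(l, idcol = "id")
    ```

    If the list is left unnamed, rbindlist fills the id column with the position of each table in the list instead, which is why the naming step matters.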
    
  • 2020-11-27 21:23

    I generally use the following approach, based on dplyr/tidyr:

    data = tibble(File = files) %>%
        extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
        mutate(Data = lapply(File, read_csv)) %>%
        unnest(Data) %>%
        select(-File)
    
  • 2020-11-27 21:25

    You can build a vector of file names from "sites" that has exactly as many elements as tbl has rows, and then combine the two using cbind. Each csv is read only once here, and the row counts are taken from that list.

    ### Get file names
    filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
    sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")

    ### Read each csv once
    tbl_list <- lapply(filenames, read_csv)

    ### Get length of each csv
    file_lengths <- unlist(lapply(tbl_list, nrow))

    ### Repeat sites using lengths
    file_names <- rep(sites, file_lengths)

    ### Create table
    tbl <- tbl_list %>% 
      bind_rows()

    ### Combine file_names and tbl
    tbl <- cbind(tbl, filename = file_names)
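
    The key step is rep with a vector of counts. A tiny sketch with made-up site codes and row counts shows what it produces:

    ```r
    # Hypothetical site codes and per-file row counts
    sites <- c("AB-001", "CD-002", "EF-003")
    file_lengths <- c(2, 0, 3)

    # Each site is repeated as many times as its file has rows;
    # a file with zero rows contributes nothing
    file_names <- rep(sites, file_lengths)
    ```

    The result has sum(file_lengths) elements, in the same order in which bind_rows stacks the files, so it lines up row for row with the combined table.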
    