问题
This is a follow up on my last question (Rbinding large list of dataframes after I did some data cleaning on the list). I've gotten smarter and the former question got messy.
I have 43 xlsx files which I loaded in to a list in R:
file.list <- list.files(recursive=T,pattern='*.xlsx')
dat = lapply(file.list, function(i){
x = read_xlsx(i, sheet=1, col_names = T)
# Create column with file name
x$file = i
# Return data
x
})
I then added some column names:
my_names <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
"IART", "HTRANS", "DTRANS", "BELOB", "REGD",
"BOGFD", "VALORD", "UDLIGND",
"UÅ", "AFSTEMNGL", "NRBASIS","FIBILAG", "FILE")
dat <- lapply(dat, setNames, my_names)
Then I removed some columns:
dat <- lapply(dat, function(x) { x["UÅ"] <- NULL; x })
dat <- lapply(dat, function(x) { x["FIBILAG"] <- NULL; x })
I don't really need to remove them, but when I tried to merge the dataframes I kept getting errors about the class of these. So I just removed them.
Then I changes all my columns to character. I'm kinda new to R, so I'm aware that this code isn't very sexy and you probably could have made a loop or just one function for this. But this is what I did:
dat <- lapply(dat, function(x) { x["ID"] <- as.character(x["ID"]); x })
[I did the same for all columns]
And then I go and bind the data.
df <- rbindlist(dat)
EDIT:
I've found that it isn't my binding methods that's the problem (thank you for your inputs on that). I've deleted the part about the binding methods.
The problem lies in how I change the coltypes on my columns in my dataframes within the list.
I've also tried:
dat <- lapply(dat, function(x) { x[,"ID"] <- as.character(x[,"ID"]); x })
I added a comma before "ID". This didn't help. I feel like I need to use unlist, but I'm not sure how to use it here?
回答1:
If you have a list of data frames dat
i would recommend
df <- dplyr::bind_rows(dat)
to row bind them to one big data frame.
Alternatively you should use the purrr map family and return a row binded data.frame directly?
df <- purrr::map_dfr(file.list, function(x) readxl::read_xslx(x))
回答2:
I found a solution!!
Thank you for your help!
Apparently the problem wasn't in the binding of the data frames in the nested list. The problem was that I was changing the column types the wrong way.
Here's my code - and it works! And it's super faster than the other one!
file.list <- list.files(recursive=T,pattern='*.xlsx')
dat = lapply(file.list, function(i){
x = read_xlsx(i, sheet=1, col_names = T)
# Create column with file name
x$file = i
# Return data
x
})
# Setting column names
my_names <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
"IART", "HTRANS", "DTRANS", "BELOB", "REGD",
"BOGFD", "VALORD", "UDLIGND",
"UÅ", "AFSTEMNGL", "NRBASIS","FIBILAG", "FILE")
dat <- lapply(dat, setNames, my_names)
# Removing problematic columns
dat <- lapply(dat, function(x) { x["UÅ"] <- NULL; x })
dat <- lapply(dat, function(x) { x["FIBILAG"] <- NULL; x })
dat2 <- lapply(dat, function(df) setDT(df)[, (1:16) := lapply(.SD, as.character), .SDcols = 1:16])
# Merging
df <- rbindlist(dat2)
Oh and several people told me to use bind_rows instead (@atomman and @Probel)
And I want to give credit to the person whom I stole the first part from, but I can't remember...
来源:https://stackoverflow.com/questions/54591376/binding-dataframes-in-list-after-data-cleaning-on-list