sapply vs. lapply while reading files and rbind'ing them

Posted by 天大地大妈咪最大 on 2019-12-08 06:09:19

Question


I followed Hadley's thread, "Issue in Loading multiple .csv files into single dataframe in R using rbind", to read multiple CSV files and then combine them into one dataframe. I also experimented with lapply vs. sapply, as discussed in "Grouping functions (tapply, by, aggregate) and the *apply family".

Here's my first CSV file:

dput(File1)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A", 
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L, 
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L, 
23L, 34L, 45L, 44L), Tax = c(23L, 21L, 22L, 24L, 25L), Location = structure(c(3L, 
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name", 
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA, 
-5L))

Here's my second CSV file:

dput(File2)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A", 
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L, 
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L, 
55L, 55L, 55L, 55L), Tax = c(24L, 24L, 24L, 24L, 24L), Location = structure(c(3L, 
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name", 
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA, 
-5L))

Here's my code:

dat1 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)

merged_file <- do.call(rbind, lapply(list(tc1,tc2), read.csv))

While this works beautifully, I wanted to change lapply to sapply. From the thread above, I understand that sapply would turn the factors read from the CSV files into a matrix, but I am unsure why the fields are flipped. For instance, the Income field occupies rows 3 and 8 instead of forming a single column.

Here's the code:

tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)

# change lapply to sapply    
merged_file <- do.call(rbind, sapply(list(tc1,tc2), read.csv))

Here's the output:

    [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    1    1    1
 [2,]    1    2    2    2    2
 [3,]   55   23   34   45   44
 [4,]   23   21   22   24   25
 [5,]    3    3    1    4    2
 [6,]    1    2    1    1    1
 [7,]    1    2    2    2    2
 [8,]   55   55   55   55   55
 [9,]   24   24   24   24   24
[10,]    3    3    1    4    2

I'd appreciate any help. I am fairly new to R and not sure what's going on.


Answer 1:


The issue has nothing to do with factors; it's the generic difference between sapply and lapply. Why does sapply get it so wrong whereas lapply gets it right? Remember that in R, dataframes are lists of columns, and each column can have a distinct type.

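  • lapply returns a plain list of dataframes. do.call hands each dataframe to rbind as one argument, so rbind uses its dataframe method and does the concatenation correctly. It keeps corresponding columns together. So your factors emerge correctly.
  • sapply however...
    • tries to simplify that list. A dataframe's length is its number of columns, so both results have the same length and sapply flattens them into a matrix of mode list, with one cell per column of each file
    • do.call then passes every cell (every individual column vector) to rbind as a separate argument
    • rbind on plain vectors makes each vector a row, so every input column becomes an output row (the unwanted transpose)...
    • ...and since matrices can only have one type, unlike dataframes, factors are replaced by their internal integer codes (garbage!).
    • so with the dat1/dat2 strings above, your two input dataframes of 5 rows and 6 columns (the leading comma in the header adds a column named X) end up as one very-garbage 12x5 matrix. The 10-row output in the question comes from files with 5 columns. Since columns have become rows, a column of the matrix mixes values from different fields, and obviously your factors are messed up.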
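
To see what each call actually hands to rbind, it can help to inspect the intermediate results. The following is only a sketch: read_one, as_list, as_matrix and flipped are names made up for illustration, read.csv(text = ...) stands in for the textConnection objects in the question, and stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later no longer convert strings to factors by default.

```r
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

# Hypothetical helper: parse one CSV string, keeping strings as factors
# (the pre-4.0.0 default that the question relied on).
read_one <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

as_list   <- lapply(list(dat1, dat2), read_one)  # list of two data frames
as_matrix <- sapply(list(dat1, dat2), read_one)  # simplified by sapply

sapply(as_list, class)  # both elements are still data frames
typeof(as_matrix)       # "list": a matrix whose cells are column vectors
dim(as_matrix)          # one row per column (incl. X), one column per file

# do.call() passes every cell of as_matrix to rbind() as its own argument,
# so each original column turns into one row of the result.
flipped <- do.call(rbind, as_matrix)
dim(flipped)
flipped[4, ]            # Income values of the first file, laid out as a row
```

With the leading X column included, Income lands in rows 4 and 10 of flipped here, instead of rows 3 and 8 as in the question's output.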

Summary: just use lapply
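
For completeness, here is a minimal sketch of the lapply route on the same sample data. The source_id column and the names inputs and frames are hypothetical additions, included only to show that per-file processing can happen inside the lapply call before rbind; stringsAsFactors = TRUE is again an assumption to match the factor columns in the question.

```r
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

inputs <- list(dat1, dat2)

# lapply always returns a list, so each element stays a complete data frame.
frames <- lapply(seq_along(inputs), function(i) {
  df <- read.csv(text = inputs[[i]], stringsAsFactors = TRUE)
  df$source_id <- i  # hypothetical column: which input the row came from
  df
})

# rbind receives two data frames and stacks them row-wise, column by column.
merged_file <- do.call(rbind, frames)

dim(merged_file)      # rows from both inputs, original columns plus source_id
merged_file$Income    # Income is a single column again
```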



Source: https://stackoverflow.com/questions/39666755/sapply-vs-lapply-while-reading-files-and-rbinding-them
