Question
I followed Hadley's thread, "Issue in Loading multiple .csv files into single dataframe in R using rbind", to read multiple CSV files and then convert them to one dataframe. I also experimented with lapply vs. sapply, as discussed in "Grouping functions (tapply, by, aggregate) and the *apply family".
Here's my first CSV file:
dput(File1)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
23L, 34L, 45L, 44L), Tax = c(23L, 21L, 22L, 24L, 25L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here's my second CSV file:
dput(File2)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
55L, 55L, 55L, 55L), Tax = c(24L, 24L, 24L, 24L, 24L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here's my code:
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
merged_file <- do.call(rbind, lapply(list(tc1,tc2), read.csv))
While this works beautifully, I wanted to change lapply to sapply. From the above thread, I realize that sapply would turn the factors read from the CSV files into matrices, but I am unsure why the fields are flipped. For instance, the Income field occupies row 3 and row 8 instead of a single column.
Here's the code:
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
# change lapply to sapply
merged_file <- do.call(rbind, sapply(list(tc1,tc2), read.csv))
Here's the output:
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    1    1    1
 [2,]    1    2    2    2    2
 [3,]   55   23   34   45   44
 [4,]   23   21   22   24   25
 [5,]    3    3    1    4    2
 [6,]    1    2    1    1    1
 [7,]    1    2    2    2    2
 [8,]   55   55   55   55   55
 [9,]   24   24   24   24   24
[10,]    3    3    1    4    2
I'd appreciate any help. I am fairly new to R and not sure what's going on.
Answer 1:
The issue is not specific to factors; it is the generic difference between sapply and lapply.
Why does sapply get it so wrong whereas lapply gets it right? Remember that in R, dataframes are lists of columns, and each column can have a distinct type.
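As a minimal sketch of the "list of columns" point, the snippet below builds a small dataframe and inspects it. The name df and its values are hypothetical, chosen only for illustration.

```r
# A small data frame with one factor column and one integer column
# (hypothetical values, for illustration only).
df <- data.frame(
  Location = factor(c("EMEA", "AP", "EMEA")),
  Income   = c(55L, 23L, 34L)
)

is.list(df)        # a data frame is a list ...
length(df)         # ... whose length is the number of columns
sapply(df, class)  # each column keeps its own class

# A matrix can hold only one type, so converting the data frame
# forces every column to a common type.
m <- as.matrix(df)
typeof(m)
```

Because one column is a factor, the matrix conversion falls back to character for every cell, which is the kind of type loss a dataframe avoids.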
- lapply returns a list of dataframes, which do.call hands to rbind. rbind then does the concatenation correctly: it keeps corresponding columns together, so your factors emerge intact.
- sapply, however, tries to simplify that list into a matrix:
  - a matrix can hold only one type, unlike a dataframe, so sapply cannot keep the dataframes whole. It returns a 6x2 matrix of mode list, with one cell per column of each 5x6 input dataframe.
  - do.call(rbind, ...) then passes each of those 12 cells to rbind as a separate vector, so every column becomes a row: an unwanted transpose.
  - rbind on plain vectors builds a matrix of a single type, so the factors are reduced to their integer codes (garbage!).
  - the result is one very-garbage 12x5 matrix in which a field such as Income occupies one row per file instead of a single column. (The output in the question has 10 rows because those files have only five columns and no row-name column.)
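The behaviour of sapply can be inspected directly. The sketch below reuses the two CSV strings from the question; read_txt is a hypothetical helper that reads from a string via the text argument instead of a textConnection, and it sets stringsAsFactors = TRUE explicitly because that stopped being the default in R 4.0.0.

```r
# The two CSV payloads from the question. The first header field is empty,
# so read.csv names that row-name column "X".
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

# Hypothetical helper: parse a CSV string, forcing factors as in the question.
read_txt <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

s <- sapply(list(dat1, dat2), read_txt)
dim(s)       # 6 2: one row per column name, one column per file
typeof(s)    # "list": each cell holds a whole column vector
rownames(s)  # the original column names

# do.call() passes the 12 cells to rbind() as 12 separate vectors,
# so each original column ends up as a row.
m <- do.call(rbind, s)
dim(m)       # 12 5
m[4, ]       # Income of the first file, now laid out as a row
```

With these strings, Income lands in rows 4 and 10 rather than rows 3 and 8, because the leading row-name column adds one row per file.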
Summary: just use lapply.
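A minimal sketch of the recommended approach, again assuming the CSV strings from the question and the same hypothetical read_txt helper. It also shows that sapply with simplify = FALSE skips the simplification step and so behaves like lapply here.

```r
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

# Hypothetical helper: parse a CSV string, forcing factors as in the question.
read_txt <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

files <- list(dat1, dat2)

# lapply keeps each data frame whole, so rbind stacks matching columns.
merged <- do.call(rbind, lapply(files, read_txt))
dim(merged)   # 10 6
str(merged)   # factor columns are still factors

# sapply without simplification returns the same plain list as lapply.
merged2 <- do.call(rbind, sapply(files, read_txt, simplify = FALSE))
identical(merged, merged2)
```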
Source: https://stackoverflow.com/questions/39666755/sapply-vs-lapply-while-reading-files-and-rbinding-them