Question
I have a data frame as follows:
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30
I'd like to recast it with a single row for each Identifier and one column for each value in the Location column. I don't care about the data in V1, but I need the data in V2, which will become the values in the new columns.
Note that the Location column has repeated values for Identifiers 2 and 3.
I assume that the first task is to make the values in the Location column unique within each Identifier.
I used the following (the data frame is called “Test”):
L<-length(Test$Identifier)
for (i in 1:L)
{
temp<-Test$Location[Test$Identifier==i]
temp1<-make.unique(as.character(temp), sep="-")
levels(Test$Location)=c(levels(Test$Location),temp1)
Test$Location[Test$Identifier==i]=temp1
}
This produces
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B-1 23
3 43 A 10
3 43 B 17
3 43 A-1 18
3 43 B-1 20
3 43 C 25
3 43 A-2 30
Then using
cast(Test, Identifier ~ Location)
gives
Identifier A B C B-1 A-1 A-2
1 21 24 NA NA NA NA
2 NA 15 18 23 NA NA
3 10 17 25 20 18 30
And this is more or less what I want.
My questions are:
1. Is this the right way to handle the problem?
2. I know R people tend to avoid the “for” construction, so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector, and the code above takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time, despite increasing the memory limit; all the cast outputs were then merged.
3. Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C?
Please be gentle with your replies!
Answer 1:
Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:
DF <- read.table(text="Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header=TRUE, stringsAsFactors=FALSE)
# Note: stringsAsFactors=FALSE keeps Location as character rather than factor.
# If your column is a factor, convert it with as.character() first.
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
# Identifier A B B-1 C A-1 A-2
#1 1 21 24 NA NA NA NA
#2 2 NA 15 23 18 NA NA
#3 3 10 17 20 25 18 30
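To see where the suffixes come from, the sketch below rebuilds them in base R from a zero-based occurrence counter within each Identifier/Location pair and compares the result with make.unique applied per Identifier. It is only an illustration, and it assumes the original Location values do not already look like generated names (for example a real location called "B-1"), a case make.unique handles but this shortcut does not. The names occ, Loc2 and ref are introduced here for the example.

```r
# Sample data from the question (V1 omitted because it is not needed)
DF <- data.frame(
  Identifier = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  Location = c("A", "B", "B", "C", "B", "A", "B", "A", "B", "C", "A"),
  V2 = c(21, 24, 15, 18, 23, 10, 17, 18, 20, 25, 30),
  stringsAsFactors = FALSE
)

# Zero-based occurrence counter within each Identifier/Location pair
occ <- ave(seq_along(DF$Location), DF$Identifier, DF$Location,
           FUN = seq_along) - 1L

# The first occurrence keeps its name; later ones get a "-n" suffix
Loc2 <- ifelse(occ == 0L, DF$Location, paste(DF$Location, occ, sep = "-"))

# Reference: make.unique applied separately to each Identifier
ref <- ave(DF$Location, DF$Identifier,
           FUN = function(x) make.unique(x, sep = "-"))

# Check that both approaches agree on this sample
identical(Loc2, ref)
```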
If column order is important to you (usually it isn't):
DFwide[, c(1, order(names(DFwide)[-1])+1)]
# Identifier A A-1 A-2 B B-1 C
#1 1 21 NA NA 24 NA NA
#2 2 NA NA NA 15 23 18
#3 3 10 18 30 17 20 25
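One caveat with ordering by name: once a location repeats ten or more times within an Identifier, a plain character sort would typically place "A-10" before "A-2". A minimal sketch of a suffix-aware ordering follows. The helper order_suffixed is hypothetical (it is not part of plyr or reshape2), and it assumes every generated name has the form "base" or "base-n" with sep = "-" and that the base names themselves do not end in a hyphen followed by digits.

```r
# Hypothetical helper: order names of the form "base" or "base-n"
# by base first, then by the numeric suffix n.
order_suffixed <- function(nms) {
  has_suffix <- grepl("-[0-9]+$", nms)
  base <- sub("-[0-9]+$", "", nms)
  suffix <- integer(length(nms))  # 0 for names without a suffix
  suffix[has_suffix] <- as.integer(sub("^.*-", "", nms[has_suffix]))
  order(base, suffix)
}

nms <- c("A", "B", "B-1", "C", "A-1", "A-2", "A-10")
nms[order_suffixed(nms)]
# "A" "A-1" "A-2" "A-10" "B" "B-1" "C"
```

With the wide data frame from above, this could be used as DFwide[, c(1, order_suffixed(names(DFwide)[-1]) + 1)].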
Answer 2:
For reference, here's the equivalent of @Roland's answer in base R.
Use ave to create the unique "Location" values:
DF$Location <- with(DF, ave(Location, Identifier,
FUN = function(x) make.unique(x, sep = "-")))
Then use reshape to change the structure of your data:
## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you
## wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier",
timevar = "Location", drop = "V1")
# Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1 1 21 24 NA NA NA NA
# 3 2 NA 15 18 23 NA NA
# 6 3 10 17 25 20 18 30
Reordering the columns can be done the same way that @Roland suggested.
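Putting the base R steps together, here is a self-contained sketch that also strips the "V2." prefix that reshape adds and reorders the columns by name. The variable name wide and the prefix-stripping step are additions for illustration, not part of the answers above.

```r
# Sample data from the question
DF <- data.frame(
  Identifier = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  V1 = c(12, 12, 20, 20, 20, 43, 43, 43, 43, 43, 43),
  Location = c("A", "B", "B", "C", "B", "A", "B", "A", "B", "C", "A"),
  V2 = c(21, 24, 15, 18, 23, 10, 17, 18, 20, 25, 30),
  stringsAsFactors = FALSE
)

# Make Location unique within each Identifier
DF$Location <- with(DF, ave(Location, Identifier,
                            FUN = function(x) make.unique(x, sep = "-")))

# Long to wide, keeping only V2
wide <- reshape(DF, direction = "wide", idvar = "Identifier",
                timevar = "Location", drop = "V1")

# Drop the "V2." prefix so the columns are named after the locations
names(wide) <- sub("^V2\\.", "", names(wide))

# Keep Identifier first and sort the remaining columns by name
wide <- wide[, c(1, order(names(wide)[-1]) + 1)]
rownames(wide) <- NULL
wide
```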
Source: https://stackoverflow.com/questions/23980046/forcing-unique-values-before-casting-pivoting-in-r