Forcing unique values before casting (pivoting) in R

旧巷老猫 submitted on 2019-12-11 09:29:41

Question


I have a data frame as follows

Identifier  V1  Location    V2
1   12  A   21
1   12  B   24
2   20  B   15
2   20  C   18
2   20  B   23
3   43  A   10
3   43  B   17
3   43  A   18
3   43  B   20
3   43  C   25
3   43  A   30

I'd like to recast it so that there is a single row for each Identifier and one column for each value in the current Location column. I don't care about the data in V1, but I need the data in V2, which will become the values in the new columns.

Note that for the Location column there are repeated values for Identifiers 2 and 3.

I ASSUME that the first task is to make the values in the Location column unique.

I used the following (the data frame is called “Test”)

L <- length(Test$Identifier)
for (i in 1:L)
{
  temp <- Test$Location[Test$Identifier == i]
  temp1 <- make.unique(as.character(temp), sep = "-")
  levels(Test$Location) <- c(levels(Test$Location), temp1)
  Test$Location[Test$Identifier == i] <- temp1
}

This produces

Identifier  V1  Location    V2
1   12  A   21
1   12  B   24
2   20  B   15
2   20  C   18
2   20  B-1 23
3   43  A   10
3   43  B   17
3   43  A-1 18
3   43  B-1 20
3   43  C   25
3   43  A-2 30

Then using

cast(Test, Identifier ~ Location)

gives

Identifier  A   B   C   B-1 A-1 A-2
1   21  24  NA  NA  NA  NA
2   NA  15  18  23  NA  NA
3   10  17  25  20  18  30

And this is more or less what I want.

My questions are

Is this the right way to handle the problem?

I know R people don't use the "for" construction, so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector, and the loop takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time, despite increasing the memory limit. All the cast outputs were then merged.

Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C?

Please be gentle with your replies!


Answer 1:


Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:

DF <- read.table(text="Identifier  V1  Location    V2
1   12  A   21
1   12  B   24
2   20  B   15
2   20  C   18
2   20  B   23
3   43  A   10
3   43  B   17
3   43  A   18
3   43  B   20
3   43  C   25
3   43  A   30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors

library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))

library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
#  Identifier  A  B B-1  C A-1 A-2
#1          1 21 24  NA NA  NA  NA
#2          2 NA 15  23 18  NA  NA
#3          3 10 17  20 25  18  30
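
To see what make.unique contributes on its own, here is a minimal sketch that applies it to the Location values of Identifier 3 only. The vector name loc is just an illustration and is not part of the answer above.

```r
# Locations recorded for Identifier 3, in their original row order
loc <- c("A", "B", "A", "B", "C", "A")

# Repeats get a numeric suffix; the first occurrence is left untouched
unique_loc <- make.unique(loc, sep = "-")
unique_loc
# "A"   "B"   "A-1" "B-1" "C"   "A-2"
```

ddply simply runs this once per Identifier, which is why the suffixes restart within each group.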

If column order is important to you (usually it isn't):

DFwide[, c(1, order(names(DFwide)[-1])+1)]
#  Identifier  A A-1 A-2  B B-1  C
#1          1 21  NA  NA 24  NA NA
#2          2 NA  NA  NA 15  23 18
#3          3 10  18  30 17  20 25
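
One caveat with ordering by name: once a suffix reaches two digits, a plain alphabetical sort would typically place "A-10" before "A-2". The following is a rough sketch of one way around that, assuming every trailing "-<digits>" in a column name was added by make.unique (a Location that genuinely ends in a hyphen and a number would be misread). The names nm, base, suffix and sorted are made up for this example.

```r
# Hypothetical column names, including a two-digit suffix
nm <- c("A", "B", "B-1", "C", "A-1", "A-2", "A-10")

# Split each name into its base label and its numeric suffix
has_suffix <- grepl("-[0-9]+$", nm)
base <- sub("-[0-9]+$", "", nm)
suffix <- integer(length(nm))            # 0 for names without a suffix
suffix[has_suffix] <- as.integer(sub("^.*-", "", nm[has_suffix]))

# Order by base label first, then by suffix as a number
sorted <- nm[order(base, suffix)]
sorted
# "A"    "A-1"  "A-2"  "A-10" "B"    "B-1"  "C"
```

With the wide data frame from above, something like DFwide[, c("Identifier", sorted)] would then apply that order, provided nm was taken from names(DFwide)[-1].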



Answer 2:


For reference, here's the equivalent of @Roland's answer in base R.

Use ave to make the "Location" values unique within each Identifier...

DF$Location <- with(DF, ave(Location, Identifier, 
                    FUN = function(x) make.unique(x, sep = "-")))

... and reshape to change the structure of your data.

## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you 
##    wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")

## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier", 
        timevar = "Location", drop = "V1")
#   Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1          1   21   24   NA     NA     NA     NA
# 3          2   NA   15   18     23     NA     NA
# 6          3   10   17   25     20     18     30

Reordering the columns can be done the same way that @Roland suggested.
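
For anyone who wants to run the base R route end to end, here is a self-contained sketch using the sample data from the question. The anyDuplicated checks are an addition for illustration, not part of the answer; they just confirm that each (Identifier, Location) pair is unique before reshaping.

```r
DF <- read.table(text = "Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header = TRUE, stringsAsFactors = FALSE)

# Index of the first repeated (Identifier, Location) pair, or 0 if none
dup_before <- anyDuplicated(DF[c("Identifier", "Location")])

# Make Location unique within each Identifier
DF$Location <- with(DF, ave(Location, Identifier,
                    FUN = function(x) make.unique(x, sep = "-")))

dup_after <- anyDuplicated(DF[c("Identifier", "Location")])

# One row per Identifier, one V2 column per (now unique) Location
wide <- reshape(DF, direction = "wide", idvar = "Identifier",
                timevar = "Location", drop = "V1")
```

Here dup_before is non-zero (the sample data repeats B for Identifier 2) and dup_after is 0, so reshape sees exactly one V2 value per cell.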



Source: https://stackoverflow.com/questions/23980046/forcing-unique-values-before-casting-pivoting-in-r
