R enumerate duplicates in a dataframe with unique value

Question

I have a dataframe containing a set of parts and test results. The parts are tested at 3 sites (North, Centre and South). Sometimes those parts are re-tested. I eventually want to create some charts that compare the results from the first time a part was tested with the second (or third, etc.) time it was tested, e.g. to look at tester repeatability.

As an example, I've come up with the code below. ~~I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate.~~ The code works, but it seems there must be a more elegant way to approach this problem. Any thoughts?

Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).

New example:

part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)

data<-data.frame(part,site,result)
data$index<-1
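# Bump the index of any row whose (part, site, index) triple has already
# been seen; repeat until every triple is unique
repeat {
    if(!anyDuplicated(data[,c("part","site","index")]))
    { break }
    data$index<-ifelse(duplicated(data[,c("part","site","index")]),data$index+1,data$index)
}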
data

   part site result index
1     A    N     17     1
2     A    C     20     1
3     A    S     25     1
4     B    C     51     1
5     B    N     50     1
6     B    S     49     1
7     A    N     43     2
8     A    C     45     2
9     A    S     47     2
10    C    N     52     1
11    C    S     51     1
12    C    C     56     1

Old example:

# Generate a trial data frame from the morley dataset (columns Run and Speed)
df<-morley[,c(2,3)]

# Create the index column and initialise it to 1
df$index<-1

# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
    if(!anyDuplicated(df[,c(1,3)]))
    { break }
    df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}

# Check: every element of the vector below should be TRUE
df$index==morley$Expt

Answer 1:

We can use diff and cumsum on the 'Run' column to get the expected output. This method does not create a column of 1s (i.e. 'index'), and it assumes that the sequence in 'Run' is ordered as shown in the OP's example.

indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
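
To see what each step of the diff/cumsum approach produces, and why the ordering assumption matters, here is a minimal sketch on made-up vectors (run_ordered and run_shuffled are hypothetical and not taken from the morley data):

```r
# Hypothetical toy data: two experiments of three runs each, in order
run_ordered <- c(1, 2, 3, 1, 2, 3)

# diff() is negative exactly where the run number resets
resets <- diff(run_ordered) < 0         # FALSE FALSE TRUE FALSE FALSE
idx_ordered <- cumsum(c(TRUE, resets))  # 1 1 1 2 2 2

# If the runs are not sorted within an experiment, every decrease is
# counted as a new experiment, so the result is no longer meaningful
run_shuffled <- c(1, 3, 2, 1, 2, 3)
idx_shuffled <- cumsum(c(TRUE, diff(run_shuffled) < 0))  # 1 1 2 3 3 3
```

The second result shows that this method only suits data where each repeat of the sequence is strictly increasing.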

Or we can use ave

indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE

Update

Using the new example

with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
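
As a hedged illustration that the ave approach is not limited to a single re-test, the sketch below uses a small made-up data frame (toy is hypothetical) in which one part/site pair appears three times:

```r
# Hypothetical data: part A is tested at site N three times
toy <- data.frame(
  part = c("A", "A", "B", "A", "A"),
  site = c("N", "N", "N", "N", "S"),
  stringsAsFactors = FALSE
)

# ave() splits the row positions by every part/site combination and
# replaces each group with 1, 2, 3, ... in order of appearance
toy$index <- with(toy, ave(seq_along(part), part, site, FUN = seq_along))
toy$index
#[1] 1 2 1 3 1
```

The rows do not need to be sorted by part or site, since the counter only depends on the order of appearance within each group.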

Or we can use getanID from library(splitstackshape)

library(splitstackshape)
getanID(data, c('part', 'site'))[]

Answer 2:

I think this is a job for make.unique, with some manipulation.

index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
#[1] TRUE
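
The manipulation may be easier to follow step by step. Below is a minimal sketch on a made-up character vector (keys is hypothetical). make.unique leaves the first occurrence of each value alone and appends .1, .2, ... to later ones, so the suffix plus one is the occurrence number. Note that the regular expression assumes the values consist only of digits, as morley$Run does.

```r
# Hypothetical keys made of digits only, like as.character(morley$Run)
keys <- c("1", "2", "1", "2", "1")

uniq <- make.unique(keys)              # "1" "2" "1.1" "2.1" "1.2"
suffix <- sub("\\d+(\\.)?", "", uniq)  # ""  ""  "1"   "1"   "2"

# An empty suffix converts to NA, which marks a first occurrence
index <- 1L + as.integer(suffix)
index <- ifelse(is.na(index), 1L, index)
index
#[1] 1 1 2 2 3
```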

Answer 3:

Details of your actual data.frame may matter. However, a couple of options working with your example:

# This works if every experiment starts with Run == 1:
df$index<-cumsum(df$Run==1)
# This is perhaps more general, using data.table:
library(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]
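
For the two-column grouping in the question's new example, the data.table approach might look like the sketch below. It assumes the data.table package is installed. The part, site and result values are copied from the question, dt2 is a hypothetical name, and seq_len(.N) is used in place of seq_along(Speed) since only the group size matters.

```r
library(data.table)

# Data copied from the question's new example
dt2 <- data.table(
  part   = c("A","A","A","B","B","B","A","A","A","C","C","C"),
  site   = c("N","C","S","C","N","S","N","C","S","N","S","C"),
  result = c(17,20,25,51,50,49,43,45,47,52,51,56)
)

# Number the rows within each part/site group in order of appearance
dt2[, index := seq_len(.N), by = .(part, site)]
dt2$index
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
```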

Source: https://stackoverflow.com/questions/32393814/r-enumerate-duplicates-in-a-dataframe-with-unique-value

Tags

duplicates