R enumerate duplicates in a dataframe with unique value

只愿长相守 提交于 2021-02-08 19:39:27

问题


I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.

As an example, I've come up with the below code. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works, however it seems that there must be a more elegant way to approach this problem. Any thoughts?

Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).

New example:

part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)

data<-data.frame(part,site,result)
data$index<-1
repeat {
    if(!anyDuplicated(data[,c("part","site","index")]))
    { break }
    data$index<-ifelse(duplicated(data[,1:2]),data$index+1,data$index)
}
data

      part site result index
1     A    N     17     1
2     A    C     20     1
3     A    S     25     1
4     B    C     51     1
5     B    N     50     1
6     B    S     49     1
7     A    N     43     2
8     A    C     45     2
9     A    S     47     2
10    C    N     52     1
11    C    S     51     1
12    C    C     56     1

Old example:

#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]

#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1

# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
    if(!anyDuplicated(df[,c(1,3)]))
    { break }
    df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}

# Check - The below vector should all be true
df$index==morley$Expt

回答1:


We may use diff and cumsum on the 'Run' column to get the expected output. In this method, we are not creating a column of 1s i.e 'index' and also assuming that the sequence in 'Run' is ordered as showed in the OP's example.

indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE

Or we can use ave

indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE

Update

Using the new example

with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1

Or we can use getanID from library(splitstackshape)

library(splitstackshape)
getanID(data, c('part', 'site'))[]



回答2:


I think this is a job for make.unique, with some manipulation.

index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
[1] TRUE



回答3:


Details of your actual data.frame may matter. However, a couple of options working with your example:

#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]


来源:https://stackoverflow.com/questions/32393814/r-enumerate-duplicates-in-a-dataframe-with-unique-value

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!