Renaming duplicate strings in R

情歌与酒 2021-01-14 17:31

I have an R dataframe that has two columns of strings. In one of the columns (say, Column1) there are duplicate values. I need to relabel that column so that it would have t

4 answers
  • 2021-01-14 18:01
    d <- read.table(text='Column1   Column2  
     1         A 
     1         B 
     2         C 
     2         D 
     3         E 
     4         F', header=TRUE)
    
    transform(d, 
        Column1.new = ifelse(duplicated(Column1) | duplicated(Column1, fromLast=TRUE), 
                             paste(Column1, ave(Column1, Column1, FUN=seq_along), sep='_'), 
                             Column1))
    
    #   Column1 Column2 Column1.new
    # 1       1       A         1_1
    # 2       1       B         1_2
    # 3       2       C         2_1
    # 4       2       D         2_2
    # 5       3       E           3
    # 6       4       F           4
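
    The question mentions string columns, while the sample above uses integers. A minimal sketch of the same idiom on an unsorted character column, assuming made-up data (`d2` and its values are hypothetical, not from the question):

    ```r
    # Hypothetical unsorted character data, for illustration only
    d2 <- data.frame(Column1 = c("b", "a", "b", "c", "a"),
                     stringsAsFactors = FALSE)

    # TRUE for every value that occurs more than once (first occurrences included)
    dup <- duplicated(d2$Column1) | duplicated(d2$Column1, fromLast = TRUE)

    # Running index within each group of equal values; row order does not matter
    idx <- ave(seq_along(d2$Column1), d2$Column1, FUN = seq_along)

    # Suffix only the repeated values, leave unique ones untouched
    d2$Column1.new <- ifelse(dup, paste(d2$Column1, idx, sep = "_"), d2$Column1)
    d2$Column1.new
    ```

    Because `ave` groups by value rather than by position, this does not rely on the rows being sorted.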
    
  • 2021-01-14 18:05

    Let's say your data (sorted by Column1) is in an object called tab. First, create a run-length encoding object:

    c1.rle <- rle(tab$Column1)
    c1.rle
    ##lengths: int [1:4] 2 2 1 1
    ##values : int [1:4] 1 2 3 4
    

    That gives you the values of Column1 and the number of times each one appears. Then use that information to create the new column with unique identifiers:

    tab$Column1.new <- paste0(rep(c1.rle$values, times = c1.rle$lengths), "_",
            unlist(lapply(c1.rle$lengths, seq_len)))
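
    Note that the line above labels every row, so the unique values 3 and 4 become 3_1 and 4_1. A minimal sketch of a variant that leaves single values unsuffixed, assuming `tab` holds the sample data from the question (rebuilt here so the block runs on its own; `run_len` and `idx` are illustrative names):

    ```r
    # Sample data from the question, already sorted by Column1
    tab <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4),
                      Column2 = c("A", "B", "C", "D", "E", "F"))

    c1.rle <- rle(tab$Column1)

    # Length of the run each row belongs to, and its position inside that run
    run_len <- rep(c1.rle$lengths, times = c1.rle$lengths)
    idx <- sequence(c1.rle$lengths)

    # Add a suffix only where the run is longer than one row
    tab$Column1.new <- ifelse(run_len > 1,
                              paste0(tab$Column1, "_", idx),
                              as.character(tab$Column1))
    tab$Column1.new
    ```

    As with the original version, this relies on equal values being adjacent, since `rle` only counts consecutive repeats.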
    

    I'm not sure whether this is appropriate in your situation, but you could also just paste Column1 and Column2 together to create a unique identifier.
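
    A minimal sketch of that last idea, assuming `tab` holds the sample data from the question (rebuilt here so the block runs on its own; the column name `id` is made up for illustration):

    ```r
    # Sample data from the question
    tab <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4),
                      Column2 = c("A", "B", "C", "D", "E", "F"))

    # Combine both columns into one identifier
    tab$id <- paste(tab$Column1, tab$Column2, sep = "_")

    # anyDuplicated() returns 0 when no value is repeated
    ids_are_unique <- anyDuplicated(tab$id) == 0
    tab$id
    ```

    The result is only unique if no (Column1, Column2) pair occurs twice, so the check is worth keeping.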

  • 2021-01-14 18:17

    @Cão's answer, using only base R:

    x=read.table(text="
    Column1   Column2   #Column1.new
    1         A         #1_1
    1         B         #1_2
    2         C         #2_1
    2         D         #2_2
    3         E         #3
    4         F         #4", stringsAsFactors=F, header=T)
    
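    string <- x$Column1
    # make.unique() appends ".1", ".2", ... to repeats; turn that dot into "_"
    mstring <- make.unique(as.character(string))
    mstring <- sub("(.*)(\\.)([0-9]+)", "\\1_\\3", mstring)
    # give the first occurrence of each repeated value the suffix "_0"
    y <- rle(string)
    tmp <- !duplicated(string) & (string %in% y$values[y$lengths > 1])
    mstring[tmp] <- sub("(.*)", "\\1_0", mstring[tmp])
    # split suffixed entries into prefix and number, then add 1 to the number
    end <- sub(".*_([0-9]+)", "\\1", grep("_([0-9]*)$", mstring, value = TRUE))
    beg <- sub("(.*_)[0-9]+", "\\1", grep("_([0-9]*)$", mstring, value = TRUE))
    newend <- as.numeric(end) + 1
    mstring[grep("_([0-9]*)$", mstring)] <- paste0(beg, newend)
    x$Column1New <- mstring
    x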
    
  • 2021-01-14 18:21

    This may be more of a workaround, but parts of it may be simpler and more useful for someone whose needs are not quite the same. make.names with the argument unique=T adds a dot and a number to names that are repeated:

    x <- make.names(tab$Column1, unique=T)
    print(x)
    # [1] "X1"   "X1.1" "X2"   "X2.1" "X3"   "X4"
    

    This might be enough for some folks. Next, select the first occurrence of each repeated value (leaving values that are not repeated alone) and append .0 to it.

    library(stringr)  # provides str_replace() and str_match()
    y <- rle(tab$Column1)
    tmp <- !duplicated(tab$Column1) & (tab$Column1 %in% y$values[y$lengths>1])
    x[tmp] <- str_replace(x[tmp],"$","\\.0")
    print(x)
    # [1] "X1.0" "X1.1" "X2.0" "X2.1" "X3"   "X4"
    

    Replace the dots with underscores and remove the X:

    x <- str_replace(x,"X","")
    x <- str_replace(x,"\\.","_")
    print(x)
    # [1] "1_0" "1_1" "2_0" "2_1" "3"   "4"
    

    That might be good enough for you. But if you want the indexing to start at 1, extract the numbers, add one, then put them back.

    z <- str_match(x,"_([0-9]*)$")[,2]
    z <- as.character(as.numeric(z)+1)
    x <- str_replace(x,"_([0-9]*)$",paste0("_",z))
    print(x)
    # [1] "1_1" "1_2" "2_1" "2_2" "3"   "4"
    

    Like I said, this is more of a workaround, but it gives you some options.
