Renaming duplicate strings in R

I have an R dataframe that has two columns of strings. In one of the columns (say, Column1) there are duplicate values. I need to relabel that column so that it would have t

4 Answers

旧巷少年郎

2021-01-14 18:01

```
d <- read.table(text='Column1   Column2  
 1         A 
 1         B 
 2         C 
 2         D 
 3         E 
 4         F', header=TRUE)

transform(d, 
    Column1.new = ifelse(duplicated(Column1) | duplicated(Column1, fromLast=TRUE), 
                         paste(Column1, ave(Column1, Column1, FUN=seq_along), sep='_'), 
                         Column1))

#   Column1 Column2 Column1.new
# 1       1       A         1_1
# 2       1       B         1_2
# 3       2       C         2_1
# 4       2       D         2_2
# 5       3       E           3
# 6       4       F           4
```
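To see what each piece of the expression above contributes, here is a minimal sketch that computes the two intermediate vectors separately. The names `is_dup` and `counter` are illustrative only and do not appear in the answer; the data frame is rebuilt by hand so the snippet runs on its own.

```r
d <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4),
                Column2 = c("A", "B", "C", "D", "E", "F"))

# TRUE for every row whose Column1 value occurs more than once.
# duplicated() alone skips the first occurrence, hence the second
# pass with fromLast = TRUE.
is_dup <- duplicated(d$Column1) | duplicated(d$Column1, fromLast = TRUE)

# Running counter (1, 2, ...) within each group of equal Column1 values.
counter <- ave(seq_along(d$Column1), d$Column1, FUN = seq_along)

is_dup
counter
```

Only the rows flagged in `is_dup` receive the `_<counter>` suffix; the others keep their original value.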

暖寄归人

2021-01-14 18:05
Let's say your data (ordered by Column1) is stored in an object called tab. First create a run-length encoding of Column1:
```
c1.rle <- rle(tab$Column1)
c1.rle
##lengths: int [1:4] 2 2 1 1
##values : int [1:4] 1 2 3 4
```
That gives you the values of Column1 and the corresponding number of appearances of each value. Then use that information to create the new column with unique identifiers:
```
tab$Column1.new <- paste0(rep(c1.rle$values, times = c1.rle$lengths), "_",
        unlist(lapply(c1.rle$lengths, seq_len)))
```
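Note that the snippet above suffixes every row, so values that occur only once also get a suffix (`3_1`, `4_1`). If those should stay unchanged, one possible variation is sketched below. It still assumes `tab` is ordered by Column1; `run_len` and `idx` are names made up for this illustration.

```r
tab <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4),
                  Column2 = c("A", "B", "C", "D", "E", "F"))

c1.rle <- rle(tab$Column1)

# Length of the run each row belongs to, repeated once per row.
run_len <- rep(c1.rle$lengths, times = c1.rle$lengths)

# Position of each row within its run: 1, 2, ..., n.
idx <- sequence(c1.rle$lengths)

# Suffix only rows that belong to a run longer than one.
tab$Column1.new <- ifelse(run_len > 1,
                          paste0(tab$Column1, "_", idx),
                          as.character(tab$Column1))
tab
```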
I'm not sure whether this is appropriate in your situation, but you could also just paste Column1 and Column2 together to create a unique identifier.
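A minimal sketch of that alternative follows. The column name `id` is hypothetical, and the result is only unique if no (Column1, Column2) pair repeats, which the last line checks.

```r
tab <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4),
                  Column2 = c("A", "B", "C", "D", "E", "F"))

# Combine both columns into a single key.
tab$id <- paste(tab$Column1, tab$Column2, sep = "_")

# anyDuplicated() returns 0 when every key is distinct.
id_is_unique <- anyDuplicated(tab$id) == 0
id_is_unique
```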

栀梦

2021-01-14 18:17

@Cão's answer, using only base R:

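```
x <- read.table(text="
Column1   Column2   #Column1.new
1         A         #1_1
1         B         #1_2
2         C         #2_1
2         D         #2_2
3         E         #3
4         F         #4", stringsAsFactors=FALSE, header=TRUE)

string <- x$Column1
# make.unique() appends .1, .2, ... to repeated values: "1" "1.1" "2" "2.1" "3" "4"
mstring <- make.unique(as.character(string))
# Turn the ".n" suffix into "_n"
mstring <- sub("(.*)(\\.)([0-9]+)", "\\1_\\3", mstring)
# Flag the first occurrence of every value that appears more than once
y <- rle(string)
tmp <- !duplicated(string) & (string %in% y$values[y$lengths > 1])
# Give those first occurrences a "_0" suffix
mstring[tmp] <- paste0(mstring[tmp], "_0")
# Split each suffixed label into prefix and number, then shift the number up by one
end <- sub(".*_([0-9]+)", "\\1", grep("_([0-9]*)$", mstring, value=TRUE))
beg <- sub("(.*_)[0-9]+", "\\1", grep("_([0-9]*)$", mstring, value=TRUE))
newend <- as.numeric(end) + 1
mstring[grep("_([0-9]*)$", mstring)] <- paste0(beg, newend)
x$Column1New <- mstring
x
```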
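The steps above use rle(), which counts consecutive runs, so they assume equal values sit next to each other. For input that is not sorted, a possible base R sketch is shown below. The helper name `relabel_duplicates` is made up for this illustration and is not part of the answer.

```r
# Hypothetical helper: suffix repeated values with _1, _2, ...
# and leave values that occur only once untouched.
relabel_duplicates <- function(v) {
  v <- as.character(v)
  pos <- seq_along(v)
  n <- ave(pos, v, FUN = length)     # how often each value occurs overall
  k <- ave(pos, v, FUN = seq_along)  # running counter per value
  ifelse(n > 1, paste(v, k, sep = "_"), v)
}

relabel_duplicates(c("b", "a", "b", "c", "a", "b"))
relabel_duplicates(c(1, 1, 2, 2, 3, 4))
```

Because ave() groups by value rather than by position, the order of the input does not matter here.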

失恋的感觉

2021-01-14 18:21
This may be more of a workaround, but parts of it may be simpler and more useful for someone with slightly different needs. make.names with the argument unique=TRUE appends a dot and a number to names that are repeated (and prefixes an X, because a syntactically valid name cannot start with a digit):
```
x <- make.names(tab$Column1, unique=TRUE)
print(x)
# [1] "X1"   "X1.1" "X2"   "X2.1" "X3"   "X4"
```
This might be enough for some folks. From here you can pick out the first occurrence of each repeated element (leaving elements that occur only once alone) and append .0 to it:
```
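library(stringr)  # provides str_replace() and str_match()

y <- rle(tab$Column1)
# First occurrence of each value that appears more than once
tmp <- !duplicated(tab$Column1) & (tab$Column1 %in% y$values[y$lengths>1])
x[tmp] <- str_replace(x[tmp],"$","\\.0")
print(x)
# [1] "X1.0" "X1.1" "X2.0" "X2.1" "X3"   "X4"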
```
Replace the dots with underscores and remove the X:
```
x <- str_replace(x,"X","")
x <- str_replace(x,"\\.","_")
print(x)
# [1] "1_0" "1_1" "2_0" "2_1" "3"   "4"
```
That might be good enough for you. But if you want the indexing to start at 1, grab the numbers, add one, then put them back:
```
z <- str_match(x,"_([0-9]*)$")[,2]   # NA where there is no suffix
z <- as.character(as.numeric(z)+1)
x <- str_replace(x,"_([0-9]*)$",paste0("_",z))
print(x)
# [1] "1_1" "1_2" "2_1" "2_2" "3"   "4"
```
Like I said, more of a workaround here, but gives some options.
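As a possible shortcut for the intermediate steps, base R's make.unique() accepts a `sep` argument, so the dot replacement and the X removal can be skipped. This is only a sketch of that one shortcut; the input vector `v` stands in for tab$Column1.

```r
v <- c(1, 1, 2, 2, 3, 4)

# make.unique() leaves the first occurrence as-is and numbers the
# following ones starting at 1, using `sep` as the separator.
u <- make.unique(as.character(v), sep = "_")
u
```

The first occurrence of a repeated value is still left without a suffix, so the later steps (adding a suffix to first occurrences and shifting the numbers) are still needed to reach `1_1`, `1_2`.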