Removing Only Adjacent Duplicates in Data Frame in R

前端未结

关注

 3  756

I have a data frame in R that is supposed to have duplicates. However, there are some duplicates that I would need to remove. In particular, I only want to

相关标签:

3条回答

旧时难觅i

2021-01-12 21:24
Here's an rle solution:
```
df[cumsum(rle(as.character(df$x))$lengths), ]
#    x  y
# 1  A  1
# 2  B  2
# 3  C  3
# 4  A  4
# 5  B  5
# 6  C  6
# 7  A  7
# 9  B  9
# 10 C 10
```
Explanation:

RLE stands for Run Length Encoding. It produces a list of vectors. One being the runs, the values, and the other lengths being the number of consecutive repeats of each value. For example, x <- c(3, 2, 2, 3) has a runs vector of c(3, 2, 3) and lengths c(1, 2, 1). In this example, the cumulative sum of the lengths produces c(1, 3, 4). Subset x with this vector and you get c(3, 2, 3). Note that the second element of the lengths vector is the third element of the vector and the last occurrence of 2 in that particular 'run'.
0 讨论(0)
发布评论:

提交评论
- 加载中...

广开言路

2021-01-12 21:30

You could also try

df[c(diff(as.numeric(df$x)), 1) != 0, ]

In case x is of character class (rather than factor), try

df[c(diff(as.numeric(factor(df$x))), 1) != 0, ]
#    x  y
# 1  A  1
# 2  B  2
# 3  C  3
# 4  A  4
# 5  B  5
# 6  C  6
# 7  A  7
# 9  B  9
# 10 C 10

0 讨论(0)

夕颜

2021-01-12 21:47
Try
```
 df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
#   x  y
#1  A  1
#2  B  2
#3  C  3
#4  A  4
#5  B  5
#6  C  6
#7  A  7
#9  B  9
#10 C 10
```
Explanation

Here, we are comparing an element with the element preceding it. This can be done by removing the first element from the column and that column compared with the column from which last element is removed (so that the lengths become equal)
```
 df$x[-1] #first element removed
 #[1] B C A B C A B B C
 df$x[-nrow(df)]
  #[1] A B C A B C A B B #last element `C` removed

 df$x[-1]!=df$x[-nrow(df)]
 #[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
```
In the above, the length is 1 less than the nrow of df as we removed one element. Inorder to compensate that, we can concatenate a TRUE and then use this index for subsetting the dataset.
0 讨论(0)
发布评论:

提交评论
- 加载中...

Removing Only Adjacent Duplicates in Data Frame in R

Explanation