How to remove partial duplicates from a data frame?


The data I'm importing describes numeric measurements taken at various locations at more or less evenly spread timestamps. Sometimes this "evenly spread" is not really true.

2 Answers

迷失自我

2020-12-14 05:03

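I would use subset combined with duplicated to filter out rows with non-unique timestamps in the second data frame. duplicated flags every repeat after the first occurrence, so the first row for each timestamp is kept: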

R> df_ <- read.table(textConnection('
                     ts         v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)

R> subset(df_, !duplicated(ts))
                   ts      v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080

Update: To select a specific value for each timestamp, you can use aggregate:

aggregate(df_$v, by=list(df_$ts), function(x) x[1])  # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1))  # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x))  # max value
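
As a variation on the calls above, here is a minimal sketch using the formula interface of aggregate, which keeps the column names ts and v instead of Group.1 and x. It also shows duplicated with fromLast=TRUE, which keeps the whole last row per timestamp. The data frame is a copy of the sample data from this answer, and the result names (first_v, mean_v, last_rows) are only illustrative.

```r
# Self-contained copy of the sample data from the answer above.
df_ <- data.frame(
  ts = c("2009-09-30 10:00:00", "2009-09-30 10:15:00", "2009-09-30 10:15:00",
         "2009-09-30 10:15:00", "2009-09-30 10:15:00", "2009-09-30 10:30:00",
         "2009-09-30 10:30:00", "2009-09-30 10:45:00", "2009-09-30 11:00:00"),
  v = c(-2.081609, -2.079778, -2.113531, -2.124716, -2.102117,
        -2.093542, -2.092626, -2.086339, -2.080144),
  stringsAsFactors = FALSE
)

# Formula interface: one row per timestamp, columns stay named ts and v.
first_v <- aggregate(v ~ ts, data = df_, FUN = function(x) x[1])
mean_v  <- aggregate(v ~ ts, data = df_, FUN = mean)

# Keep the entire last row for each timestamp (useful with more columns).
last_rows <- df_[!duplicated(df_$ts, fromLast = TRUE), ]
```

Averaging is just one choice here. Whether first, last, max or mean is appropriate depends on what the duplicated measurements represent.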


礼貌的吻别

2020-12-14 05:04

I think you are looking for data structures for time-indexed objects, and not for a dictionary. For the former, look at the zoo and xts packages, which offer much better time-based subsetting:

R> library(xts)
R> X <- xts(data.frame(val=rnorm(10)),
            order.by=Sys.time() + sort(runif(10,10,300)))
R> X
                        val
2009-11-20 07:06:17 -1.5564
2009-11-20 07:06:40 -0.2960
2009-11-20 07:07:50 -0.4123
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
2009-11-20 07:10:11  1.0018
2009-11-20 07:10:12  1.0747
2009-11-20 07:10:58  0.7062
R> X["2009-11-20 07:08::2009-11-20 07:09"]
                        val
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
R>

The X object is ordered by a time sequence. Make sure the index is of type POSIXct, so you may need to parse your dates first. Then we can just index for '7:08 to 7:09 on the given day'.
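
The parsing step could look roughly like the sketch below. The timestamps and values are hypothetical (a few rows modelled on the output above), and the UTC time zone is an assumption; use whatever zone your data was recorded in. The base-R comparison selects the same window as the xts range, and the xts part only runs if the package is installed.

```r
# Hypothetical character timestamps, as they might arrive from a text import.
ts_chr <- c("2009-11-20 07:06:17", "2009-11-20 07:08:18",
            "2009-11-20 07:08:45", "2009-11-20 07:09:57",
            "2009-11-20 07:10:58")
val <- c(-1.5564, -1.5574, -1.8846, 0.9598, 0.7062)

# Parse to POSIXct; the UTC time zone is an assumption for this example.
ts <- as.POSIXct(ts_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
stopifnot(inherits(ts, "POSIXct"), !anyNA(ts))

# Base-R version of the window: from 07:08:00 up to, but excluding, 07:10:00.
lo <- as.POSIXct("2009-11-20 07:08:00", tz = "UTC")
hi <- as.POSIXct("2009-11-20 07:10:00", tz = "UTC")
in_range <- ts >= lo & ts < hi

# With xts installed, the same selection uses range-string indexing.
if (requireNamespace("xts", quietly = TRUE)) {
  X <- xts::xts(data.frame(val = val), order.by = ts)
  print(X["2009-11-20 07:08/2009-11-20 07:09"])
}
```

If as.POSIXct returns NA for some rows, the format string does not match those inputs, which is worth checking before building the xts object.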
