How to partition a vector into groups of regular, consecutive sequences?

I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence in a ragged vector resulting in:

L1: 1
L2: 3,4,5
L3: 9,10
L4: 17
L5: 29,30

Naive code (of an ex-C programmer):

partition.neighbors <- function(v)
{
    result <<- list() #jagged array
    currentList <<- v[1] #current series

    for(i in 2:length(v))
    {
        if(v[i] - v [i-1] == 1)
        {
            currentList <<- c(currentList, v[i])
        }
        else
        {
            result <<- c(result, list(currentList))
            currentList <<- v[i] #next series
        }       
    }

    return(result)  
}

Now I understand that

a) R is not C (despite the curly brackets)
b) global variables are pure evil
c) that is a horribly inefficient way of achieving the result

, so any better solutions are welcome.

Joshua Ulrich

Making heavy use of some R idioms:

> split(v, cumsum(c(1, diff(v) != 1)))
$`1`
[1] 1

$`2`
[1] 3 4 5

$`3`
[1]  9 10

$`4`
[1] 17

$`5`
[1] 29 30

Aaron

daroczig writes "you could write a lot neater code based on diff"...

Here's one way:

split(v, cumsum(diff(c(-Inf, v)) != 1))

EDIT (added timings):

Tommy discovered this could be faster by being careful with types; the reason it got faster is that split is faster on integers, and is actually faster still on factors.

Here's Joshua's solution; the result from the cumsum is a numeric because it's being c'd with 1, so it's the slowest.

system.time({
a <- cumsum(c(1, diff(v) != 1))
split(v, a)
})
#   user  system elapsed 
#  1.839   0.004   1.848

Just cing with 1L so the result is an integer speeds it up considerably.

system.time({
a <- cumsum(c(1L, diff(v) != 1))
split(v, a)
})
#   user  system elapsed 
#  0.744   0.000   0.746

This is Tommy's solution, for reference; it's also splitting on an integer.

> system.time({
a <- cumsum(c(TRUE, diff(v) != 1L))
split(v, a)
})
#   user  system elapsed 
#  0.742   0.000   0.746

Here's my original solution; it also is splitting on an integer.

system.time({
a <- cumsum(diff(c(-Inf, v)) != 1)
split(v, a)
})
#   user  system elapsed 
#  0.750   0.000   0.754

Here's Joshua's, with the result converted to an integer before the split.

system.time({
a <- cumsum(c(1, diff(v) != 1))
a <- as.integer(a)
split(v, a)
})
#   user  system elapsed 
#  0.736   0.002   0.740

All the versions that split on an integer vector are about the same; it could be even faster if that integer vector was already a factor, as the conversion from integer to factor actually takes about half the time. Here I make it into a factor directly; this is not recommended in general because it depends on the structure of the factor class. It'ss done here for comparison purposes only.

system.time({
a <- cumsum(c(1L, diff(v) != 1))
a <- structure(a, class = "factor", levels = 1L:a[length(a)])
split(v,a)
})
#   user  system elapsed 
#  0.356   0.000   0.357

Tommy

Joshua and Aaron were spot on. However, their code can still be made more than twice as fast by careful use of the correct types, integers and logicals:

split(v, cumsum(c(TRUE, diff(v) != 1L)))

v <- rep(c(1:5, 19), len = 1e6) # Huge vector...
system.time( split(v, cumsum(c(1, diff(v) != 1))) ) # Joshua's code
# user  system elapsed 
#   2.64    0.00    2.64 

system.time( split(v, cumsum(c(TRUE, diff(v) != 1L))) ) # Modified code
# user  system elapsed 
# 1.09    0.00    1.12

James

You can create a data.frame and assign the elements to groups using diff, ifelse and cumsum, then aggregate using tapply:

v.df <- data.frame(v = v)
v.df$group <- cumsum(ifelse(c(1, diff(v) - 1), 1, 0))
tapply(v.df$v, v.df$group, function(x) x)

$`1`
[1] 1

$`2`
[1] 3 4 5

$`3`
[1]  9 10

$`4`
[1] 17

$`5`
[1] 29 30

You could define the cut-points easily:

which(diff(v) != 1)

Based on that try:

v <- c(1,3,4,5,9,10,17,29,30)
cutpoints <- c(0, which(diff(v) != 1), length(v))
ragged.vector <- vector("list", length(cutpoints)-1)
for (i in 2:length(cutpoints)) ragged.vector[[i-1]] <- v[(cutpoints[i-1]+1):cutpoints[i]]

Which results in:

> ragged.vector
[[1]]
[1] 1

[[2]]
[1] 3 4 5

[[3]]
[1]  9 10

[[4]]
[1] 17

[[5]]
[1] 29 30

This algorithm is not a nice one but you could write a lot neater code based on diff :) Good luck!

来源：https://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-regular-consecutive-sequences

标签

vector

data-partitioning