Find and replace numeric sequence in R

Question

I have a dataframe with a sequence of numbers similar to below:

data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)

What I need is a way to locate all runs of 1, 2 or 3 consecutive zeros where the preceding and following numbers are identical, i.e. both 1 or both 2 (for example 1,0,1 or 2,0,0,2 but NOT 2,0,1).

Then I need to fill the zeros only with the surrounding value.

I have managed to locate and count consecutive zeros:

consec <- (!data) * unlist(lapply(rle(data)$lengths, seq_len))

Then I found the positions where these runs of consecutive zeros begin:

consec <- as.matrix(consec)
first_na <- which(consec==1,arr.ind=TRUE)

But I'm stumped by the replacement process.

I would really appreciate your help with this!

Carl

Answer 1:

Since there seems to be a lot of interest in the answer to this question, I thought I would write up an alternative regular expressions method for posterity.

Using the 'gregexpr' function, you can search out patterns and use the resulting location matches and match lengths to call out which values to change in the original vector. The advantage of using regular expressions is that we can be explicit about exactly which patterns we want to match, and as a result, we won't have any exclusion cases to worry about.

Note: The following example works as written, because we are assuming single-digit values. We could easily adapt it for other patterns, but we can take a small shortcut with single characters. If we wanted to do this with possible multiple-digit values, we would want to add a separation character as part of the first concatenation ('paste') function.

The Code

str.values <- paste(data, collapse="") # String representation of vector
str.matches <- gregexpr("1[0]{1,3}1", str.values) # Pattern 101/1001/10001
data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 1 # Replace zeros with ones
str.matches <- gregexpr("2[0]{1,3}2", str.values) # Pattern 202/2002/20002
data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 2 # Replace zeros with twos

Step 1: Make a single string of all the data values.

str.values <- paste(data, collapse="")
# "11100112220002110102"

This collapses down the data into one long string, so we can use a regular expression on it.

Step 2: Apply a regular expression to find the locations and lengths of any matches within the string.

str.matches <- gregexpr("1[0]{1,3}1", str.values)
# [[1]]
# [1]  3 16
# attr(,"match.length")
# [1] 4 3
# attr(,"useBytes")
# [1] TRUE

In this case, we're using a regular expression to look for the first pattern: one to three zeros ([0]{1,3}) with ones on either side (1[0]{1,3}1). We match the entire pattern, including the surrounding ones, so that we don't have to check separately for matching ones or twos on the ends. We'll subtract those ends off in the next step.

Step 3: Write ones into all the matching locations in the original vector.

data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 1
# 1 1 1 1 1 1 1 2 2 2 0 0 0 2 1 1 1 1 0 2

We're doing a few steps all at once here. First, we create a list of number sequences from the positions that matched the regular expression. In this case, there are two matches, which start at indexes 3 and 16 and are 4 and 3 items long, respectively. This means our zeros are located at indexes (3+1):(3-2+4), or 4:5, and at (16+1):(16-2+3), or 17:17. We concatenate ('paste') these sequences using the 'collapse' option again, in case there are multiple matches. Then, we use a second concatenation to put the sequences inside a combine (c()) call. Using the 'eval' and 'parse' functions, we turn this text into code and pass it as index values to the data vector. We write ones into all of those locations.
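The same index arithmetic can also be done without building code as text. The following is only a sketch of that idea, not the code from this answer: fill_pattern is a hypothetical helper name, and the early return for "no match" (where gregexpr reports -1) is an added assumption about how that case should be handled.

```r
data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
str.values <- paste(data, collapse = "")

# Hypothetical helper: fill runs of 1-3 zeros that sit between two copies of `value`.
fill_pattern <- function(x, s, value) {
  m <- gregexpr(paste0(value, "0{1,3}", value), s)[[1]]
  if (m[1] == -1) return(x)                  # gregexpr() reports "no match" as -1
  starts <- as.integer(m) + 1L               # first zero of each match
  ends   <- as.integer(m) + attr(m, "match.length") - 2L   # last zero of each match
  idx <- unlist(Map(seq, starts, ends))      # e.g. 4:5 and 17:17 for the ones pattern
  x[idx] <- value
  x
}

filled <- fill_pattern(data, str.values, 1)
filled <- fill_pattern(filled, str.values, 2)
print(filled)
```

Both calls search the original string, as in the steps above, so the two patterns do not interfere with each other.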

Step 4: Repeat for each pattern. In this case, we need to do a second search to find one to three zeros with twos on either side, and then run the same statement as in Step 3, but assigning twos instead of ones.

str.matches <- gregexpr("2[0]{1,3}2", str.values)
# [[1]]
# [1] 10
# attr(,"match.length")
# [1] 5
# attr(,"useBytes")
# [1] TRUE

data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 2
# 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 0 2

Update: I realized the original problem said to match one to three zeros in a row, rather than the "two or more" that I had written into the original code. I have updated the regular expressions and the explanation; the rest of the code remains the same.

Answer 2:

Here is a loopless solution using rle() and inverse.rle().

data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)

local({
  r <- rle(data)
  x <- r$values
  x0 <- which(x==0) # index positions of zeroes
  xt <- x[x0-1]==x[x0+1] # zeroes surrounded by same value
  r$values[x0[xt]] <- x[x0[xt]-1] # substitute with surrounding value
  inverse.rle(r)
})

[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 0 2

PS. I use local() as a simple mechanism to not clobber the workspace with loads of new temporary objects. You could create a function instead of using local - I just find I use local a lot nowadays for this type of task.

PPS. You will have to modify this code to exclude leading or trailing zeroes in your original data, because a run of zeroes at either end has no neighbour on one side to compare against.
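One possible way to handle the leading/trailing case is sketched below. It is an illustration under assumptions, not part of the answer above: fill_zero_runs and max_run are hypothetical names, and the limit of three zeros per run is taken from the question rather than from this answer's code.

```r
# Hypothetical helper: fill interior zero runs of length <= max_run whose
# neighbouring runs carry the same value.
fill_zero_runs <- function(x, max_run = 3) {
  r <- rle(x)
  v <- r$values
  i <- which(v == 0 & r$lengths <= max_run)  # short zero runs
  i <- i[i > 1 & i < length(v)]              # drop leading/trailing runs
  keep <- v[i - 1] == v[i + 1]               # same value on both sides
  r$values[i[keep]] <- v[i[keep] - 1]        # substitute the surrounding value
  inverse.rle(r)
}

data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
print(fill_zero_runs(data))
print(fill_zero_runs(c(0,0,1,0,1,0)))   # zeros at the ends are left alone
```

Because adjacent runs in an rle() result always differ in value, the neighbours of a zero run are never zero themselves, so comparing v[i - 1] with v[i + 1] is enough.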

Answer 3:

There may be a solution without a for loop, but you can try this:

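tmp <- rle(data)                      # run-length encode the vector
val <- tmp$values
# Visit interior runs only; seq_len() avoids a backwards 2:1 loop on short input
for (i in seq_len(max(length(val) - 2, 0)) + 1) {
  # && is the scalar operator intended inside if()
  if (val[i] == 0 && val[i-1] == val[i+1]) val[i] <- val[i-1]
}
tmp$values <- val
inverse.rle(tmp)                      # expand back to the full-length vector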

Which gives:

[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 0 2

Answer 4:

For those looking into this in 2020: I did a sequence replacement using just gsub.

str.values <- paste(YOUR$COLUMN, collapse="")
str.values2 <- gsub("ORIGINAL PATTERN", "REPLACEMENT", str.values)
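Applied to the data from the question, the gsub idea might look like the sketch below. The loop over values and run lengths, the lookahead, and the final conversion back to numbers are assumptions added for illustration; like the first answer, it only works for single-digit values.

```r
data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
s <- paste(data, collapse = "")

for (v in c("1", "2")) {
  for (n in 1:3) {
    # v, then exactly n zeros, then v again. The closing v is a lookahead
    # (perl = TRUE), so it is not consumed and can open the next match.
    pattern <- paste0(v, strrep("0", n), "(?=", v, ")")
    s <- gsub(pattern, strrep(v, n + 1), s, perl = TRUE)
  }
}

filled <- as.numeric(strsplit(s, "")[[1]])   # back to a numeric vector
print(filled)
```

Each replacement has the same length as the text it replaces, so positions in the string stay aligned with positions in the original vector.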

Source: https://stackoverflow.com/questions/15067073/find-and-replace-numeric-sequence-in-r

Tags

replace

Sequence