Question
I am trying to extract passages like 44.11.36.00-1 (precisely, nn.nn.nn.nn-n, where n stands for any digit from 0-9) from text in R.
I want to extract a passage only if it is directly adjacent to non-digit characters:
44.11.36.00-1 extracted from nsfghstighsl44.11.36.00-1vsdfgh is OK
44.11.36.00-1 extracted from fa0044.11.36.00-1000 is NOT
I have read that str_extract_all does not work with lookbehind and lookahead expressions, so I reluctantly went back to grep, but I cannot get it to work:
> pattern1 <- "(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})"
> grep(pattern1, "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 ", perl=TRUE, value = TRUE)
[1] "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "
which is not the result I expected.
I thought that:
(?<![0-9]{1}) means "match an expression which is not preceded by a digit"
[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1} stands for the expression I am looking for
(?![0-9]{1}) means "match an expression which is not followed by a digit"
Answer 1:
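As @Roland said in his comment, you need to use regmatches instead of grep. With value = TRUE, grep returns every input element that contains a match, not the matched substring itself.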
> s <- "nsfghstighsl44.11.36.00-1vsdfgh"
> m <- gregexpr("(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "44.11.36.00-1"
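To see the difference on the test string from the question, here is a minimal sketch using only base R. The variable names (txt, pat, contains_match, whole_element, hits) are illustrative and not from the original posts. grepl and grep report which elements contain a match, while regmatches returns the matched substrings.

```r
# Test string from the question: only the first number has the dotted form.
txt <- "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "
pat <- "(?<![0-9])[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9](?![0-9])"

# grepl() only answers "does this element contain a match?"
contains_match <- grepl(pat, txt, perl = TRUE)

# grep(value = TRUE) returns the whole matching element, unchanged.
whole_element <- grep(pat, txt, perl = TRUE, value = TRUE)

# gregexpr() locates the matches and regmatches() cuts them out.
hits <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]
print(hits)
```

So the pattern in the question was already doing what its author intended; only the extraction function needed to change.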
A shorter version of the same pattern, using \\d and a two-element input:
> x <- c('nsfghstighsl44.11.36.00-1vsdfgh', 'fa0044.11.36.00-1000')
> m <- gregexpr("(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "44.11.36.00-1"

[[2]]
character(0)
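When gregexpr is given a vector, regmatches returns a list with one entry per input string, and a string without a match yields character(0). The following is a small base R sketch of turning that list into a flat vector or a per-string count; the names res, all_codes and n_per_string are made up for illustration.

```r
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000")
pat <- "(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)"

# One list entry per element of x.
res <- regmatches(x, gregexpr(pat, x, perl = TRUE))

# Flatten to one character vector; strings without a match contribute nothing.
all_codes <- unlist(res)

# Number of matches found in each input string.
n_per_string <- lengths(res)
```

Keeping the list is useful when you need to know which input each code came from; unlist is enough when only the codes themselves matter.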
Answer 2:
You don't actually need lookahead or lookbehind with this approach. Just parenthesize the portion you want extracted:
library(gsubfn)
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000") # test data
pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"
strapply(x, pat, ~ ..2, simplify = c)
## "44.11.36.00-1"
Note that ~ ..2 is short for the function function(...) ..2, which means grab the match to the second parenthesized portion in the regular expression. We could also have written function(x, y, z) y or x + y + z ~ y.
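The same capture-group idea can also be sketched without gsubfn, assuming base R's regexec and regmatches. This is a swapped-in alternative rather than the answer's own code, and extract_code is a hypothetical helper name. Note that regexec reports only the first match in each string.

```r
# Hypothetical helper: return the first nn.nn.nn.nn-n code in each string,
# or NA when the string has none.
extract_code <- function(x) {
  pat <- "(^|[^0-9])([0-9]{2}[.][0-9]{2}[.][0-9]{2}[.][0-9]{2}-[0-9])([^0-9]|$)"
  groups <- regmatches(x, regexec(pat, x))
  # Each non-empty entry is c(full match, group 1, group 2, group 3);
  # group 2 is the code itself, so take the third element.
  vapply(groups,
         function(g) if (length(g) > 0) g[3] else NA_character_,
         character(1))
}

x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000")
codes <- extract_code(x)
print(codes)
```

Unlike the lookaround version, the neighbouring non-digit characters are part of the full match here, which is why the code has to be picked out of the capture groups afterwards.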
Note: The question seems to say that a non-digit must come directly before and after the string. Based on comments that have since disappeared, however, it appears that what was really wanted was that the string be either at the beginning or just after a non-digit, and either at the end or followed by a non-digit. The answer has been modified accordingly.
Source: https://stackoverflow.com/questions/26573611/r-regex-lookbehind-lookahead-issue