Regex group capture in R with multiple capture-groups

徘徊边缘 提交于 2019-11-26 05:59:55

问题


In R, is it possible to extract group capture from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub return the group captures.

I need to extract key-value pairs from strings that are encoded thus:

\\((.*?) :: (0\\.[0-9]+)\\)

I can always just do multiple full-match greps, or do some outside (non-R) processing, but I was hoping I can do it all within R. Is there\'s a function or a package that provides such a function to do this?


回答1:


str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):

> s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
> str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
     [,1]                         [,2]       [,3]          
[1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
[2,] "(moretext :: 0.111222)"     "moretext" "0.111222"    



回答2:


gsub does this, from your example:

gsub("\\((.*?) :: (0\\.[0-9]+)\\)","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"

you need to double escape the \s in the quotes then they work for the regex.

Hope this helps.




回答3:


Try regmatches() and regexec():

regmatches("(sometext :: 0.1231313213)",regexec("\\((.*?) :: (0\\.[0-9]+)\\)","(sometext :: 0.1231313213)"))
[[1]]
[1] "(sometext :: 0.1231313213)" "sometext"                   "0.1231313213"



回答4:


gsub() can do this and return only the capture group:

However, in order for this to work, you must explicitly select elements outside your capture group as mentioned in the gsub() help.

(...) elements of character vectors 'x' which are not substituted will be returned unchanged.

So if your text to be selected lies in the middle of some string, adding .* before and after the capture group should allow you to only return it.

gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*","\\1 \\2", "(sometext :: 0.1231313213)") [1] "sometext 0.1231313213"




回答5:


I like perl compatible regular expressions. Probably someone else does too...

Here is a function that does perl compatible regular expressions and matches the functionality of functions in other languages that I am used to:

regexpr_perl <- function(expr, str) {
  match <- regexpr(expr, str, perl=T)
  matches <- character(0)
  if (attr(match, 'match.length') >= 0) {
    capture_start <- attr(match, 'capture.start')
    capture_length <- attr(match, 'capture.length')
    total_matches <- 1 + length(capture_start)
    matches <- character(total_matches)
    matches[1] <- substr(str, match, match + attr(match, 'match.length') - 1)
    if (length(capture_start) > 1) {
      for (i in 1:length(capture_start)) {
        matches[i + 1] <- substr(str, capture_start[[i]], capture_start[[i]] + capture_length[[i]] - 1)
      }
    }
  }
  matches
}



回答6:


This is how I ended up working around this problem. I used two separate regexes to match the first and second capture groups and run two gregexpr calls, then pull out the matched substrings:

regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

match.string <- gregexpr(regex.string, str, perl=T)[[1]]
match.number <- gregexpr(regex.number, str, perl=T)[[1]]

strings <- mapply(function (start, len) substr(str, start, start+len-1),
                  match.string,
                  attr(match.string, "match.length"))
numbers <- mapply(function (start, len) as.numeric(substr(str, start, start+len-1)),
                  match.number,
                  attr(match.number, "match.length"))



回答7:


Solution with strcapture from the utils:

x <- c("key1 :: 0.01",
       "key2 :: 0.02")
strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
           x = x,
           proto = list(key = character(), value = double()))
#>    key value
#> 1 key1  0.01
#> 2 key2  0.02



回答8:


As suggested in the stringr package, this can be achieved using either str_match() or str_extract().

Adapted from the manual:

library(stringr)

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", 
             "239 923 8115 and 842 566 4692",
             "Work: 579-499-7527", "$1000",
             "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

Extracting and combining our groups:

str_extract_all(strings, phone, simplify=T)
#      [,1]           [,2]          
# [1,] "219 733 8965" ""            
# [2,] "329-293-8753" ""            
# [3,] ""             ""            
# [4,] "239 923 8115" "842 566 4692"
# [5,] "579-499-7527" ""            
# [6,] ""             ""            
# [7,] "543.355.3679" ""   

Indicating groups with an output matrix (we're interested in columns 2+):

str_match_all(strings, phone)
# [[1]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "219 733 8965" "219" "733" "8965"
# 
# [[2]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "329-293-8753" "329" "293" "8753"
# 
# [[3]]
#      [,1] [,2] [,3] [,4]
# 
# [[4]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "239 923 8115" "239" "923" "8115"
# [2,] "842 566 4692" "842" "566" "4692"
# 
# [[5]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "579-499-7527" "579" "499" "7527"
# 
# [[6]]
#      [,1] [,2] [,3] [,4]
# 
# [[7]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "543.355.3679" "543" "355" "3679"



回答9:


This can be done using the package unglue, taking the example from the selected answer:

# install.packages("unglue")
library(unglue)

s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
unglue_data(s, "({x} :: {y})")
#>          x            y
#> 1 sometext 0.1231313213
#> 2 moretext     0.111222

Or starting from a data frame

df <- data.frame(col = s)
unglue_unnest(df, col, "({x} :: {y})",remove = FALSE)
#>                          col        x            y
#> 1 (sometext :: 0.1231313213) sometext 0.1231313213
#> 2     (moretext :: 0.111222) moretext     0.111222

you can get the raw regex from the unglue pattern, optionally with named capture :

unglue_regex("({x} :: {y})")
#>             ({x} :: {y}) 
#> "^\\((.*?) :: (.*?)\\)$"

unglue_regex("({x} :: {y})",named_capture = TRUE)
#>                     ({x} :: {y}) 
#> "^\\((?<x>.*?) :: (?<y>.*?)\\)$"

More info : https://github.com/moodymudskipper/unglue/blob/master/README.md



来源:https://stackoverflow.com/questions/952275/regex-group-capture-in-r-with-multiple-capture-groups

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!