Regex with named capture groups getting all matches in Ruby

前端未结

关注

 10  1955

I have a string:

s=\"123--abc,123--abc,123--abc\"

I tried using Ruby 1.9\'s new feature \"named groups\" to fetch all named group info:

相关标签:

10条回答

孤独总比滥情好

2021-02-02 08:54
Chiming in super-late, but here's a simple way of replicating String#scan but getting the matchdata instead:
```
matches = []
foo.scan(regex){ matches << $~ }
```
matches now contains the MatchData objects that correspond to scanning the string.
0 讨论(0)
发布评论:

提交评论
- 加载中...
时光取名叫无心

2021-02-02 09:01
I really liked @Umut-Utkan's solution, but it didn't quite do what I wanted so I rewrote it a bit (note, the below might not be beautiful code, but it seems to work)
```
class String
  def scan2(regexp)
    names = regexp.names
    captures = Hash.new
    scan(regexp).collect do |match|
      nzip = names.zip(match)
      nzip.each do |m|
        captgrp = m[0].to_sym
        captures.add(captgrp, m[1])
      end
    end
    return captures
  end
end
```
Now, if you do
```
p '12f3g4g5h5h6j7j7j'.scan2(/(?<alpha>[a-zA-Z])(?<digit>[0-9])/)
```
You get
```
{:alpha=>["f", "g", "g", "h", "h", "j", "j"], :digit=>["3", "4", "5", "5", "6", "7", "7"]}
```
(ie. all the alpha characters found in one array, and all the digits found in another array). Depending on your purpose for scanning, this might be useful. Anyway, I love seeing examples of how easy it is to rewrite or extend core Ruby functionality with just a few lines!
0 讨论(0)
发布评论:

提交评论
- 加载中...

渐次进展

2021-02-02 09:05

A year ago I wanted regular expressions that were more easy to read and named the captures, so I made the following addition to String (should maybe not be there, but it was convenient at the time):

scan2.rb:

class String  
  #Works as scan but stores the result in a hash indexed by variable/constant names (regexp PLACEHOLDERS) within parantheses.
  #Example: Given the (constant) strings BTF, RCVR and SNDR and the regexp /#BTF# (#RCVR#) (#SNDR#)/
  #the matches will be returned in a hash like: match[:RCVR] = <the match> and match[:SNDR] = <the match>
  #Note: The #STRING_VARIABLE_OR_CONST# syntax has to be used. All occurences of #STRING# will work as #{STRING}
  #but is needed for the method to see the names to be used as indices.
  def scan2(regexp2_str, mark='#')
    regexp              = regexp2_str.to_re(mark)                       #Evaluates the strings. Note: Must be reachable from here!
    hash_indices_array  = regexp2_str.scan(/\(#{mark}(.*?)#{mark}\)/).flatten #Look for string variable names within (#VAR#) or # replaced by <mark>
    match_array         = self.scan(regexp)

    #Save matches in hash indexed by string variable names:
    match_hash = Hash.new
    match_array.flatten.each_with_index do |m, i|
      match_hash[hash_indices_array[i].to_sym] = m
    end
    return match_hash  
  end

  def to_re(mark='#')
    re = /#{mark}(.*?)#{mark}/
    return Regexp.new(self.gsub(re){eval $1}, Regexp::MULTILINE)    #Evaluates the strings, creates RE. Note: Variables must be reachable from here!
  end

end

Example usage (irb1.9):

> load 'scan2.rb'
> AREA = '\d+'
> PHONE = '\d+'
> NAME = '\w+'
> "1234-567890 Glenn".scan2('(#AREA#)-(#PHONE#) (#NAME#)')
=> {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}

Notes:

Of course it would have been more elegant to put the patterns (e.g. AREA, PHONE...) in a hash and add this hash with patterns to the arguments of scan2.

0 讨论(0)

旧巷少年郎

2021-02-02 09:10
@Nakilon is correct showing scan with a regex, however you don't even need to venture into regex land if you don't want to:
```
s = "123--abc,123--abc,123--abc"
s.split(',')
#=> ["123--abc", "123--abc", "123--abc"]

s.split(',').inject([]) { |a,s| a << s.split('--'); a }
#=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
```
This returns an array of arrays, which is convenient if you have multiple occurrences and need to see/process them all.
```
s.split(',').inject({}) { |h,s| n,v = s.split('--'); h[n] = v; h }
#=> {"123"=>"abc"}
```
This returns a hash, which, because the elements have the same key, has only the unique key value. This is good when you have a bunch of duplicate keys but want the unique ones. Its downside occurs if you need the unique values associated with the keys, but that appears to be a different question.
0 讨论(0)
发布评论:

提交评论
- 加载中...

伪装坚强ぢ

2021-02-02 09:10

If using ruby >=1.9 and the named captures, you could:

class String 
  def scan2(regexp2_str, placeholders = {})
    return regexp2_str.to_re(placeholders).match(self)
  end

  def to_re(placeholders = {})
    re2 = self.dup
    separator = placeholders.delete(:SEPARATOR) || '' #Returns and removes separator if :SEPARATOR is set.
    #Search for the pattern placeholders and replace them with the regex
    placeholders.each do |placeholder, regex|
      re2.sub!(separator + placeholder.to_s + separator, "(?<#{placeholder}>#{regex})")
    end    
    return Regexp.new(re2, Regexp::MULTILINE)    #Returns regex using named captures.
  end
end

Usage (ruby >=1.9):

> "1234:Kalle".scan2("num4:name", num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">

> re="num4:name".to_re(num4:'\d{4}', name:'\w+')
=> /(?<num4>\d{4}):(?<name>\w+)/m

> m=re.match("1234:Kalle")
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
> m[:num4]
=> "1234"
> m[:name]
=> "Kalle"

Using the separator option:

> "1234:Kalle".scan2("#num4#:#name#", SEPARATOR:'#', num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">

0 讨论(0)

难免孤独

2021-02-02 09:12
I like the match_all given by John, but I think it has an error.

The line:
```
  match_datas << md
```
works if there are no captures () in the regex.

This code gives the whole line up to and including the pattern matched/captured by the regex. (The [0] part of MatchData) If the regex has capture (), then this result is probably not what the user (me) wants in the eventual output.

I think in the case where there are captures () in regex, the correct code should be:
```
  match_datas << md[1]
```
The eventual output of match_datas will be an array of pattern capture matches starting from match_datas[0]. This is not quite what may be expected if a normal MatchData is wanted which includes a match_datas[0] value which is the whole matched substring followed by match_datas[1], match_datas[[2],.. which are the captures (if any) in the regex pattern.

Things are complex - which may be why match_all was not included in native MatchData.
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页