Regex with named capture groups getting all matches in Ruby

滥情空心 2021-02-02 08:12

I have a string:

s = "123--abc,123--abc,123--abc"

I tried using Ruby 1.9's new feature "named groups" to fetch all named group info:

10 Answers
  • 2021-02-02 08:54

    Chiming in super-late, but here's a simple way of replicating String#scan but getting the matchdata instead:

    matches = []
    foo.scan(regex){ matches << $~ }
    

    matches now contains the MatchData objects that correspond to scanning the string.
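
    Applied to the string from the question, that idea might look like the sketch below. The pattern and the group names `number` and `chars` are assumptions made for illustration, since the asker's original regex is not shown here. `Regexp.last_match` is the named equivalent of `$~`.

```ruby
s = "123--abc,123--abc,123--abc"

# Hypothetical pattern: the group names are assumed, not taken from the question.
regex = /(?<number>\d+)--(?<chars>[a-z]+)/

matches = []
# The block form of scan sets the last-match data for every hit,
# so each MatchData object can be collected in turn.
s.scan(regex) { matches << Regexp.last_match }

# Named groups can then be read from each MatchData.
pairs = matches.map { |m| [m[:number], m[:chars]] }
```

    Each element of `matches` is a full MatchData, so `m[0]`, `m.pre_match`, and the named groups are all available per match.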

  • 2021-02-02 09:01

    I really liked @Umut-Utkan's solution, but it didn't quite do what I wanted so I rewrote it a bit (note, the below might not be beautiful code, but it seems to work)

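    class String
      # Like scan, but returns a hash that maps each named group
      # to an array of everything that group captured.
      def scan2(regexp)
        names = regexp.names
        # Hash has no #add method; give every key an empty array by default instead.
        captures = Hash.new { |hash, key| hash[key] = [] }
        scan(regexp).each do |match|
          names.zip(match).each do |name, value|
            captures[name.to_sym] << value
          end
        end
        captures
      end
    end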
    

    Now, if you do

    p '12f3g4g5h5h6j7j7j'.scan2(/(?<alpha>[a-zA-Z])(?<digit>[0-9])/)
    

    You get

    {:alpha=>["f", "g", "g", "h", "h", "j", "j"], :digit=>["3", "4", "5", "5", "6", "7", "7"]}
    

    (i.e. all the alpha characters found in one array, and all the digits found in another array). Depending on your purpose for scanning, this might be useful. Anyway, I love seeing examples of how easy it is to rewrite or extend core Ruby functionality with just a few lines!
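
    For comparison, a similar result can probably be had without reopening String. The sketch below is only one possible arrangement: `scan_named` is a hypothetical helper name, and it reads each group from the MatchData by name instead of relying on the order of the arrays that `scan` returns.

```ruby
# Hypothetical helper (not part of Ruby's core API): collects every capture
# of every named group into a hash of arrays, keyed by group name.
def scan_named(string, regexp)
  # Start with an empty array for each named group.
  result = regexp.names.each_with_object({}) { |name, h| h[name.to_sym] = [] }
  string.scan(regexp) do
    md = Regexp.last_match
    regexp.names.each { |name| result[name.to_sym] << md[name] }
  end
  result
end

result = scan_named('12f3g4g5h5h6j7j7j', /(?<alpha>[a-zA-Z])(?<digit>[0-9])/)
```

    With the same input as above, `result` holds the same two arrays under `:alpha` and `:digit`.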

  • 2021-02-02 09:05

    A year ago I wanted regular expressions that were easier to read and that named their captures, so I made the following addition to String (it maybe should not live there, but it was convenient at the time):

    scan2.rb:

    class String  
      # Works like scan but stores the result in a hash indexed by variable/constant names (regexp PLACEHOLDERS) within parentheses.
      # Example: Given the (constant) strings BTF, RCVR and SNDR and the regexp /#BTF# (#RCVR#) (#SNDR#)/
      # the matches will be returned in a hash like: match[:RCVR] = <the match> and match[:SNDR] = <the match>
      # Note: The #STRING_VARIABLE_OR_CONST# syntax has to be used. All occurrences of #STRING# work like #{STRING},
      # but the marker syntax is needed for the method to see the names to be used as indices.
      def scan2(regexp2_str, mark='#')
        regexp              = regexp2_str.to_re(mark)                       #Evaluates the strings. Note: Must be reachable from here!
        hash_indices_array  = regexp2_str.scan(/\(#{mark}(.*?)#{mark}\)/).flatten #Look for string variable names within (#VAR#) or # replaced by <mark>
        match_array         = self.scan(regexp)
    
        #Save matches in hash indexed by string variable names:
        match_hash = Hash.new
        match_array.flatten.each_with_index do |m, i|
          match_hash[hash_indices_array[i].to_sym] = m
        end
        return match_hash  
      end
    
      def to_re(mark='#')
        re = /#{mark}(.*?)#{mark}/
        return Regexp.new(self.gsub(re){eval $1}, Regexp::MULTILINE)    #Evaluates the strings, creates RE. Note: Variables must be reachable from here!
      end
    
    end
    

    Example usage (irb1.9):

    > load 'scan2.rb'
    > AREA = '\d+'
    > PHONE = '\d+'
    > NAME = '\w+'
    > "1234-567890 Glenn".scan2('(#AREA#)-(#PHONE#) (#NAME#)')
    => {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}
    

    Notes:

    Of course it would have been more elegant to put the patterns (e.g. AREA, PHONE...) in a hash and add this hash with patterns to the arguments of scan2.

  • 2021-02-02 09:10

    @Nakilon is correct in showing scan with a regex; however, you don't even need to venture into regex land if you don't want to:

    s = "123--abc,123--abc,123--abc"
    s.split(',')
    #=> ["123--abc", "123--abc", "123--abc"]
    
    s.split(',').inject([]) { |a,s| a << s.split('--'); a }
    #=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
    

    This returns an array of arrays, which is convenient if you have multiple occurrences and need to see/process them all.

    s.split(',').inject({}) { |h,s| n,v = s.split('--'); h[n] = v; h }
    #=> {"123"=>"abc"}
    

    This returns a hash. Because all the elements here have the same key, it ends up with a single entry per unique key. This is good when you have a bunch of duplicate keys but want the unique ones. The downside is that you lose the other values associated with a repeated key, but keeping those appears to be a different question.
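
    If every value for a repeated key is needed, one option is to collect the values into arrays. This is a minimal sketch; the input string is a hypothetical variation on the question's data, changed so that the values differ.

```ruby
# Hypothetical input: same shape as the question's string, but with a repeated key
# that carries different values.
s = "123--abc,123--def,456--ghi"

# The default block gives each new key its own empty array,
# so values for a repeated key accumulate instead of overwriting each other.
grouped = s.split(',').each_with_object(Hash.new { |h, k| h[k] = [] }) do |pair, h|
  key, value = pair.split('--')
  h[key] << value
end
# grouped => {"123"=>["abc", "def"], "456"=>["ghi"]}
```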

  • 2021-02-02 09:10

    If you are using Ruby >= 1.9 with named captures, you could do this:

    class String 
      def scan2(regexp2_str, placeholders = {})
        return regexp2_str.to_re(placeholders).match(self)
      end
    
      def to_re(placeholders = {})
        re2 = self.dup
        separator = placeholders.delete(:SEPARATOR) || '' #Returns and removes separator if :SEPARATOR is set.
        #Search for the pattern placeholders and replace them with the regex
        placeholders.each do |placeholder, regex|
          re2.sub!(separator + placeholder.to_s + separator, "(?<#{placeholder}>#{regex})")
        end    
        return Regexp.new(re2, Regexp::MULTILINE)    #Returns regex using named captures.
      end
    end
    

    Usage (ruby >=1.9):

    > "1234:Kalle".scan2("num4:name", num4:'\d{4}', name:'\w+')
    => #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
    

    or

    > re="num4:name".to_re(num4:'\d{4}', name:'\w+')
    => /(?<num4>\d{4}):(?<name>\w+)/m
    
    > m=re.match("1234:Kalle")
    => #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
    > m[:num4]
    => "1234"
    > m[:name]
    => "Kalle"
    

    Using the separator option:

    > "1234:Kalle".scan2("#num4#:#name#", SEPARATOR:'#', num4:'\d{4}', name:'\w+')
    => #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
    
  • 2021-02-02 09:12

    I like the match_all given by John, but I think it has an error.

    The line:

      match_datas << md
    

    works if there are no captures () in the regex.

    This code gives the whole line up to and including the pattern matched/captured by the regex. (The [0] part of MatchData) If the regex has capture (), then this result is probably not what the user (me) wants in the eventual output.

    I think in the case where there are captures () in regex, the correct code should be:

      match_datas << md[1]
    

    The eventual output of match_datas will then be an array of the captured patterns, starting from match_datas[0]. This is not quite what may be expected if a normal MatchData is wanted, where match_datas[0] is the whole matched substring, followed by match_datas[1], match_datas[2], ..., which are the captures (if any) in the regex pattern.

    Things are complex - which may be why match_all was not included in native MatchData.
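
    One way to sidestep the choice between md and md[1] is to keep the whole MatchData for every match and let the caller pick what it needs. John's match_all is not reproduced in this thread, so the version below is an assumed reconstruction for illustration only, not his code.

```ruby
# Hypothetical match_all: returns one MatchData per match, so the whole match
# (md[0]) and the captures (md[1], md[2], ...) both stay available.
def match_all(string, regexp)
  match_datas = []
  pos = 0
  while pos <= string.length && (md = regexp.match(string, pos))
    match_datas << md
    # Continue after this match; step one extra character on an empty match
    # so the loop cannot get stuck at the same position.
    pos = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
  end
  match_datas
end

mds = match_all("123--abc,456--def", /(\d+)--([a-z]+)/)
whole_matches  = mds.map { |md| md[0] }
first_captures = mds.map { |md| md[1] }
```

    Here `whole_matches` plays the role of `match_datas << md` (reduced to the matched text) and `first_captures` the role of `match_datas << md[1]`.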
