How do I regex split by space, avoiding spaces within apostrophes?

忘掉有多难 2021-01-25 12:35

I want "git log --format='(%h) %s' --abbrev=7 HEAD" to be split into

[
  "git",
  "log",
  "--format='(%h) %s'",
  "--abbrev=7",
  "HEAD"
]

3 Answers
  • 2021-01-25 13:29

    As I understand, the idea is to split the string on contiguous spaces except where the spaces are part of a substring surrounded by single quotes. I believe this will work:

    /(?:[^ ']*(?:'[^']+')?[^ ']*)*/
    

    but invite readers to subject it to careful scrutiny.
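
To try the pattern outside an online tester, here is a minimal sketch in Python. Python is an assumption, since the answer does not name a language, and `tokenize` is a hypothetical helper name. Because every part of the pattern is optional, it also matches the empty string at each space, so the sketch filters those empty matches out.

```python
import re

# Same pattern as above, without the /.../ delimiters.
TOKEN = re.compile(r"(?:[^ ']*(?:'[^']+')?[^ ']*)*")

def tokenize(command):
    """Return the non-empty matches of TOKEN, in order."""
    # The pattern has no capture groups, so findall returns whole matches.
    # Empty matches occur at every space and are dropped here.
    return [m for m in TOKEN.findall(command) if m]

print(tokenize("git log --format='(%h) %s' --abbrev=7 HEAD"))
# ['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
```

Strictly speaking this extracts the tokens rather than splitting on the separators, but the resulting list is the one asked for in the question.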


    This regex can be made self-documenting by writing it in free-spacing mode:

    /
    (?:         # begin a non-capture group
      [^ ']*    # match 0+ chars other than spaces and single quotes
      (?:       # begin non-capture group
        '[^']+' # match 1+ chars other than single quotes, surrounded
                # by single quotes 
      )?        # end non-capture group and make it optional
      [^ ']*    # match 0+ chars other than spaces and single quotes
    )*          # end non-capture group and execute it 0+ times
    /x          # free-spacing regex definition mode
    

    This obviously will not work if there are nested single quotes.

    @n.'pronouns'm. suggested an alternative regex that also works:

    /([^ "']|'[^'"]*')*/
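
One caveat if you try this alternative in Python (assumed here, as the comment does not name a language): the pattern contains a capturing group, so `re.findall` would return only the last repetition of that group instead of the whole token. A sketch that reads the full match through `finditer` instead; `tokenize_alt` is a hypothetical name.

```python
import re

# Alternative pattern from the comment, without the /.../ delimiters.
ALT = re.compile(r"""([^ "']|'[^'"]*')*""")

def tokenize_alt(command):
    """Return the non-empty whole matches of ALT, in order."""
    # group(0) is the entire match; group(1) would only hold the last
    # repetition of the capturing group.
    return [m.group(0) for m in ALT.finditer(command) if m.group(0)]

print(tokenize_alt("git log --format='(%h) %s' --abbrev=7 HEAD"))
# ['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
```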
    


  • 2021-01-25 13:35

    As often in life, you have choices.


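    1. Use an expression that matches quoted sections as a whole and captures only the whitespace outside them. Combine it with a replacement function that swaps the captured whitespace for a placeholder, then split on that placeholder: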

      import re
      string = "git log --format='(%h) %s' --abbrev=7 HEAD"
      
      rx = re.compile(r"'[^']*'|(\s+)")
      
      def replacer(match):
          if match.group(1):
              return "#@#"
          else:
              return match.group(0)
      
      string = rx.sub(replacer, string)
      parts = re.split('#@#', string)
      #                 ^^^ same as in the function replacer
      
    2. Use the third-party regex module, which supports the (*SKIP)(*FAIL) verbs:

      import regex as re
      string = "git log --format='(%h) %s' --abbrev=7 HEAD"
      
      rx = re.compile(r"'[^']*'(*SKIP)(*FAIL)|\s+")
      parts = rx.split(string)
      
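    3. Write yourself a little parser:

      def little_parser(string):
          """Yield space-separated parts, keeping spaces inside single quotes."""
          quote = False  # True while inside a single-quoted section
          stack = ''     # characters collected for the current part
      
          for char in string:
              if char == "'":
                  stack += char
                  quote = not quote
              elif char == ' ' and not quote:
                  if stack:  # skip empty parts caused by consecutive spaces
                      yield stack
                  stack = ''
              else:
                  stack += char
      
          if stack:
              yield stack
      
      your_string = "git log --format='(%h) %s' --abbrev=7 HEAD"
      for part in little_parser(your_string):
          print(part)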
      



    All three will yield

    ['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
    
  • 2021-01-25 13:36

    I found one possible (albeit ugly) solution in Python, which also works with double quotes:

    >>> import re
    >>> foo = '''git log --format='(%h) %s' --foo="a b" --bar='c d' HEAD'''
    >>> re.findall(r'''(\S*'[^']+'\S*|\S*"[^"]+"\S*|\S+)''', foo)
    ['git', 'log', "--format='(%h) %s'", '--foo="a b"', "--bar='c d'", 'HEAD']
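
It may help to see what this pattern does with unbalanced input. In the sketch below (the sample string is made up for illustration), a quote that is never closed cannot satisfy either quoted alternative, so the pattern falls back to plain `\S+` and splits on every space.

```python
import re

# Same pattern as in the session above, compiled once for reuse.
PATTERN = re.compile(r'''(\S*'[^']+'\S*|\S*"[^"]+"\S*|\S+)''')

# The single group wraps the whole pattern, so findall returns whole tokens.
print(PATTERN.findall("echo 'unterminated text"))
# ['echo', "'unterminated", 'text']
```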
    
    