How do I regex split by space, avoiding spaces within apostrophes?


I want "git log --format='(%h) %s' --abbrev=7 HEAD" to be split into

[
  "git",
  "log",
  "--format='(%h) %s'",
  "--abbrev=7",
  "HEAD"
]


                      
3 answers

时光说笑, 2021-01-25 13:29
                                                                       
As I understand, the idea is to split the string on contiguous spaces except where the spaces are part of a substring surrounded by single quotes. I believe this will work:

/(?:[^ ']*(?:'[^']+')?[^ ']*)*/


but invite readers to subject it to careful scrutiny.
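
The answer does not say which language or method applies the pattern. As a minimal sketch, assuming it is used to match tokens (rather than as a split delimiter) and run through Python's `re` module, it could be tried like this. The variable names are illustrative only.

```python
import re

text = "git log --format='(%h) %s' --abbrev=7 HEAD"
pattern = re.compile(r"(?:[^ ']*(?:'[^']+')?[^ ']*)*")

# The pattern can also match the empty string (at each space and at the
# end of the input), so empty matches are filtered out.
tokens = [m for m in pattern.findall(text) if m]
print(tokens)
# ['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
```

Because every part of the pattern is optional, filtering the empty matches is needed in any engine that reports them.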


This regex can be made self-documenting by writing it in free-spacing mode:

/
(?:         # begin a non-capture group
  [^ ']*    # match 0+ chars other than spaces and single quotes
  (?:       # begin non-capture group
    '[^']+' # match 1+ chars other than single quotes, surrounded
            # by single quotes 
  )?        # end non-capture group and make it optional
  [^ ']*    # match 0+ chars other than spaces and single quotes
)*          # end non-capture group and execute it 0+ times
/x          # free-spacing regex definition mode


This obviously will not work if there are nested single quotes.
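
For readers working in Python, the same free-spacing layout can be sketched with the `re.VERBOSE` flag. This is an assumption about how one might port it, not part of the original answer; note that in verbose mode whitespace inside a character class such as `[^ ']` is still significant.

```python
import re

TOKEN = re.compile(r"""
    (?:            # begin a non-capture group
      [^ ']*       # 0+ chars other than spaces and single quotes
      (?:
        '[^']+'    # 1+ chars other than single quotes, in single quotes
      )?           # the quoted part is optional
      [^ ']*       # 0+ chars other than spaces and single quotes
    )*             # repeat the group 0+ times
""", re.VERBOSE)

text = "git log --format='(%h) %s' --abbrev=7 HEAD"
# Drop the empty matches the pattern produces between tokens.
verbose_tokens = [m.group(0) for m in TOKEN.finditer(text) if m.group(0)]
print(verbose_tokens)
```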

@n.'pronouns'm. suggested an alternative regex that also works:

/([^ "']|'[^'"]*')*/
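
One thing to watch when trying this alternative in Python (an assumption, since the answer names no language): the pattern contains a capturing group, and `re.findall` would then return only the group's last repetition instead of the whole token. A minimal sketch that reads the full match instead:

```python
import re

text = "git log --format='(%h) %s' --abbrev=7 HEAD"
alt = re.compile(r"""([^ "']|'[^'"]*')*""")

# Use group(0) for the whole match; the capturing group itself only
# holds its last repetition. Empty matches are skipped.
alt_tokens = [m.group(0) for m in alt.finditer(text) if m.group(0)]
print(alt_tokens)
```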


逝去的感伤, 2021-01-25 13:35
                                                                       
As often in life, you have choices. 




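Use an expression that matches quoted substrings as a whole and captures only the whitespace outside them. Combine it with a replacement function that swaps the captured whitespace for a placeholder, then split on that placeholder: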

import re
string = "git log --format='(%h) %s' --abbrev=7 HEAD"

rx = re.compile(r"'[^']*'|(\s+)")

def replacer(match):
    if match.group(1):
        return "#@#"
    else:
        return match.group(0)

string = rx.sub(replacer, string)
parts = re.split('#@#', string)
#                 ^^^ same as in the function replacer
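
The placeholder approach assumes that `#@#` never occurs in the input. A minimal sketch that avoids the placeholder by slicing the string at the captured whitespace spans, using the same expression (the helper name `split_unquoted` is hypothetical):

```python
import re

rx = re.compile(r"'[^']*'|(\s+)")

def split_unquoted(text):
    """Split on whitespace that is not inside single quotes."""
    parts, start = [], 0
    for m in rx.finditer(text):
        # group(1) is set only for whitespace outside quotes;
        # quoted substrings match the first alternative and are skipped.
        if m.group(1) is not None:
            parts.append(text[start:m.start()])
            start = m.end()
    parts.append(text[start:])
    return parts

print(split_unquoted("git log --format='(%h) %s' --abbrev=7 HEAD"))
# ['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']
```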

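Use the third-party regex module, which supports the (*SKIP)(*FAIL) backtracking verbs, so quoted substrings can be skipped and the split done directly on the remaining whitespace: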

import regex as re
string = "git log --format='(%h) %s' --abbrev=7 HEAD"

rx = re.compile(r"'[^']*'(*SKIP)(*FAIL)|\s+")
parts = rx.split(string)

Write yourself a little parser:

def little_parser(string):
    quote = False
    stack = ''

    for char in string:
        if char == "'":
            stack += char
            quote = not quote
        elif (char == ' ' and not quote):
            yield stack
            stack = ''
        else:
            stack += char

    if stack:
        yield stack

string = "git log --format='(%h) %s' --abbrev=7 HEAD"
print(list(little_parser(string)))
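
One caveat with the hand-written parser: it splits on every single space, so a run of several spaces between arguments yields empty strings. A minimal variant that skips those empty tokens and treats any whitespace as a separator might look like this (the name `split_outside_quotes` is hypothetical):

```python
def split_outside_quotes(text):
    """Yield whitespace-separated tokens, keeping single-quoted runs intact."""
    in_quote = False
    buf = []
    for ch in text:
        if ch == "'":
            in_quote = not in_quote
            buf.append(ch)
        elif ch.isspace() and not in_quote:
            if buf:  # skip the empty token between consecutive spaces
                yield "".join(buf)
                buf = []
        else:
            buf.append(ch)
    if buf:
        yield "".join(buf)

print(list(split_outside_quotes("git  log   --format='(%h)  %s'   HEAD")))
# ['git', 'log', "--format='(%h)  %s'", 'HEAD']
```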





All three will yield

['git', 'log', "--format='(%h) %s'", '--abbrev=7', 'HEAD']

                                                                        
                                                        
            
            
              
                
闹比i, 2021-01-25 13:36
                                                                       
I found one possible (albeit ugly) solution in Python, which also works with double quotes:

>>> import re
>>> foo = '''git log --format='(%h) %s' --foo="a b" --bar='c d' HEAD'''
>>> re.findall(r'''(\S*'[^']+'\S*|\S*"[^"]+"\S*|\S+)''', foo)
['git', 'log', "--format='(%h) %s'", '--foo="a b"', "--bar='c d'", 'HEAD']


                                                                        
                                                        
            
            
              
                