Python split string by spaces except when in quotes, but keep the quotes

左心房为你撑大大i submitted on 2019-12-12 23:03:35

Question


I want to split the following string:

Quantity [*,'EXTRA 05',*]

With the desired results being:

["Quantity", "[*,'EXTRA 05',*]"]

The closest I have found is shlex.split; however, this removes the internal quotes, giving the following result:

['Quantity', '[*,EXTRA 05,*]']
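For reference, a minimal sketch that reproduces this behaviour (the variable names are only illustrative). In its default POSIX mode, shlex.split treats the single quotes as shell-style quoting, so the space inside them is protected but the quote characters themselves are dropped:

```python
import shlex

text = "Quantity [*,'EXTRA 05',*]"

# Default POSIX mode: the quotes keep 'EXTRA 05' together as one token,
# but the quote characters are removed from the result.
tokens = shlex.split(text)
print(tokens)  # ['Quantity', '[*,EXTRA 05,*]']
```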

Any suggestions would be greatly appreciated.

EDIT:

It will also need to handle multiple splits, such as:

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

To:

["Quantity", "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]


Answer 1:


The basic tool for processing strings is regular expressions (the re module).

Given the information you provided (which may be insufficient), the following code does the job:

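import re

# Two alternatives: a run of non-bracket text, or one bracketed group.
# Raw strings keep the backslashes intact for the regex engine.
r = re.compile(r'(?! )[^[]+?(?= *\[)'
               r'|'
               r'\[.+?\]')


s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
print(r.findall(s1))
print('---------------')

s2 = "'zug hug'Quantity boondoggle 'fish face monkey "\
     "dung' [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
print(r.findall(s2))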

result

['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]  
---------------
["'zug hug'Quantity boondoggle 'fish face monkey dung'", "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]

The regular expression pattern must be understood as follows:

'|' means OR

So the regex pattern expresses two partial RE:
(?! )[^[]+?(?= *\[)
and
\[.+?\]

The first partial RE :

The core is [^[]+
Brackets define a set of characters. Because the symbol ^ comes immediately after the opening bracket [ , the set is negated: it contains every character except those listed after the ^.
Here [^[] means any character that isn't an opening bracket [ and, since a + follows the set, [^[]+ means a sequence of one or more characters, none of which is an opening bracket.

Now, there is a question mark after [^[]+ : it makes the + lazy (non-greedy), so the match is as short as possible and stops as soon as what follows the question mark can match.
Here, what follows the ? is (?= *\[) , a lookahead assertion: (?=....) marks it as a positive lookahead, and  *\[ is the sequence in front of which the matched sequence must stop.  *\[ means: zero, one or more blanks followed by an opening bracket (the backslash \ is needed to remove the special meaning of [ as the opening of a set of characters).
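To see what the question mark changes, here is a small sketch comparing the lazy +? with a plain greedy + on the first sub-pattern alone. The set is written [^\[] here, which matches the same characters as [^[] but avoids the "possible nested set" FutureWarning that recent Python versions emit; the variable names are only illustrative.

```python
import re

s = "Quantity [*,'EXTRA 05',*]"

# Lazy: stop as soon as the lookahead (optional blanks, then "[") succeeds,
# so the blank before the bracket is left out of the match.
lazy = re.search(r'[^\[]+?(?= *\[)', s).group()

# Greedy: take as many non-bracket characters as possible, which
# swallows the trailing blank as well.
greedy = re.search(r'[^\[]+(?= *\[)', s).group()

print(repr(lazy))    # 'Quantity'
print(repr(greedy))  # 'Quantity '
```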

There's also (?! ) in front of the core. It is a negative lookahead assertion: it makes this partial RE match only sequences that do not begin with a blank, which avoids matching the runs of blanks between groups. Remove this (?! ) and you'll see the effect.
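The effect of removing (?! ) can be checked with a quick sketch like the one below (the names with_guard and without_guard are just for illustration, and the set is written [^\[] , which is equivalent to [^[] ). Without the negative lookahead, each single blank that sits directly before a bracket is returned as a match of its own:

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

with_guard = re.compile(r'(?! )[^\[]+?(?= *\[)|\[.+?\]')
without_guard = re.compile(r'[^\[]+?(?= *\[)|\[.+?\]')

print(with_guard.findall(s1))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]

print(without_guard.findall(s1))
# ['Quantity', ' ', "[*,'EXTRA 05',*]", ' ', "[*,'EXTRA 09',*]"]
```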

The second partial RE :

\[.+?\] means: the opening bracket character [ , then a sequence of characters matched lazily by .+? (the dot matches any character except \n); this sequence stops at the first closing bracket character ] , which is the last character to be matched.
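A short sketch of why the lazy .+? matters in this second sub-pattern (variable names are illustrative): a greedy .+ would run on to the last closing bracket in the string and merge both groups into one match.

```python
import re

s1 = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

# Lazy: each match ends at the first "]" it meets.
lazy_groups = re.findall(r'\[.+?\]', s1)

# Greedy: the match extends to the last "]" in the string.
greedy_groups = re.findall(r'\[.+\]', s1)

print(lazy_groups)    # ["[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
print(greedy_groups)  # ["[*,'EXTRA 05',*] [*,'EXTRA 09',*]"]
```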


EDIT

import re

string = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
# Split on each space that is immediately followed by an opening bracket.
print(re.split(r' (?=\[)', string))

result

['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
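One thing to keep in mind: the pattern above consumes exactly one space per split, so if the input may contain several blanks before a bracket, the extra ones stay attached to the preceding item. A possible variant, offered only as a sketch (the helper name split_before_brackets is made up for this example), is to split on a whole run of whitespace instead:

```python
import re

def split_before_brackets(text):
    """Split on runs of whitespace that are directly followed by '['."""
    return re.split(r'\s+(?=\[)', text)

print(split_before_brackets("Quantity   [*,'EXTRA 05',*]  [*,'EXTRA 09',*]"))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
```

Spaces inside the quotes are left alone because they are not followed by an opening bracket.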

!!




Answer 2:


A warning for picky readers: this algorithm WON'T correctly split every string you pass to it, only strings like:

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"

"Quantity [*,'EXTRA 05',*]"

"Quantity [*,'EXTRA 05',*] [*,'EXTRA 10',*] [*,'EXTRA 07',*] [*,'EXTRA 09',*]"

string = "Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"
parts = string.split(" ")
splitted_string = []

# Add "Quantity" at position 0 of splitted_string
splitted_string.append(parts[0])

# The loop goes from 1 to the length of parts, increasing x by 2:
# in the first iteration x is 1 and x+1 is 2, in the second x=3 and x+1=4, etc.
# The first iteration concatenates "[*,'EXTRA" and "05',*]" into one string
# The second iteration concatenates "[*,'EXTRA" and "09',*]" into one string
# A longer string with more such groups works the same way
for x in range(1, len(parts), 2):
    splitted_string.append("%s %s" % (parts[x], parts[x+1]))

When I execute the code, splitted_string at the end contains:

['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
splitted_string[0] = 'Quantity'
splitted_string[1] = "[*,'EXTRA 05',*]"
splitted_string[2] = "[*,'EXTRA 09',*]"

I think that is exactly what you're looking for. If I'm wrong, or if you need an explanation of the code, let me know. I hope it helps.
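Since this approach depends on every bracket group containing exactly one space, it may be worth wrapping it in a function that checks that assumption first. The following is only a sketch; the function name pair_split and the error message are invented for the example.

```python
def pair_split(text):
    """Re-join every pair of space-separated tokens after the first one.

    Assumes each bracket group contains exactly one space, as in the
    sample strings above.
    """
    parts = text.split(" ")
    # One leading name plus two tokens per group gives an odd count.
    if len(parts) % 2 == 0:
        raise ValueError("expected one name followed by two-token groups")
    result = [parts[0]]
    for i in range(1, len(parts), 2):
        result.append(parts[i] + " " + parts[i + 1])
    return result

print(pair_split("Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
```

The check only catches inputs with an even number of tokens; a string with an odd token count but a different layout would still be paired up incorrectly.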




Answer 3:


Assuming you want a general solution for splitting at spaces but not at spaces inside quotations: I don't know of any Python library to do this, but that doesn't mean there isn't one.

In the absence of a known pre-rolled solution I would simply roll my own. It's relatively easy to scan a string looking for spaces and then use Python's slicing to divide the string into the parts you want. To ignore spaces inside quotes, you can keep a flag that toggles whenever a quote symbol is encountered and switches the space detection off and on.

This is some code I knocked up to do this; it is not extensively tested:

def spaceSplit(string) :
  last = 0
  splits = []
  inQuote = None
  for i, letter in enumerate(string) :
    if inQuote :
      if (letter == inQuote) :
        inQuote = None
    else :
      if (letter == '"' or letter == "'") :
        inQuote = letter

    if not inQuote and letter == ' ' :
      splits.append(string[last:i])
      last = i+1

  if last < len(string) :
    splits.append(string[last:])

  return splits
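As a point of comparison, the same idea (spaces only count when they are outside quotes, and the quotes are kept) can also be sketched as a single regular expression. This is an alternative formulation, not the code above; the names TOKEN and split_outside_quotes are made up for the example.

```python
import re

# One token is a run of: characters that are neither whitespace nor quotes,
# OR a complete double-quoted or single-quoted section (quotes kept).
TOKEN = re.compile(r"""(?:[^\s"']|"[^"]*"|'[^']*')+""")

def split_outside_quotes(text):
    """Return the tokens of text, splitting only on unquoted whitespace."""
    return TOKEN.findall(text)

print(split_outside_quotes("Quantity [*,'EXTRA 05',*] [*,'EXTRA 09',*]"))
# ['Quantity', "[*,'EXTRA 05',*]", "[*,'EXTRA 09',*]"]
```

Unlike the loop version, this sketch never returns empty strings for consecutive spaces, and an unbalanced quote character is simply skipped, so the two are not drop-in replacements for each other.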


Source: https://stackoverflow.com/questions/20256066/python-split-string-by-spaces-except-when-in-quotes-but-keep-the-quotes
