I got a string of such format:
\"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)\"
so basicly i
Here's a general technique I've used in the past for such cases:
Use the sub
function of the re
module with a function as replacement argument. The function keeps track of opening and closing parens, brackets and braces, as well as single and double quotes, and performs a replacement only outside of such bracketed and quoted substrings. You can then replace the non-bracketed/quoted commas with another character which you're sure doesn't appear in the string (I use the ASCII/Unicode group-separator: chr(29) code), then do a simple string.split on that character. Here's the code:
import re
def srchrepl(srch, repl, string):
"""Replace non-bracketed/quoted occurrences of srch with repl in string"""
resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
+ srch + """])|(?P<rbrkt>[)\]}])""")
return resrchrepl.sub(_subfact(repl), string)
def _subfact(repl):
"""Replacement function factory for regex sub method in srchrepl."""
level = 0
qtflags = 0
def subf(mo):
nonlocal level, qtflags
sepfound = mo.group('sep')
if sepfound:
if level == 0 and qtflags == 0:
return repl
else:
return mo.group(0)
elif mo.group('lbrkt'):
level += 1
return mo.group(0)
elif mo.group('quote') == "'":
qtflags ^= 1 # toggle bit 1
return "'"
elif mo.group('quote') == '"':
qtflags ^= 2 # toggle bit 2
return '"'
elif mo.group('rbrkt'):
level -= 1
return mo.group(0)
return subf
If you don't have nonlocal
in your version of Python, just change it to global
and define level
and qtflags
at the module level.
Here's how it's used:
>>> GRPSEP = chr(29)
>>> string = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> lst = srchrepl(',', GRPSEP, string).split(GRPSEP)
>>> lst
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
s = re.split(r',\s*(?=[^)]*(?:\(|$))', x)
The lookahead matches everything up to the next open-parenthesis or to the end of the string, iff there's no close-parenthesis in between. That ensures that the comma is not inside a set of parentheses.
This post helped me a lot. I was looking to split a string by commas positioned outside quotes. I used this as a starter. My final line of code was regEx = re.compile(r'(?:[^,"]|"[^"]*")+')
This did the trick. Thanks a ton.
An attempt on human-readable regex:
import re
regex = re.compile(r"""
# name starts and ends on word boundary
# no '(' or commas in the name
(?P<name>\b[^(,]+\b)
\s*
# everything inside parentheses is a role
(?:\(
(?P<role>[^)]+)
\))? # role is optional
""", re.VERBOSE)
s = ("Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley,"
"Jane Doe (Jane Doe)")
print re.findall(regex, s)
Output:
[('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'),
('Elvis Presley', ''), ('Jane Doe', 'Jane Doe')]