I have a string in the following format:
"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
so basically I
My answer will not use regex.
I think a simple character scanner with an "in_actor_name" state should work. Remember that an entry is terminated either by ')' or by a comma encountered while in the "in_actor_name" state.
My try:
s = 'Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)'
in_actor_name = 1  # 1 while reading a name, 0 while inside (...)
role = ''
name = ''
for c in s:
    if c == ')' or (c == ',' and in_actor_name):
        # end of an entry: print it (if any) and reset
        in_actor_name = 1
        name = name.strip()
        if name:
            print("%s: %s" % (name, role))
        name = ''
        role = ''
    elif c == '(':
        in_actor_name = 0
    else:
        if in_actor_name:
            name += c
        else:
            role += c
# a trailing name with no role and no closing ')' is still pending here
if name.strip():
    print("%s: %s" % (name.strip(), role))
Output:
Wilbur Smith: Billy, son of John
Eddie Murphy: John
Elvis Presley:
Jane Doe: Jane Doe
Split by ")" first, then split each piece by "(":
>>> s="Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> s.split(")")
['Wilbur Smith (Billy, son of John', ', Eddie Murphy (John', ', Elvis Presley, Jane Doe (Jane Doe', '']
>>> for i in s.split(")"):
...     print(i.split("("))
...
['Wilbur Smith ', 'Billy, son of John']
[', Eddie Murphy ', 'John']
[', Elvis Presley, Jane Doe ', 'Jane Doe']
['']
You can do further checking to get the names that don't come with ().
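A rough sketch of what that further checking could look like. The function name parse_by_split is made up for illustration, and it assumes there are no nested or unmatched parentheses. The idea is that within each piece, only the last comma-separated name owns the role; any earlier names have none.

```python
def parse_by_split(s):
    """Return (name, role) pairs; role is '' when no (...) follows the name."""
    pairs = []
    for chunk in s.split(")"):
        # head holds one or more names, role holds the text after "("
        head, sep, role = chunk.partition("(")
        names = [n.strip() for n in head.split(",") if n.strip()]
        # every name except the last one in this chunk has no role
        for n in names[:-1]:
            pairs.append((n, ""))
        if names:
            pairs.append((names[-1], role.strip() if sep else ""))
    return pairs


s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
for name, role in parse_by_split(s):
    print("%s: %s" % (name, role))
```

Note that this silently accepts malformed input, so it is only reasonable when the data is known to be clean.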
One way to do it is to use findall with a regex that greedily matches things that can go between separators, e.g.:
>>> import re
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
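The regex above matches one or more of the following: a character that is neither a comma nor an open paren, or an open paren followed by zero or more characters that are not close parens and then a close paren.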
One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.
Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:
"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
If you need to deal with nesting, your best bet would be to partition the string into parens, commas, and everything else (essentially tokenizing it; this part could still be done with regexes). Then walk through those tokens reassembling the fields, keeping track of your nesting level as you go. Keeping track of the nesting level is what regexes are incapable of doing on their own.
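The tokenize-and-walk idea can be sketched roughly like this. The name split_top_level is hypothetical, and the sketch assumes parentheses are balanced (it does not validate them):

```python
import re


def split_top_level(s):
    """Split s on commas that are not inside any parentheses."""
    fields, current, depth = [], [], 0
    # Tokenize into single parens/commas and runs of everything else.
    for tok in re.findall(r'[(),]|[^(),]+', s):
        if tok == '(':
            depth += 1
        elif tok == ')':
            depth -= 1
        if tok == ',' and depth == 0:
            # a top-level comma closes the current field
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(tok)
    fields.append(''.join(current).strip())
    return fields


nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
print(split_top_level(nested))
```

Unlike the findall version, adjacent separators here produce an empty string, so "a,,b" yields three fields.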
I certainly agree with @Wogan above that using the csv module is a good approach. Having said that, if you still want to try a regex solution, give this a try, but you will have to adapt it to the Python dialect:
string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
HTH
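For reference, a minimal sketch of how the pattern above might be adapted to Python's re module. It assumes the roles are wrapped in double quotes, so the parentheses are replaced first. The replace step and the names COMMA_OUTSIDE_QUOTES, quoted and fields are illustrative and not part of the original suggestion.

```python
import re

# Matches a comma only when an even number of double quotes follows it,
# i.e. when the comma is not inside a quoted section.
COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
# Assumption: the data itself contains no double quotes and no nested parens.
quoted = s.replace("(", '"').replace(")", '"')
fields = [f.strip() for f in COMMA_OUTSIDE_QUOTES.split(quoted)]
print(fields)
```

The lookahead rescans the rest of the string for every comma, so this may get slow on very long inputs.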
None of the answers above are correct if there are any errors or noise in your data.
It's easy to come up with a good solution if you know the data is right every time. But what happens if there are formatting errors? What do you want to have happen?
Suppose there are nesting parentheses? Suppose there are unmatched parentheses? Suppose the string ends with or begins with a comma, or has two in a row?
All of the above solutions will produce more or less garbage and not report it to you.
Were it up to me, I'd start with a pretty strict restriction on what "correct" data was (no nesting parentheses, no unmatched parentheses, and no empty segments before, between or after commas), validate as I went, and then raise an exception if I wasn't able to validate.
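A sketch of what such a strict parser might look like, under exactly those restrictions. The names ActorFormatError and parse_strict are hypothetical, and the choice to also reject text after a closing parenthesis is an extra assumption:

```python
import re


class ActorFormatError(ValueError):
    """Hypothetical exception raised when the input fails validation."""


# A segment is a non-empty name, optionally followed by one "(role)".
SEGMENT = re.compile(r'\s*([^\s(),][^(),]*?)\s*(?:\(([^()]*)\))?\s*$')


def parse_strict(s):
    """Return (name, role) pairs; role is None when absent. Raise on bad data."""
    segments, start, depth = [], 0, 0
    for i, c in enumerate(s):
        if c == '(':
            if depth:
                raise ActorFormatError("nested '(' at position %d" % i)
            depth = 1
        elif c == ')':
            if not depth:
                raise ActorFormatError("unmatched ')' at position %d" % i)
            depth = 0
        elif c == ',' and not depth:
            # a comma outside parentheses ends the current segment
            segments.append(s[start:i])
            start = i + 1
    if depth:
        raise ActorFormatError("unclosed '('")
    segments.append(s[start:])

    pairs = []
    for seg in segments:
        m = SEGMENT.match(seg)
        if m is None:
            # empty segment, missing name, or stray text after ")"
            raise ActorFormatError("malformed segment %r" % seg)
        pairs.append((m.group(1), m.group(2)))
    return pairs
```

With this, the sample string parses into four pairs, while inputs such as "A,,B", "A (x" or "A (x (y))" raise ActorFormatError instead of producing garbage.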
I think the best way to approach this would be to use Python's built-in csv module.
Because the csv module only allows a one-character quotechar, you would need to do a replace on your inputs to convert () to something like | or ". Then make sure you are using an appropriate dialect and off you go.