How to split a string by commas positioned outside of parentheses?

I got a string of such format:

"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"

so basically I

10 answers

醉梦人生

2020-11-27 19:55

My answer will not use regex.

I think a simple character scanner with an "in_actor_name" state should work. Remember that a record is terminated either by ')' or by a comma seen while in the "in_actor_name" state.

My try:

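s = 'Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)'

in_actor_name = True  # True while reading a name, False while inside "(...)"
role = ''
name = ''
for c in s:
    # A record ends at ')' or at a comma that is outside parentheses.
    if c == ')' or (c == ',' and in_actor_name):
        in_actor_name = True
        name = name.strip()
        if name:
            print("%s: %s" % (name, role))
        name = ''
        role = ''
    elif c == '(':
        in_actor_name = False
    else:
        if in_actor_name:
            name += c
        else:
            role += c
# Flush a trailing name that has no role and no comma after it.
name = name.strip()
if name:
    print("%s: %s" % (name, role))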

Output:

Wilbur Smith: Billy, son of John
Eddie Murphy: John
Elvis Presley: 
Jane Doe: Jane Doe

孤街浪徒

2020-11-27 19:58

Split by ")":

>>> s="Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> s.split(")")
['Wilbur Smith (Billy, son of John', ', Eddie Murphy (John', ', Elvis Presley, Jane Doe (Jane Doe', '']
>>> for i in s.split(")"):
...   print(i.split("("))
...
['Wilbur Smith ', 'Billy, son of John']
[', Eddie Murphy ', 'John']
[', Elvis Presley, Jane Doe ', 'Jane Doe']
['']

You can do further checking to pick out the names that don't come with ().
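The "further checking" could look something like the sketch below. The function name `parse_cast` and the choice of `None` for a missing role are assumptions made for illustration, not part of the answer above. It assumes parentheses are never nested and that, within each chunk, only the last name before "(" owns the role.

```python
def parse_cast(s):
    """Return (name, role) pairs; role is None when a name has no parentheses."""
    result = []
    for chunk in s.split(")"):
        # Everything before "(" is one or more names, everything after is a role.
        head, sep, role = chunk.partition("(")
        names = [n.strip() for n in head.split(",") if n.strip()]
        if not names:
            continue
        # Names that are followed by a comma rather than "(" have no role.
        for name in names[:-1]:
            result.append((name, None))
        result.append((names[-1], role.strip() if sep else None))
    return result


s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
for name, role in parse_cast(s):
    print(name, "->", role)
```

Like the split itself, this silently accepts malformed input such as unmatched parentheses.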

一整个雨季

2020-11-27 19:59
One way to do it is to use findall with a regex that greedily matches everything that can go between separators, e.g.:
```
>>> import re
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
```
The regex above matches one or more:
- non-comma, non-open-paren characters
- strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren
One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.
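A small sketch of that quirk, reusing the same pattern on a made-up input with two commas in a row (the name `FIELD_RE` and the sample string are only for illustration):

```python
import re

# Same pattern as above: runs of non-comma, non-"(" characters or "(...)" groups.
FIELD_RE = re.compile(r'(?:[^,(]|\([^)]*\))+')

fields = FIELD_RE.findall("a,,b (x, y)")
# The empty field between the two commas is dropped rather than returned as ''.
print(fields)
```

If empty fields matter for your data, this approach hides them, so a splitter that walks the string is probably the safer choice.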

Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:
```
"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
```
If you need to deal with nesting, your best bet would be to partition the string into parens, commas, and everything else (essentially tokenizing it -- this part could still be done with regexes) and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go (keeping track of the nesting level is what regexes are incapable of doing on their own).
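The tokenize-and-walk idea might be sketched as follows. The names `TOKEN_RE` and `split_top_level` are hypothetical, and clamping the depth at zero on a stray ")" is an assumption of this sketch rather than something the answer prescribes.

```python
import re

# Tokens are a single "(", ")" or ",", or a run of any other characters.
TOKEN_RE = re.compile(r"[(),]|[^(),]+")


def split_top_level(text):
    """Split text on commas that are not inside (possibly nested) parentheses."""
    fields = []
    current = []
    depth = 0  # current parenthesis nesting level
    for token in TOKEN_RE.findall(text):
        if token == "(":
            depth += 1
        elif token == ")":
            depth = max(depth - 1, 0)  # assumption: ignore a stray ")"
        if token == "," and depth == 0:
            # A top-level comma closes the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(token)
    fields.append("".join(current).strip())
    return fields


nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
print(split_top_level(nested))
```

Unlike the findall version, this keeps empty fields, so "a,,b" yields an empty string in the middle.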
你的背包

2020-11-27 20:00
I certainly agree with @Wogan above that using the csv module is a good approach. Having said that, if you still want to try a regex solution, give this a try, but you will have to adapt it to the Python dialect:
```
string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
```
HTH
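One possible Python adaptation is sketched below. The first function is a direct port of the regex above, which splits on commas outside double quotes. The second is a variant for the parentheses in the question; it is an assumption of this sketch, not something the answer states, and it only works when parentheses are balanced and not nested. Both function names are made up for illustration.

```python
import re


def split_outside_quotes(text):
    # Direct port: split on a comma followed by an even number of double quotes.
    return re.split(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))', text)


def split_outside_parens(text):
    # Variant: split on a comma unless a ")" comes before the next "(" or ")".
    return re.split(r',\s*(?![^()]*\))', text)


s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
print(split_outside_parens(s))
print(split_outside_quotes('a,"b,c",d'))
```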
囚心锁ツ

2020-11-27 20:00

None of the answers above are correct if there are any errors or noise in your data.

It's easy to come up with a good solution if you know the data is right every time. But what happens if there are formatting errors? What do you want to have happen?

Suppose there are nested parentheses? Suppose there are unmatched parentheses? Suppose the string ends with or begins with a comma, or has two in a row?

All of the above solutions will produce more or less garbage and not report it to you.

Were it up to me, I'd start with a pretty strict definition of what "correct" data is (no nested parentheses, no unmatched parentheses, and no empty segments before, between or after commas), validate as I went, and raise an exception if I wasn't able to validate.
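A minimal sketch of such a strict splitter, under exactly those three restrictions. The function name `split_strict`, the use of `ValueError`, and the error messages are assumptions made for illustration.

```python
def split_strict(text):
    """Split on top-level commas, raising ValueError on malformed input."""
    fields = []
    current = []
    inside = False  # True while between "(" and ")"
    for pos, ch in enumerate(text):
        if ch == "(":
            if inside:
                raise ValueError(f"nested '(' at position {pos}")
            inside = True
            current.append(ch)
        elif ch == ")":
            if not inside:
                raise ValueError(f"unmatched ')' at position {pos}")
            inside = False
            current.append(ch)
        elif ch == "," and not inside:
            field = "".join(current).strip()
            if not field:
                raise ValueError(f"empty segment before comma at position {pos}")
            fields.append(field)
            current = []
        else:
            current.append(ch)
    if inside:
        raise ValueError("unclosed '(' at end of input")
    field = "".join(current).strip()
    if not field:
        raise ValueError("empty segment at end of input")
    fields.append(field)
    return fields


s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
print(split_strict(s))
```

Whether bad input should abort the whole run or just be logged and skipped depends on your data; raising is only the strictest option.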

难免孤独

2020-11-27 20:06

I think the best way to approach this would be to use Python's built-in csv module.

Because the csv module only allows a one character quotechar, you would need to do a replace on your inputs to convert () to something like | or ". Then make sure you are using an appropriate dialect and off you go.
