What I was trying to achieve was something like this:
>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']
So I searched and found this perfect regular expression:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
As the next logical step I tried:
>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']
Why does this not work, and how do I achieve the result from the linked question in Python?
Edit: Solution summary
I tested all provided solutions with a few test cases:
string: ''
AplusKminus: ['']
casimir_et_hippolyte: []
two_hundred_success: []
kalefranz: string index out of range # with modification: either [] or ['']
string: ' '
AplusKminus: [' ']
casimir_et_hippolyte: []
two_hundred_success: [' ']
kalefranz: [' ']
string: 'lower'
all algorithms: ['lower']
string: 'UPPER'
all algorithms: ['UPPER']
string: 'Initial'
all algorithms: ['Initial']
string: 'dromedaryCase'
AplusKminus: ['dromedary', 'Case']
casimir_et_hippolyte: ['dromedary', 'Case']
two_hundred_success: ['dromedary', 'Case']
kalefranz: ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']
string: 'CamelCase'
all algorithms: ['Camel', 'Case']
string: 'ABCWordDEF'
AplusKminus: ['ABC', 'Word', 'DEF']
casimir_et_hippolyte: ['ABC', 'Word', 'DEF']
two_hundred_success: ['ABC', 'Word', 'DEF']
kalefranz: ['ABCWord', 'DEF']
In summary, the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space, thereby violating the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing the empty string for an empty input, while the solution by @200_success returns an empty list. I don't know where the Python community stands on that issue, so I am fine with either one. Since 200_success's solution is simpler, I accepted it as the correct answer.
As @AplusKminus has explained, re.split() never splits on an empty pattern match (this was the behavior before Python 3.7). Therefore, instead of splitting, you should try finding the components you are interested in.
Here is a solution using re.finditer() that emulates splitting:
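from re import finditer

def camel_case_split(identifier):
    # Lazily consume characters up to the next camel-case boundary or the end of the string
    matches = finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]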
Use re.sub() and split():
import re
name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()
Result
'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
Most of the time, when you don't need to check the format of a string, a global search is simpler than a split (for the same result):
re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')
returns
['Camel', 'Case', 'XYZ']
To handle dromedary case (a lowercase first word) too, you can use:
re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')
Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
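To see that the two lookaheads behave the same way on ordinary identifiers, here is a small check. The constant names and the sample strings are made up for this sketch and are not part of the original answer.

```python
import re

# The dromedary-aware expression from above, written with each lookahead.
LONG_FORM = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)')
SHORT_FORM = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])')

# Sample identifiers chosen for illustration (an assumption of this sketch).
SAMPLES = ['camelCaseXYZ', 'CamelCaseXYZ', 'XYZCamelCase', 'lower', 'UPPER']

for sample in SAMPLES:
    # (?![^A-Z]) succeeds when the next character is uppercase or absent.
    # (?=[A-Z]|$) checks the same thing, except that $ also matches just
    # before a trailing newline, which does not occur in these samples.
    assert LONG_FORM.findall(sample) == SHORT_FORM.findall(sample)
    print(sample, '->', SHORT_FORM.findall(sample))
```

The two forms can diverge for strings that end in a newline, so the short form is only a drop-in replacement when the input has none.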
The documentation for Python's re.split (in versions before 3.7) says:
Note that split will never split a string on an empty pattern match.
When seeing this:
>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']
it becomes clear why the split does not work as expected. The re module finds empty matches, just as intended by the regular expression.
Since the documentation states that this is not a bug, but rather intended behavior, you have to work around that when trying to create a camel case split:
from re import finditer

def camel_case_split(identifier):
    matches = finditer('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
    split_string = []
    # index of beginning of slice
    previous = 0
    for match in matches:
        # get slice
        split_string.append(identifier[previous:match.start()])
        # advance index
        previous = match.start()
    # get remaining string
    split_string.append(identifier[previous:])
    return split_string
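For readers on Python 3.7 or later: the re documentation records that re.split() gained support for splitting on patterns that can match an empty string in that release, so the direct approach from the question behaves differently there. A minimal sketch, assuming Python 3.7+ (the name camel_case_split_direct is made up for this example):

```python
import re

# Zero-width boundaries: lower -> Upper, or Upper followed by Upper + lower.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')

def camel_case_split_direct(identifier):
    # Assumption: Python 3.7+, where re.split() splits on empty matches.
    return BOUNDARY.split(identifier)

print(camel_case_split_direct('CamelCaseXYZ'))
print(camel_case_split_direct('XYZCamelCase'))
```

On older interpreters the same call returns the input unsplit, so the finditer-based workaround is still needed there.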
I just stumbled upon this case and wrote a regular expression to solve it. It should work for any group of words, actually.
import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |  # Capitalized words / all lower case
    [A-Z]+ |  # All upper case
    \d+  # Numbers
''', re.VERBOSE)
The key here is the lookahead on the first possible case. It will match (and preserve) uppercase words before capitalized ones:
assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
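To illustrate why the order of the alternatives matters, the sketch below runs the same expression next to a reordered one. The name RE_NAIVE and the extra sample strings are assumptions made for this example only.

```python
import re

# Same expression as above, repeated so this snippet runs on its own.
RE_WORDS = re.compile(r'''
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |          # Capitalized words / all lower case
    [A-Z]+ |                # All upper case
    \d+                     # Numbers
''', re.VERBOSE)

# Hypothetical variant without the lookahead alternative, for comparison:
# the greedy upper-case run swallows the first letter of the next word.
RE_NAIVE = re.compile(r'[A-Z]+|[A-Z]?[a-z]+|\d+')

print(RE_WORDS.findall('FOOBar'))            # ['FOO', 'Bar']
print(RE_NAIVE.findall('FOOBar'))            # ['FOOB', 'ar']
print(RE_WORDS.findall('CamelCaseTest123'))  # ['Camel', 'Case', 'Test', '123']
print(RE_WORDS.findall('snake_case_2go'))    # ['snake', 'case', '2', 'go']
```

Because findall simply skips characters that no alternative matches, separators such as underscores drop out of the result.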
Here's another solution that requires less code and no complicated regular expressions:
def camel_case_split(string):
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]
Edit
The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like
def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
        return c1.islower() and c2.isupper()
    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr
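Both versions index string[0], so they raise an IndexError on an empty string, and both upper-case the first character. A sketch of a variant that avoids those two things, under the assumption that an empty input should yield an empty list (the name camel_case_split_safe is hypothetical):

```python
def camel_case_split_safe(string):
    # Assumption of this sketch: an empty input yields [] rather than [''].
    if not string:
        return []
    # Keep the first character as it is (no forced upper case).
    parts = [[string[0]]]
    for c in string[1:]:
        if parts[-1][-1].islower() and c.isupper():
            # lower -> upper transition: start a new word
            parts.append([c])
        else:
            # otherwise keep extending the current word
            parts[-1].append(c)
    return [''.join(part) for part in parts]

print(camel_case_split_safe('dromedaryCase'))  # ['dromedary', 'Case']
print(camel_case_split_safe('ABCWordDEF'))     # ['ABCWord', 'DEF']
```

The variant still breaks only on a lower-to-upper transition, so a run of capitals followed by a word stays glued together.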
I know that the question is tagged regex, but I still try to stay as far away from regex as possible. So, here is my solution without regex:
from functools import reduce

def split_camel(text, char):
    if len(text) <= 1:  # To avoid adding a wrong space in the beginning
        return text + char
    if char.isupper() and text[-1].islower():  # Regular camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else:  # Do nothing part
        return text + char

text = "PathURLFinder"
text = reduce(split_camel, text, "")
print(text)
# prints Path URL Finder
print(text.split(" "))
# prints ['Path', 'URL', 'Finder']
EDIT: As suggested, here is the code to put the functionality in a single function.
from functools import reduce

def split_camel(text):
    def splitter(text, char):
        if len(text) <= 1:  # To avoid adding a wrong space in the beginning
            return text + char
        if char.isupper() and text[-1].islower():  # Regular camel case
            return text + " " + char
        elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect camel case in case of abbreviations
            return text[:-1] + " " + text[-1] + char
        else:  # Do nothing part
            return text + char
    converted_text = reduce(splitter, text, "")
    return converted_text.split(" ")

split_camel("PathURLFinder")
# returns ['Path', 'URL', 'Finder']
I think the following is the optimum:
import re
def count_word():
    return re.findall('[A-Z]?[a-z]+', input('please enter your string'))
print(count_word())
I find regular expressions complicated to build, hard to debug, and unpredictable in execution speed. I like to use them in the search/replace function of my IDE, but I try to avoid them in programs.
Here is a quite straightforward solution in pure Python:
def camel_case_split(s):
    idx = [0] + [i for i, e in enumerate(s) if e.isupper()] + [len(s)]
    return [s[x:y] for x, y in zip(idx, idx[1:]) if x < y]
And some tests:
def test():
    TESTS = [
        ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
        ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
        ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
        ("Ta", ['Ta']),
        ("aT", ['a', 'T']),
        ("a", ['a']),
        ("T", ['T']),
        ("", []),
    ]
    for (q, a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()
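One thing the tests do not cover is a run of capitals. Since this approach starts a new piece at every uppercase letter, the inputs from the question come out differently from the result asked for there. A quick check, with the function repeated so the snippet runs on its own:

```python
def camel_case_split(s):
    # Same function as above: cut the string at every uppercase index.
    idx = [0] + [i for i, e in enumerate(s) if e.isupper()] + [len(s)]
    return [s[x:y] for x, y in zip(idx, idx[1:]) if x < y]

print(camel_case_split("CamelCaseXYZ"))  # ['Camel', 'Case', 'X', 'Y', 'Z']
print(camel_case_split("XYZCamelCase"))  # ['X', 'Y', 'Z', 'Camel', 'Case']
```

Whether that matters depends on whether the identifiers being split contain abbreviations.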
Source: https://stackoverflow.com/questions/29916065/how-to-do-camelcase-split-in-python