I am trying to match all consecutive all caps words/phrases using regex in Python. Given the following:
text = "The following words are ALL CAPS. The follow
Keeping your regex, you can use strip() and filter():
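import re
string = "The following words are ALL CAPS. The following word is in CAPS."
# strip() trims the surrounding whitespace; filter(None, ...) drops the empty strings.
# In Python 3, filter() returns an iterator, so wrap it in list().
result = list(filter(None, [x.strip() for x in re.findall(r"\b[A-Z\s]+\b", string)]))
# ['ALL CAPS', 'CAPS']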
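Your regex relies on an explicit condition (a space after the letters).
matches = re.findall(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])", text)
This captures runs of A to Z with at most one whitespace character between them. The final class [^a-z0-9\W] makes the match end on a character that is not a lowercase letter, a digit, or a non-word character.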
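To see how that pattern behaves, here is a small sketch run against the sentence from the question plus two made-up inputs (the extra strings and variable names are mine, not from the question). Note that the pattern needs at least three characters and allows only one whitespace gap, so short words and longer phrases may not come out the way you expect:

```python
import re

pattern = r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])"

text = "The following words are ALL CAPS. The following word is in CAPS."
sample_matches = re.findall(pattern, text)
print(sample_matches)

# Hypothetical extra inputs to probe the limits of the pattern.
two_letter = re.findall(pattern, "AN")              # too short: three characters are required
three_words = re.findall(pattern, "ONE TWO THREE")  # only one whitespace gap per match
print(two_letter)
print(three_words)
```

On the question's sentence this gives ['ALL CAPS', 'CAPS'], but "AN" yields nothing and "ONE TWO THREE" is split into 'ONE TWO' and 'THREE'.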
This one does the job:
import re
text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
matches = re.findall(r"(\b(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)",text)
print(matches)
Output:
['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']
Explanation:
( : start group 1
\b : word boundary
(?: : start non capture group
[A-Z]+ : 1 or more capitals
[a-z]? : 0 or 1 small letter
[A-Z]* : 0 or more capitals
| : OR
[A-Z]* : 0 or more capitals
[a-z]? : 0 or 1 small letter
[A-Z]+ : 1 or more capitals
) : end group
\b : word boundary
(?: : non capture group
\s+ : 1 or more whitespace characters
(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+) : same as above
\b : word boundary
)* : 0 or more times the non capture group
) : end group 1
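If it helps, the breakdown above can be written straight into the code with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is just a sketch of the same regex laid out over several lines; the names verbose_pattern and verbose_matches are mine:

```python
import re

verbose_pattern = re.compile(
    r"""
    (                         # start group 1
      \b                      # word boundary
      (?:                     # start non capture group
        [A-Z]+[a-z]?[A-Z]*    # capitals, optional small letter, optional capitals
        |                     # OR
        [A-Z]*[a-z]?[A-Z]+    # optional capitals, optional small letter, capitals
      )                       # end group
      \b                      # word boundary
      (?:                     # non capture group
        \s+                   # 1 or more whitespace characters
        (?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)   # same as above
        \b                    # word boundary
      )*                      # 0 or more times the non capture group
    )                         # end group 1
    """,
    re.VERBOSE,
)

text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
verbose_matches = verbose_pattern.findall(text)
print(verbose_matches)
```

It should return the same list as the one-line version, since only the layout of the pattern changes.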
Assuming you want to start and end on a letter, and only include letters and whitespace:
\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b
The |[A-Z] alternative is there to capture a single-letter word such as I or A.
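A quick sketch of that pattern in use, on the sentence from the question and on a made-up sentence with single-letter words (the second string and the variable names are my own, for illustration only):

```python
import re

caps_pattern = re.compile(r"\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b")

text = "The following words are ALL CAPS. The following word is in CAPS."
phrase_matches = caps_pattern.findall(text)
print(phrase_matches)

# Hypothetical input showing the single-letter alternative at work.
single_letter_matches = caps_pattern.findall("I have A PLAN.")
print(single_letter_matches)
```

Because each match must start and end on a capital letter, no strip() or filter() step is needed afterwards. Keep in mind that \s also matches newlines, so a phrase could span two lines.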