Capture all consecutive all-caps words with regex in Python?

佛祖请我去吃肉 2021-02-10 20:08

I am trying to match all consecutive all caps words/phrases using regex in Python. Given the following:

    text = "The following words are ALL CAPS. The follow


        
4 Answers
  • 2021-02-10 20:48

    Keeping your regex, you can use strip() and filter:

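    import re

    string = "The following words are ALL CAPS. The following word is in CAPS."
    # findall also returns the lone spaces between lowercase words; strip() and filter() drop them
    result = list(filter(None, [x.strip() for x in re.findall(r"\b[A-Z\s]+\b", string)]))
    # ['ALL CAPS', 'CAPS']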
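
To see why the cleanup step is needed, the sketch below prints the raw matches next to the cleaned ones. The third pattern is an assumed alternative that is not part of this answer: it only allows whitespace between runs of capitals, so nothing has to be stripped afterwards. The names raw, cleaned and direct are illustrative.

```python
import re

string = "The following words are ALL CAPS. The following word is in CAPS."

# Raw matches: [A-Z\s]+ also matches the single spaces between lowercase words.
raw = re.findall(r"\b[A-Z\s]+\b", string)

# Cleanup as in the answer: strip whitespace, then drop the empty strings.
cleaned = [m.strip() for m in raw if m.strip()]

# Assumed alternative (not from the answer): whitespace is only allowed
# between two runs of capitals, so no whitespace-only matches appear.
direct = re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b", string)

print(raw)      # includes ' ' entries and ' ALL CAPS' with a leading space
print(cleaned)  # ['ALL CAPS', 'CAPS']
print(direct)   # ['ALL CAPS', 'CAPS']
```

Both approaches give the same result on this sentence; the alternative simply moves the whitespace handling into the pattern.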
    
  • 2021-02-10 20:54

    Your regex relies on an explicit condition (a space after the letters).

    import re
    matches = re.findall(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])", text)


    This captures runs of A to Z, optionally joined by one whitespace character, as long as the last character is not a lowercase letter, a digit, or a non-alphabetic character.
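
A quick check of this pattern, as a sketch: the first input is the sentence used in the other answers, the second is a made-up sentence (an assumption, not from the question) that shows one consequence of the pattern. Each match needs at least three characters, so one- and two-letter words are skipped.

```python
import re

pattern = r"([A-Z]+\s?[A-Z]+[^a-z0-9\W])"

text = "The following words are ALL CAPS. The following word is in CAPS."
matches = re.findall(pattern, text)
print(matches)  # ['ALL CAPS', 'CAPS']

# Hypothetical input: "I" and "TV" are too short for the three required
# pieces ([A-Z]+, [A-Z]+, and the final character class), so only NASA matches.
short = "I like TV and NASA."
short_matches = re.findall(pattern, short)
print(short_matches)  # ['NASA']
```

If single letters such as I or A should count, a pattern with an explicit single-letter alternative is a better fit.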

  • 2021-02-10 21:01

    This one does the job:

    import re
    text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
    matches = re.findall(r"(\b(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)", text)
    print(matches)
    

    Output:

    ['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']
    

    Explanation:

    (           : start group 1
      \b        : word boundary
      (?:       : start non-capturing group
        [A-Z]+  : 1 or more capitals
        [a-z]?  : 0 or 1 lowercase letter
        [A-Z]*  : 0 or more capitals
       |        : OR
        [A-Z]*  : 0 or more capitals
        [a-z]?  : 0 or 1 lowercase letter
        [A-Z]+  : 1 or more capitals
      )         : end group
      \b        : word boundary
      (?:       : start non-capturing group
        \s+     : 1 or more whitespace characters
        (?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+) : same as above
        \b      : word boundary
      )*        : 0 or more repetitions of the non-capturing group
    )           : end group 1
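
The breakdown above can also live inside the pattern itself. The sketch below rewrites the same regex with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. The names WORD and PATTERN are illustrative and not part of the answer.

```python
import re

# One "word": capitals with at most one lowercase letter, which may come
# before, inside, or after the capitals.
WORD = r"(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)"

PATTERN = re.compile(
    rf"""
    (                     # group 1: the whole phrase
      \b{WORD}\b          # first word, between word boundaries
      (?:\s+{WORD}\b)*    # zero or more further words after whitespace
    )
    """,
    re.VERBOSE,
)

text = "tHE following words aRe aLL CaPS. ThE following word Is in CAPS."
matches = PATTERN.findall(text)
print(matches)  # ['tHE', 'aLL CaPS', 'ThE', 'Is', 'CAPS']
```

Factoring the repeated word sub-pattern into one string keeps the two copies from drifting apart when the pattern is edited.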
    
  • 2021-02-10 21:05

    Assuming you want to start and end on a letter, and only include letters and whitespace:

    \b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b
    

    The |[A-Z] alternative captures single-letter words such as I or A.
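
A minimal runnable sketch of this pattern. The first input is the sentence from the other answers; the second is a made-up sentence (an assumption, not from the question) that exercises the single-letter alternative.

```python
import re

pattern = r"\b([A-Z][A-Z\s]*[A-Z]|[A-Z])\b"

text = "The following words are ALL CAPS. The following word is in CAPS."
matches = re.findall(pattern, text)
print(matches)  # ['ALL CAPS', 'CAPS']

# Hypothetical input: "I" is matched by the |[A-Z] branch, and "A BIG DOG"
# is one phrase because whitespace is allowed between the capitals.
extra = "I saw A BIG DOG. Then NASA called."
extra_matches = re.findall(pattern, extra)
print(extra_matches)  # ['I', 'A BIG DOG', 'NASA']
```

Because \s also matches newlines, a phrase can span a line break with this pattern; replace \s with a literal space if that is not wanted.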
