Regex Expression For a String

余生分开走 2021-01-23 10:22

I want to split a string in Python using a regular expression.

Sample string:

Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more

3 Answers
  • 2021-01-23 10:54

    Here is a working script, albeit a bit hackish:

    import re

    inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
    parts = re.findall(r'[A-Z]{2,}(?: [A-Z0-9.]+)*|(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*', inp)
    print(parts)
    

    This prints:

    ['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1',
     'and', 'SCENE 2', 'and more']
    

    An explanation of the regex logic, which uses an alternation to match one of two cases:

    [A-Z]{2,}              match TWO or more capital letters
    (?: [A-Z0-9.]+)*       followed by zero or more words, consisting only of
                           capital letters, numbers, or period
    |                      OR
    (?![A-Z]{2})\w+        match a word which does NOT start with two capital letters
    (?: (?![A-Z]{2})\w+)*  then match zero or more similar terms
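
    The breakdown above can also be kept inside the pattern itself. The following is a minimal sketch of the same regex written with re.VERBOSE; the name SCENE_SPLITTER is only illustrative and is not part of the original answer. In verbose mode unescaped spaces are ignored, so each literal space is written as [ ].

```python
import re

# Same alternation as in the answer, with the explanation embedded as comments.
# re.VERBOSE ignores unescaped whitespace, so a literal space is written as [ ].
SCENE_SPLITTER = re.compile(
    r"""
    [A-Z]{2,}                  # a word that starts with two or more capital letters
    (?:[ ][A-Z0-9.]+)*         # then zero or more words of capitals, digits, or periods
    |                          # OR
    (?![A-Z]{2})\w+            # a word that does not start with two capital letters
    (?:[ ](?![A-Z]{2})\w+)*    # then zero or more words of the same kind
    """,
    re.VERBOSE,
)

inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
parts = SCENE_SPLITTER.findall(inp)
print(parts)
```

    Compiling the pattern once is a convenience if it is applied to many strings; the matching behaviour is intended to be identical to the one-line version.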
    
  • 2021-01-23 10:56

    If I understand your requirements correctly, you may use the following pattern:

    (?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))
    


    Breakdown:

    (?:                    # Start of a non-capturing group.
        ACT|SCENE          # Matches either 'ACT' or 'SCENE'.
    )                      # Close the non-capturing group.
    .+?                    # Matches one or more characters (lazy matching).
    \d+                    # Matches one or more digits.
    |                      # Alternation (OR).
    \S                     # Matches a non-whitespace character (to trim spaces).
    .*?                    # Matches zero or more characters (lazy matching).
    (?=                    # Start of a positive Lookahead (i.e., followed by...).
        \s?                # An optional whitespace character (to trim spaces).
        (?:ACT|SCENE|$)    # Followed by either 'ACT' or 'SCENE' or the end of the string.
    )                      # Close the Lookahead.
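
    If the set of marker words may change, the alternation described in the breakdown can be built from a list. This is a sketch only: the function name split_on_markers and its markers parameter are hypothetical and do not appear in the original answer.

```python
import re

def split_on_markers(text, markers=("ACT", "SCENE")):
    """Split text into marker segments and the plain text between them.

    A marker segment runs from a marker word up to the next run of digits;
    everything else is returned as whitespace-trimmed filler text.
    """
    # Escape each marker so that regex metacharacters in it are matched literally.
    alternation = "|".join(re.escape(m) for m in markers)
    pattern = rf"(?:{alternation}).+?\d+|\S.*?(?=\s?(?:{alternation}|$))"
    return re.findall(pattern, text)

sample = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
print(split_on_markers(sample))
```

    As in the original pattern, the marker words are not anchored with \b, so a marker would also match inside a longer word such as FACT.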
    

    Python example:

    import re
    
    regex = r"(?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))"
    test_str = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
    
    # Use a name other than `list` to avoid shadowing the built-in type.
    parts = re.findall(regex, test_str)
    print(parts)
    

    Output:

    ['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']
    


  • 2021-01-23 10:57

    You can use re.findall:

    import re
    s = 'Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more'
    # Raw string so that \d and \s are passed to the regex engine unchanged.
    new_s = list(map(str.strip, re.findall(r'[A-Z\d\s\.]{2,}|^[A-Z]{1}[a-z\s]+|[a-z\s]+', s)))
    print(new_s)
    

    Output:

    ['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']
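
    The str.strip step matters here because every character class in this pattern includes \s, so most matches carry a trailing space. A small sketch that shows the matches before and after stripping (the variable names raw and stripped are illustrative):

```python
import re

s = 'Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more'
pattern = r'[A-Z\d\s\.]{2,}|^[A-Z]{1}[a-z\s]+|[a-z\s]+'

# Matches exactly as the regex returns them, whitespace included.
raw = re.findall(pattern, s)
print(raw)

# The same matches with leading and trailing whitespace removed.
stripped = [m.strip() for m in raw]
print(stripped)
```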
    