Extract sentences in nested parentheses using Python

难免孤独 2021-01-26 06:48

I have multiple .txt files in a directory. Here is a sample of one of my .txt files:

kkkkk;

  select xx("xE'", PU


        
2 Answers
  • 2021-01-26 07:37

    As I said, this is a duplicate of "Python: How to match nested parentheses with regex?", which shows several methods of handling nested parentheses, not all of them regex-based. One of those methods requires the third-party regex module from PyPI. If text holds the contents of the file, the following should do what you want:

    import regex as re
    
    text = """kkkkk;
    
      select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
    quit;
    
    /* 1.xxxxx FROM xxxx_x_Ex_x */
    proc sql; ("TRUuuuth");
    hhhjhfjs as fdsjfsj:
    select * from djfkjd to jfkjs
    (
    SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
        FROM &xxx..xxx_xxx_xxE
    where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
          (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
     );
    
    
    jjjjjj;
    
      select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
    quit;
    
    /* 1.xxxxx FROM xxxx_x_Ex_x */ ()
    proc sql; ("CUuuiiiiuth");
    hhhjhfjs as fdsjfsj:
    select * from djfkjd to jfkjs
    (SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
        FROM &xxx..xxx_xxx_xxE
    where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
          (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
     );"""    
    
    regex = re.compile(r"""
    (?<rec> #capturing group rec
     \( #open parenthesis
     (?: #non-capturing group
        [^()]++ #anything but parenthesis one or more times without backtracking
      | #or
        (?&rec) #recursive substitute of group rec
     )*
     \) #close parenthesis
    )
    """, flags=re.VERBOSE)
    
    for m in regex.finditer(text):
        groups = m.captures('rec')
        group = groups[-1] # the last group is the outermost nesting
        if re.match(r'^\(+\s*\)+$', group):
            continue # not interested in empty parentheses such as '( )'
        print(group)
    

    Prints:

    ("xE'", PUT(xx.xxxx.),"'")
    ("TRUuuuth")
    (
    SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
        FROM &xxx..xxx_xxx_xxE
    where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
          (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
     )
    ("xE'", PUT(xx.xxxx.),"'")
    ("CUuuiiiiuth")
    (SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
        FROM &xxx..xxx_xxx_xxE
    where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
          (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
     )
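The filter in the loop above, `re.match(r'^\(+\s*\)+$', group)`, uses nothing beyond what the standard-library `re` module also supports, so it can be checked on its own. The following is a small sketch of what that pattern treats as "empty"; the name `EMPTY_PARENS` and the candidate strings are made up for illustration.

```python
import re

# Same pattern as the filter in the answer: one or more '(',
# optional whitespace, then one or more ')', and nothing else.
EMPTY_PARENS = re.compile(r'^\(+\s*\)+$')

# Hypothetical candidates, not taken from the original files.
candidates = ['()', '( )', '(( ))', '("TRUuuuth")', '(a (b))', '() ()']

for candidate in candidates:
    is_empty = bool(EMPTY_PARENS.match(candidate))
    print(f"{candidate!r}: {'skipped' if is_empty else 'kept'}")
```

Only the first three candidates are skipped. Anything with content between the parentheses is kept, and so is `'() ()'`, because the pattern must cover the whole string.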
    
  • 2021-01-26 07:47

    My solution is not the most optimised one, but it will solve your problem.

    Solution (replace test.txt with your file name):

    result = []
    with open('test.txt', 'r') as fd:
        # Tracks the '(' characters that are currently open
        parentheses_stack = []
        # Collects the characters of the group currently being read
        complete_word = []
        # Iterate through each line in the file
        for line in fd:
            # Iterate through each character in the line
            for char in line:
                # Push every '(' onto the stack
                if char == '(':
                    parentheses_stack.append(char)
                # Pop one '(' from the stack for every ')'
                if char == ')':
                    if parentheses_stack:
                        parentheses_stack.pop()
                        # The stack just emptied, so this ')' closes the outermost group
                        if not parentheses_stack:
                            complete_word.append(char)
                # Collect characters between the first '(' and the last ')'
                if parentheses_stack:
                    complete_word.append(char)
                elif complete_word:
                    # Store the finished group once every '(' has been popped
                    result.append(''.join(complete_word))
                    complete_word = []

    for res in result:
        print(res)
    

    Result:

    WS:python rameshrv$ python3 /Users/rameshrv/Documents/python/test.py
    ("xE'", PUT(xx.xxxx.),"'")
    ("TRUuuuth")
    (
    SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
        FROM &xxx..xxx_xxx_xxE
    where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and 
          (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
     )
    ("xE'", PUT(xx.xxxx.),"'")
    ()
    ("CUuuiiiiuth")
    (SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
        FROM &xxx..xxx_xxx_xxE
    where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and 
          (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
     )
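The question mentions several .txt files in one directory, while the code above reads a single file. One possible way to cover the whole directory is sketched below. It swaps the explicit stack for a depth counter, which is enough when only one kind of bracket is tracked. The names `outermost_groups` and `groups_by_file` are hypothetical, and the sketch assumes the files are UTF-8 text and small enough to read into memory. Like the code above, it keeps empty groups such as `()`, and it does not treat parentheses inside quoted strings specially.

```python
from pathlib import Path


def outermost_groups(text):
    """Return every outermost parenthesized group in text, in order."""
    groups = []
    depth = 0   # how many '(' are currently open
    start = 0   # index of the '(' that opened the current group
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth:   # a ')' with nothing open is ignored
            depth -= 1
            if depth == 0:
                groups.append(text[start:i + 1])
    return groups


def groups_by_file(directory, pattern="*.txt"):
    """Map each matching file name to the groups found in that file."""
    return {
        path.name: outermost_groups(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob(pattern))
    }
```

Calling `groups_by_file("some_directory")` returns a dictionary keyed by file name, so the groups from each file can be printed or saved separately. A group that is opened but never closed before the end of a file is dropped.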
    