How to get all matches using str.contains in Python regex?

闹比i 2021-01-27 03:37

I have a data frame in which I need to find every possible match between the rows and a list of terms. My code is:

texts = ['foo abc', 'foobar xyz',


        
2 Answers
  • 2021-01-27 03:50

    Longer alternatives must come before shorter ones, so sort the keywords by length in descending order:

    pat = r'\b(?:{})\b'.format('|'.join(sorted(terms,key=len,reverse=True)))
    

    The resulting pattern is \b(?:foo baz|foo|baz)\b. At each position the engine first tries foo baz, then foo, then baz. When foo baz is found, that match is returned and the search for the next match resumes from the end of it, so the foo and baz inside an already matched foo baz are not matched again.

    See more on this in "Remember That The Regex Engine Is Eager".
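
As a minimal sketch of the idea above, the following builds the length-sorted pattern and runs it over a few texts. The terms list is inferred from the pattern shown in this answer; the sample texts beyond 'foo abc' and 'foobar xyz', the helper name build_pattern, and the use of re.escape are assumptions for illustration, not part of the original answer.

```python
import re

# Terms inferred from the pattern shown in the answer.
terms = ['foo', 'baz', 'foo baz']
# Sample texts; only the first two appear in the question, the rest are assumed.
texts = ['foo abc', 'foobar xyz', 'baz 45', 'foo baz']


def build_pattern(terms):
    """Hypothetical helper: alternation with the longest terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape is an added precaution in case a term contains regex metacharacters.
    return re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, ordered))))


pat = build_pattern(terms)
results = [pat.findall(text) for text in texts]

for text, found in zip(texts, results):
    print(repr(text), found)
```

With this ordering 'foo baz' is reported as a single match, and 'foobar xyz' yields nothing because \b requires a word boundary after foo.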

  • 2021-01-27 03:54

    Instead of using the regex pattern for checking the presence of terms,

    #create pattern
    p = re.compile(pat)
    
    #search for pattern in the column
    results = [p.findall(text) for text in df.Match_text.tolist()]
    

    try a simple lookup of each term in the text, like this:

    #search for each term in the column
    results = [[term for term in terms if term in text] for text in df.Match_text.tolist()]
    

    The output looks like this:

        Match_text  results
    0   foo abc     [foo]
    3   baz 45      [baz]
    6   foo baz     [foo, baz, foo baz]
    

    NOTE: This method checks every term against every text, so its running time grows with the number of terms multiplied by the number of rows.
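
The cost mentioned in the note comes from running one check per term per text. A rough sketch of that lookup follows, next to an assumed variant that wraps each term in \b word boundaries so that foo is not reported inside foobar (a plain `in` test is a substring test). The sample terms and texts, and the names plain and bounded, are assumptions for illustration.

```python
import re

# Assumed sample data for illustration.
terms = ['foo', 'baz', 'foo baz']
texts = ['foo abc', 'foobar xyz', 'baz 45', 'foo baz']

# Plain substring lookup, as in the answer: one `in` test per term per text.
plain = [[term for term in terms if term in text] for text in texts]

# Assumed variant: one compiled word-boundary pattern per term, so overlapping
# terms are all reported but partial-word hits are not.
term_patterns = [(term, re.compile(r'\b{}\b'.format(re.escape(term)))) for term in terms]
bounded = [[term for term, p in term_patterns if p.search(text)] for text in texts]

for text, a, b in zip(texts, plain, bounded):
    print(repr(text), a, b)
```

Both versions report foo, baz and foo baz for 'foo baz'; they differ only on 'foobar xyz', where the substring test finds foo and the word-boundary variant finds nothing.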
