Let's say I have a list of strings,
string_lst = ['fun', 'dum', 'sun', 'gum']
I want to make a regular expression, where at a point
In line with @vks's reply, I feel this actually does the complete task:
finds = re.findall(r"(?=(\b" + '\\b|\\b'.join(string_lst) + r"\b))", x)
Adding a word boundary (\b) around each alternative completes the task: only whole words match.
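To illustrate the effect of the word boundaries, here is a minimal sketch. The sample sentence is an assumption chosen so that one listed word ("fun") also appears inside a longer word ("funny").

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
x = "It is funny to have fun in the sun."  # assumed sample input

# Without word boundaries, 'fun' is also found inside 'funny'.
loose = re.findall(r"(?=(" + "|".join(string_lst) + r"))", x)

# With \b on both sides of every alternative, only whole words match.
strict = re.findall(r"(?=(\b" + r"\b|\b".join(string_lst) + r"\b))", x)

print(loose)   # ['fun', 'fun', 'sun']
print(strict)  # ['fun', 'sun']
```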
import re
string_lst = ['fun', 'dum', 'sun', 'gum']
x = "I love to have fun."
print(re.findall(r"(?=(" + '|'.join(string_lst) + r"))", x))
You cannot use re.match, because it only matches at the start of the string. Use re.findall instead.
Output: ['fun']
With re.search you will get only the first match, so use re.findall instead.
Also use a lookahead if you have overlapping matches that do not start at the same point.
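The differences described above can be sketched as follows. The word lists and input strings are made up for illustration; in particular, 'abc' and 'bcd' are hypothetical words chosen because they overlap inside "abcd".

```python
import re

words = ['fun', 'sun']
sentence = "I love fun in the sun."  # assumed sample input
pattern = "|".join(map(re.escape, words))

print(re.match(pattern, sentence))            # None: the string does not start with a listed word
print(re.search(pattern, sentence).group(0))  # fun: only the first match
print(re.findall(pattern, sentence))          # ['fun', 'sun']: every non-overlapping match

# Overlapping matches: a plain alternation consumes 'abc', so 'bcd' is never seen.
overlap_words = ['abc', 'bcd']  # hypothetical overlapping words
overlap_text = "abcd"
consuming = re.findall("|".join(overlap_words), overlap_text)
# A lookahead consumes nothing, so the scan retries at every position.
lookahead = re.findall("(?=(" + "|".join(overlap_words) + "))", overlap_text)

print(consuming)  # ['abc']
print(lookahead)  # ['abc', 'bcd']
```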
Besides a regular expression, you can use a list comprehension; I hope that's not off topic.
import re
def match(input_string, string_list):
    # Split the input into words, then keep the ones that are in the list.
    words = re.findall(r'\w+', input_string)
    return [word for word in words if word in string_list]
>>> string_lst = ['fun', 'dum', 'sun', 'gum']
>>> match("I love to have fun.", string_lst)
['fun']
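A possible variant of the same idea, as a sketch rather than the answer's own code: converting the list to a set makes each membership test cheaper for long word lists. The name match_words and the sample sentence are assumptions. Because the input is split into whole words first, "funny" does not count as a match for "fun".

```python
import re

def match_words(input_string, string_list):
    """Return the words of input_string that appear in string_list, in order."""
    wanted = set(string_list)  # set lookup instead of scanning the list each time
    return [word for word in re.findall(r'\w+', input_string) if word in wanted]

result = match_words("It is funny to have fun in the sun.",
                     ['fun', 'dum', 'sun', 'gum'])
print(result)  # ['fun', 'sun']
```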
You should make sure to escape the strings correctly before combining them into a regex:
>>> import re
>>> string_lst = ['fun', 'dum', 'sun', 'gum']
>>> x = "I love to have fun."
>>> regex = re.compile("(?=(" + "|".join(map(re.escape, string_lst)) + "))")
>>> re.findall(regex, x)
['fun']
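To see why escaping matters, consider words that contain regex metacharacters. The words 'a.b' and '1+1' and the sample text below are hypothetical, chosen only to show the difference.

```python
import re

words = ['a.b', '1+1']   # hypothetical words containing metacharacters
text = "axb a.b 11 1+1"  # assumed sample input

# Unescaped: '.' matches any character and '+' means "one or more".
unescaped = re.findall("|".join(words), text)

# Escaped: every word is matched literally.
escaped = re.findall("|".join(map(re.escape, words)), text)

print(unescaped)  # ['axb', 'a.b', '11']
print(escaped)    # ['a.b', '1+1']
```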
The regex module has named lists (sets, actually):
#!/usr/bin/env python
import regex as re # $ pip install regex
p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])
if p.search("I love to have fun."):
    print('matched')
Here words is just a name; you can use anything you like instead.
The .search() method is used instead of adding .* before/after the named list.
To emulate named lists using the stdlib's re module:
#!/usr/bin/env python
import re
words = ['fun', 'dum', 'sun', 'gum']
longest_first = sorted(words, key=len, reverse=True)
p = re.compile(r'(?:{})'.format('|'.join(map(re.escape, longest_first))))
if p.search("I love to have fun."):
    print('matched')
re.escape() is used to escape regex metacharacters such as .*? inside individual words (to match the words literally).
sorted() emulates the regex module's behavior: it puts the longest words first among the alternatives. Compare:
>>> import re
>>> re.findall("(funny|fun)", "it is funny")
['funny']
>>> re.findall("(fun|funny)", "it is funny")
['fun']
>>> import regex
>>> regex.findall(r"\L<words>", "it is funny", words=['fun', 'funny'])
['funny']
>>> regex.findall(r"\L<words>", "it is funny", words=['funny', 'fun'])
['funny']
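For comparison, a sketch of the stdlib emulation applied to the same input. The helper name compile_words is an assumption; it wraps the sorted() and re.escape() steps from above, so the result no longer depends on the order in which the words are given.

```python
import re

def compile_words(words):
    """Build an alternation that tries the longest words first (hypothetical helper)."""
    longest_first = sorted(words, key=len, reverse=True)
    return re.compile('|'.join(map(re.escape, longest_first)))

first = compile_words(['fun', 'funny']).findall("it is funny")
second = compile_words(['funny', 'fun']).findall("it is funny")

print(first)   # ['funny']
print(second)  # ['funny']
```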