I have a text file which contains a list of slang words and their substitutes in real English. I converted this text file into a dictionary using ":" as the split point, and upo
I suggest replacing
slangs_re = re.compile('|'.join(slang_dict.keys()))
with
slangs_re = re.compile(r"(?<!\w)(?:{})(?!\w)".format('|'.join([re.escape(x) for x in slang_dict])))
and make sure you pass the keys sorted by length in descending order, so that longer phrases are tried before their shorter substrings.
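Why the order matters: the regex engine tries alternatives left to right and takes the first one that matches, so a short key can shadow a longer phrase that starts the same way. A minimal sketch with a hypothetical short key "ow" overlapping the dictionary's "ow wow":

```python
import re

# Hypothetical overlapping keys, just to illustrate alternation order
phrases = ["ow", "ow wow"]

# Unsorted: the engine tries "ow" first and stops there
unsorted_re = re.compile("|".join(map(re.escape, phrases)))
print(unsorted_re.findall("ow wow"))  # ['ow', 'ow'] — the long phrase never matches

# Sorted by length, descending: the longer phrase is tried first
ordered = sorted(phrases, key=len, reverse=True)
sorted_re = re.compile("|".join(map(re.escape, ordered)))
print(sorted_re.findall("ow wow"))  # ['ow wow']
```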
from collections import OrderedDict
import re

test = "fitess no kome*"
slang_dict = {"Aha aha": "no", "fitess": "fitness", "damm": "damn", "kome*": "come", "ow wow": "rrf"}

# Sort the keys by length, descending, so longer phrases win in the alternation
slang_dict = OrderedDict(sorted(slang_dict.items(), key=lambda x: len(x[0]), reverse=True))
slangs_re = re.compile(r"(?<!\w)(?:{})(?!\w)".format('|'.join([re.escape(x) for x in slang_dict])))

def correct_slang(s, slang_dict=slang_dict):
    def replace(match):
        return slang_dict[match.group(0)]
    return slangs_re.sub(replace, s)

test = correct_slang(test)
print(test)  # => fitness no come
See the Python demo
This will check the terms as whole words and escape the special characters in each search phrase, so that nothing breaks when the phrases are passed to the regular expression engine.
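The escaping matters here because the key "kome*" contains a regex metacharacter. A quick sketch of the difference:

```python
import re

# Without escaping, "*" means "zero or more of the preceding character"
raw = re.compile("kome*")
print(raw.fullmatch("kom"))        # matches — "e*" allows zero e's

# With re.escape, the "*" is matched literally
escaped = re.compile(re.escape("kome*"))
print(escaped.fullmatch("kom"))    # None
print(escaped.fullmatch("kome*"))  # matches the literal string
```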
If you are not interested in whole word matching, remove (?<!\w) (the check for the leading word boundary) and (?!\w) (the check for the trailing word boundary).
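A quick sketch of the difference, using the "damm" key and a made-up longer word:

```python
import re

# With the lookarounds: "damm" must stand alone as a whole word
word_re = re.compile(r"(?<!\w)(?:damm)(?!\w)")
print(word_re.sub("damn", "damm it, grandammm"))  # only the standalone word is replaced

# Without them: every occurrence is replaced, even inside other words
any_re = re.compile(r"(?:damm)")
print(any_re.sub("damn", "damm it, grandammm"))   # also rewrites the middle of "grandammm"
```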