Is it possible to use a regex to remove small words from a text? For example, I have the following string (text):
anytext = " in the echo chamber from Ontario duo "
Certainly, it's not that hard either:
shortword = re.compile(r'\W*\b\w{1,3}\b')
The above expression selects any word that is preceded by zero or more non-word characters (essentially whitespace or the start of the string), is between 1 and 3 characters long, and ends on a word boundary.
>>> shortword.sub('', anytext)
' echo chamber from Ontario '
The \b boundary matches are important here: they ensure that you don't match just the first or last 3 characters of a longer word.
The \W* at the start lets you remove both the word and the preceding non-word characters, so that the rest of the sentence still matches up. Note that punctuation is included in \W; use \s if you only want to remove preceding whitespace.
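To make the difference between \W* and \s* concrete, here is a small sketch. The sample sentence and the names shortword_any and shortword_ws are made up for illustration; only the first pattern comes from the answer itself.

```python
import re

# Pattern from the answer: also swallows punctuation before a short word.
shortword_any = re.compile(r'\W*\b\w{1,3}\b')
# Variant that only swallows whitespace before a short word.
shortword_ws = re.compile(r'\s*\b\w{1,3}\b')

# Hypothetical sample text containing a comma before a short word.
sample = "echo chamber, the duo from Ontario"

print(shortword_any.sub('', sample))  # echo chamber from Ontario
print(shortword_ws.sub('', sample))   # echo chamber, from Ontario
```

With \W* the comma disappears together with "the"; with \s* it stays attached to "chamber".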
For what it's worth, this regular expression solution preserves extra whitespace between the rest of the words, while mgilson's version collapses multiple whitespace characters into one space. Not sure if that matters to you.
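A quick way to see the whitespace difference, assuming an input with runs of several spaces (the spaced string below is invented for the demonstration):

```python
import re

shortword = re.compile(r'\W*\b\w{1,3}\b')

# Hypothetical input with multiple spaces between words.
spaced = "echo   chamber in  the   Ontario"

# Regex approach: only the short words and the characters before them go away.
regex_result = shortword.sub('', spaced)
# split()/join approach: every run of whitespace becomes a single space.
split_result = ' '.join(word for word in spaced.split() if len(word) > 3)

print(repr(regex_result))  # 'echo   chamber   Ontario'
print(repr(split_result))  # 'echo chamber Ontario'
```

The regex keeps the three-space gaps between the surviving words, while mgilson's split()/join approach normalises them.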
mgilson's split-and-join solution (strictly a generator expression rather than a list comprehension) is the faster of the two:
>>> import timeit
>>> def re_remove(text): return shortword.sub('', text)
...
>>> def lc_remove(text): return ' '.join(word for word in text.split() if len(word)>3)
...
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import re_remove as remove')
7.0774190425872803
>>> timeit.timeit('remove(" in the echo chamber from Ontario duo ")', 'from __main__ import lc_remove as remove')
6.4250049591064453
I don't think you need a regex for this simple example anyway ...
' '.join(word for word in anytext.split() if len(word)>3)
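If you need this in more than one place, the one-liner could be wrapped in a small helper. This is only a sketch: the function name remove_short_words and the max_short parameter are made up here, not part of any library.

```python
def remove_short_words(text, max_short=3):
    """Drop whitespace-separated words that have max_short characters or fewer."""
    return ' '.join(word for word in text.split() if len(word) > max_short)


anytext = " in the echo chamber from Ontario duo "
print(remove_short_words(anytext))                   # echo chamber from Ontario
print(remove_short_words("the duo, from Ontario"))   # duo, from Ontario
```

Note two side effects of split(): leading and trailing whitespace is dropped, and punctuation counts towards a word's length, so "duo," (4 characters) is kept.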