I have a large set of real-world text that I need to pull words out of to input into a spell checker. I\'d like to extract as many meaningful words as possible with
Sample code
print re.search(ur'(?u)ривет\b', ur'Привет')
print re.search(ur'(?u)\bривет\b', ur'Привет')
or
s = ur"abcd ААБВ"
import re
rx1 = re.compile(ur"(?u)АБВ")
rx2 = re.compile(ur"(?u)АБВ\b")
rx3 = re.compile(ur"(?u)\bАБВ\b")
print rx1.findall(s)
print rx2.findall(s)
print rx3.findall(s)
If you restrict yourself to ASCII letters, then use (with the re.I
option set)
\b[a-z]+\b
\b
is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b
matches pie
, but not pie21
or 21pie
.
To also allow other non-ASCII letters, you can use something like this:
\b[^\W\d_]+\b
which also allows accented characters etc. You may need to set the re.UNICODE
option, especially when using Python 2, in order to allow the \w
shorthand to match non-ASCII letters.
[^\W\d_]
as a negated character class allows any alphanumeric character except for digits and underscore.
What about:
import re
yourString="pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
filter (lambda x:re.match("^[a-zA-Z]+$",x),[x for x in set(re.split("[\s:/,.:]",yourString))])
Note that:
EDIT : added some explanations
Are you familiar with word boundaries? (\b
). You can extract word's using the \b
around the sequence and matching the alphabet within:
\b([a-zA-Z]+)\b
For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
You can the \b
sequence, and others, over at the python manual
EDIT Also, if you're looking to about a number following or preceding the match, you can use a negative look-ahead/behind:
(?!\d) # negative look-ahead for numbers
(?<!\d) # negative look-behind for numbers