Extracting whole words

送分小仙女 posted on 2019-11-27 14:48:39

If you restrict yourself to ASCII letters, then use (with the re.I option set)

\b[a-z]+\b

\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
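
As a quick check of the behaviour described above, here is a minimal sketch; the sample text is made up for illustration and is not from the original question.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "pie Pie21 21pie PIE"

# re.I makes [a-z] match uppercase letters as well.
ascii_words = re.findall(r"\b[a-z]+\b", text, re.I)
print(ascii_words)  # ['pie', 'PIE']
```

Pie21 and 21pie are skipped because there is no word boundary between a letter and a digit.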

To also allow non-ASCII letters, you can use something like this:

\b[^\W\d_]+\b

which also matches accented characters and letters from other scripts. In Python 2 you need to set the re.UNICODE option for the \w shorthand to match non-ASCII letters; in Python 3, str patterns are Unicode-aware by default.

[^\W\d_] as a negated character class allows any alphanumeric character except for digits and underscore.
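
A small sketch of this pattern on Python 3, where str patterns are Unicode-aware without extra flags. The sample text is an assumption made for illustration; the accented letter is written as an escape so the example does not depend on how the source file is normalized.

```python
import re

# Hypothetical sample text; "\u00e9" is the precomposed letter é.
text = "caf\u00e9 Привет x_1 abc123"

# [^\W\d_] keeps word characters but excludes digits and the underscore.
letters_only = re.findall(r"\b[^\W\d_]+\b", text)
print(letters_only)  # ['café', 'Привет']
```

x_1 and abc123 are rejected because the underscore and the digits count as word characters, so no word boundary follows the letters.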

Are you familiar with word boundaries (\b)? You can extract words by putting \b around the sequence and matching letters within:

\b([a-zA-Z]+)\b

For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
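
To illustrate how punctuation splits the matches, here is a short sketch with an invented sample string:

```python
import re

# Hypothetical sample containing a hyphen, periods, and a semicolon.
text = "well-known e.g. first;second"

# With a single capturing group, findall returns the group contents.
words = re.findall(r"\b([a-zA-Z]+)\b", text)
print(words)  # ['well', 'known', 'e', 'g', 'first', 'second']
```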

You can read about the \b sequence, and others, in the Python manual.

EDIT: Also, if you want to rule out a number directly following or preceding the match, you can use a negative look-ahead/behind:

(?!\d)   # negative look-ahead for numbers
(?<!\d)  # negative look-behind for numbers
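
One possible way to combine these lookarounds into a full pattern is sketched below. It is an illustration, not necessarily what the answer's author had in mind. Letters are included in both lookarounds as well, because otherwise the engine could backtrack and return a fragment such as "pi" from "pie42". Unlike \b, this variant treats the underscore as a separator.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "pie 42pie pie42 foo_bar"

# A run of letters that is not preceded or followed by a letter or digit.
pattern = re.compile(r"(?<![A-Za-z\d])[A-Za-z]+(?![A-Za-z\d])")
matches = pattern.findall(text)
print(matches)  # ['pie', 'foo', 'bar']
```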

What about:

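import re
yourString = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42"
# Split on whitespace and punctuation, drop duplicates, keep letter-only tokens
candidates = set(re.split(r"[\s:/,.]", yourString))
words = list(filter(lambda x: re.match(r"^[a-zA-Z]+$", x), candidates))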

Note that:

  • split breaks the string into candidates and returns a list of "potential words"
  • set removes duplicates by turning the list into a set, so entries appearing more than once are kept only once. This step is optional.
  • filter reduces the number of candidates: it takes a sequence, applies a test function to each element, and keeps the elements that pass the test. Here the test function is anonymous.
  • lambda: an anonymous function that takes an item and checks whether it is a word (upper- or lowercase letters only)
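
The same split, set, and filter steps can also be written as a comprehension. This is a sketch of an alternative spelling, not the original answer's code: it swaps in re.fullmatch for the anchored re.match and sorts the result so the output order is deterministic.

```python
import re

your_string = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42"

# Split into candidates and drop duplicates.
candidates = set(re.split(r"[\s:/,.]", your_string))

# Keep only candidates made entirely of ASCII letters.
words = sorted(c for c in candidates if re.fullmatch(r"[a-zA-Z]+", c))
print(words)  # ['com', 'foo', 'http', 'pie']
```

Note that the URL pieces http, foo, and com pass the test too, since splitting on ":", "/", and "." turns them into plain letter runs.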

EDIT : added some explanations

Sample code

import re
print(re.search(r'(?u)ривет\b', 'Привет'))    # matches 'ривет'
print(re.search(r'(?u)\bривет\b', 'Привет'))  # None: no boundary between 'П' and 'р'

or

import re
s = "abcd ААБВ"
rx1 = re.compile(r"(?u)АБВ")
rx2 = re.compile(r"(?u)АБВ\b")
rx3 = re.compile(r"(?u)\bАБВ\b")
print(rx1.findall(s))  # ['АБВ']
print(rx2.findall(s))  # ['АБВ']
print(rx3.findall(s))  # []: 'АБВ' is preceded by another letter