Extracting whole words


I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible with

4 Answers

悲哀的现实

2020-12-03 16:05

Sample code

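import re

print(re.search(r'(?u)ривет\b', 'Привет'))    # matches: no boundary is required before "ривет"
print(re.search(r'(?u)\bривет\b', 'Привет'))  # None: \b cannot match between "П" and "р"

s = "abcd ААБВ"
rx1 = re.compile(r"(?u)АБВ")
rx2 = re.compile(r"(?u)АБВ\b")
rx3 = re.compile(r"(?u)\bАБВ\b")
print(rx1.findall(s))  # ['АБВ']
print(rx2.findall(s))  # ['АБВ']
print(rx3.findall(s))  # [] because "АБВ" is preceded by another letter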


遥遥无期

2020-12-03 16:11
If you restrict yourself to ASCII letters, then use (with the re.I option set)
```
\b[a-z]+\b
```
\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
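A minimal sketch of that pattern in use, assuming Python 3; the name `ASCII_WORD` and the sample text are made up for illustration:

```python
import re

# Case-insensitive ASCII-only word pattern, as described above.
ASCII_WORD = re.compile(r"\b[a-z]+\b", re.I)

text = "Pie pie21 21pie apple-tart"
matches = ASCII_WORD.findall(text)
print(matches)  # ['Pie', 'apple', 'tart']
```

The tokens `pie21` and `21pie` are skipped because a digit next to a letter is not a word boundary, while the hyphen in `apple-tart` is one, so both halves are returned.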

To also allow non-ASCII letters, you can use something like this:
```
\b[^\W\d_]+\b
```
which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.

The negated character class [^\W\d_] matches any word character (\w) except digits and the underscore, which leaves only letters.
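As a rough illustration, assuming Python 3 (where `str` patterns are Unicode-aware without any flag), the letters-only pattern behaves like this; `LETTER_WORD` and the sample text are hypothetical:

```python
import re

# Letters only: any \w character that is neither a digit nor an underscore.
LETTER_WORD = re.compile(r"\b[^\W\d_]+\b")

text = "Straße Привет pie21 snake_case 42"
letter_words = LETTER_WORD.findall(text)
print(letter_words)  # ['Straße', 'Привет']
```

Note that `snake_case` yields nothing: the underscore counts as a word character for `\b`, so neither half sits between two word boundaries.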
深忆病人

2020-12-03 16:25
What about:
```
import re
yourString = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42"
# list() is needed on Python 3, where filter returns a lazy iterator
words = list(filter(lambda x: re.match(r"^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString))))
print(words)
```
Note that:
- split breaks your string into potential candidates and returns a list of "potential words".
- set removes duplicates: it turns the list into a set, dropping entries that appear more than once. This step is not mandatory.
- filter reduces the number of candidates: it takes a sequence, applies a test function to each element, and keeps only the elements that pass the test. In our case, the test function is anonymous.
- lambda is the anonymous function: it takes an item and checks whether it is a word (upper- or lower-case letters only).
EDIT: added some explanations
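The same split, de-duplicate, and filter steps can also be written out one at a time. This is only a sketch under assumptions: Python 3, and `letter_only_tokens` is a hypothetical helper name that does not appear in the answer.

```python
import re

def letter_only_tokens(text):
    """Hypothetical helper: split, de-duplicate, keep letter-only tokens."""
    # Step 1: split on runs of whitespace and a few separator characters.
    candidates = re.split(r"[\s:/,.]+", text)
    # Step 2: drop duplicates (optional).
    unique = set(candidates)
    # Step 3: keep tokens made of ASCII letters only; sort for stable output.
    return sorted(w for w in unique if re.fullmatch(r"[a-zA-Z]+", w))

sample = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42 pie"
tokens = letter_only_tokens(sample)
print(tokens)  # ['com', 'foo', 'http', 'pie']
```

One thing this makes visible: URL fragments such as `http`, `foo`, and `com` pass the letters-only test, so they may need separate handling before spell checking.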
无人及你

2020-12-03 16:27
Are you familiar with word boundaries (\b)? You can extract words by putting \b around the sequence and matching letters in between:
```
\b([a-zA-Z]+)\b
```
For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
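A small sketch of how the capture group can be used, assuming Python 3; `WORD` and the sample text are illustrative. It also records each word's offset, which may be handy when reporting spell-checker hits:

```python
import re

WORD = re.compile(r"\b([a-zA-Z]+)\b")

text = "well-known words; end."
# group(1) is the captured word, start(1) its position in the text.
found = [(m.group(1), m.start(1)) for m in WORD.finditer(text)]
print(found)  # [('well', 0), ('known', 5), ('words', 11), ('end', 18)]
```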

You can find the \b sequence, and others, in the Python manual.

EDIT: Also, if you want to avoid a number following or preceding the match, you can use a negative look-ahead/look-behind:
```
(?!\d)   # negative look-ahead for numbers
(?<!\d)  # negative look-behind for numbers
```
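One caveat worth sketching, assuming Python 3: digit-only lookarounds on their own let the regex engine backtrack to a shorter match, so the excluded class probably needs to cover letters as well. The pattern names below are hypothetical.

```python
import re

# Digit-only lookarounds: the engine backtracks and returns word fragments.
NAIVE = re.compile(r"(?<!\d)[a-zA-Z]+(?!\d)")
# Excluding letters too forces whole runs of letters only.
STRICT = re.compile(r"(?<![A-Za-z\d])[A-Za-z]+(?![A-Za-z\d])")

naive_hits = NAIVE.findall("pie pie21 21pie")
strict_hits = STRICT.findall("pie pie21 21pie snake_case")
print(naive_hits)   # ['pie', 'pi', 'ie']
print(strict_hits)  # ['pie', 'snake', 'case']
```

Unlike `\b`, the strict lookaround version does not treat the underscore as part of a word, which is why `snake` and `case` are returned here.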