all possible wordform completions of a (biomedical) word's stem


时光说笑

I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given


1 answer

执念已碎

2021-01-14 19:25

This solution requires preprocessing your corpus, but once that is done, finding the variants of a word is a single dictionary lookup.

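from collections import defaultdict
from stemming.porter2 import stem

# Load the corpus: one word per line.
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Preprocessing step: map each stem to every corpus word that shares it.
stems = defaultdict(list)

for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Lookup: stem the query word, then fetch all words filed under that stem.
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])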

For the /usr/share/dict/words corpus, this produces the result

['leukocyte', "leukocyte's", 'leukocytes']

It uses the stemming module, which can be installed with:

pip install stemming
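
The same stem-to-words index can be wrapped in two small functions so that the stemmer is passed in as a parameter. The following is only a sketch of that idea: `build_stem_index`, `completions`, and `naive_stem` are hypothetical names, and `naive_stem` is a toy stand-in (it only lowercases and strips a possessive and one plural "s"), not the Porter2 algorithm. In practice you would pass `stem` from `stemming.porter2` instead.

```python
from collections import defaultdict


def naive_stem(word):
    """Toy stand-in for a real stemmer (an assumption for illustration only)."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    # Strip a single plural "s", but leave short words and "-ss" endings alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def build_stem_index(words, stem_fn):
    """Preprocess once: map each stem to the set of words that reduce to it."""
    index = defaultdict(set)
    for w in words:
        index[stem_fn(w)].add(w)
    return index


def completions(word, index, stem_fn):
    """Return all indexed word forms sharing the stem of `word`, sorted."""
    # .get() avoids creating empty entries in the defaultdict on a miss.
    return sorted(index.get(stem_fn(word), ()))


# A tiny in-memory corpus instead of /usr/share/dict/words.
corpus = ["leukocyte", "leukocyte's", "leukocytes",
          "leukemia", "lymphocyte", "lymphocytes"]
index = build_stem_index(corpus, naive_stem)
variants = completions("leukocytes", index, naive_stem)
print(variants)
```

Using a set per stem drops duplicate corpus entries, and because the query word is stemmed before the lookup, any inflected form can be used as the query. How many variants come back depends entirely on the stemmer and the corpus: a biomedical word list would be needed to find domain-specific forms.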
