all possible wordform completions of a (biomedical) word's stem

时光说笑 2021-01-14 18:51

I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given

1 Answer
  • 2021-01-14 19:25

    This solution requires preprocessing your corpus, but once that is done, finding a word's variants is a very quick dictionary lookup.

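    from collections import defaultdict
    from stemming.porter2 import stem
    
    # Load the corpus: one word per line.
    with open('/usr/share/dict/words') as f:
        words = f.read().splitlines()
    
    # Preprocessing: map each stem to every word in the corpus that shares it.
    stems = defaultdict(list)
    
    for word in words:
        word_stem = stem(word)
        stems[word_stem].append(word)
    
    if __name__ == '__main__':
        # Lookup: stem the query word and fetch all wordforms with that stem.
        word = 'leukocyte'
        word_stem = stem(word)
        print(stems[word_stem])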
    

    For the /usr/share/dict/words corpus, this produces the result

    ['leukocyte', "leukocyte's", 'leukocytes']
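
The same stem-to-wordforms index can be sketched as two small reusable functions. This is only an illustration under stated assumptions: `naive_stem` is a hypothetical stand-in that strips a couple of English suffixes so the example runs without third-party packages, and `build_stem_index` and `variants` are names made up here. In practice you would pass in the `stem` function from `stemming.porter2` instead.

```python
from collections import defaultdict


def naive_stem(word):
    """Hypothetical stand-in stemmer (NOT Porter2): lowercase, strip "'s" or "s"."""
    word = word.lower()
    for suffix in ("'s", "s"):
        # Keep at least three characters so short words like "gas" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(words, stem_fn=naive_stem):
    """Preprocessing step: map each stem to the corpus words that share it."""
    index = defaultdict(list)
    for w in words:
        index[stem_fn(w)].append(w)
    return index


def variants(word, index, stem_fn=naive_stem):
    """Lookup step: return every corpus word with the same stem as `word`."""
    # .get() avoids adding empty entries to the defaultdict for unknown stems.
    return index.get(stem_fn(word), [])


# A tiny illustrative corpus (assumption; the answer uses /usr/share/dict/words).
corpus = ["leukocyte", "leukocyte's", "leukocytes", "leukemia",
          "lymphocyte", "lymphocytes"]
index = build_stem_index(corpus)

print(variants("leukocytes", index))
print(variants("lymphocyte", index))
```

Note that a lookup like this can only return wordforms that actually occur in the corpus, and how well variants are grouped depends entirely on the stemmer that is plugged in.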
    

    It uses the stemming module, which can be installed with

    pip install stemming
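
Since the preprocessing pass is the expensive part, one option is to run it once and store the resulting index on disk, so that later runs only need the dictionary lookup. The following is a minimal sketch using the standard-library `json` module. The helper names `save_index` and `load_index`, the file name `stems.json`, and the key `"leukocyt"` are illustrative assumptions; the real keys are whatever the stemmer returns.

```python
import json
import os
import tempfile
from collections import defaultdict


def save_index(index, path):
    """Write the stem -> wordforms mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Read the mapping back; the result is a plain dict, not a defaultdict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Stand-in for the index built in the preprocessing step (illustrative key).
stems = defaultdict(list)
stems["leukocyt"].extend(["leukocyte", "leukocyte's", "leukocytes"])

# Hypothetical location; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "stems.json")
save_index(stems, path)

loaded = load_index(path)
# Use .get() because a plain dict raises KeyError for unknown stems.
print(loaded.get("leukocyt", []))
```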
    