Looking for a database or text file of English words with their different forms

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-19 19:46:57

Question


I am working on a project where I need to get the root of a given word (stemming). As you know, stemming algorithms that don't use a dictionary are not accurate. I also tried WordNet, but it is not a good fit for my project. I found the phpmorphy project, but it doesn't include a Java API.

At this point I am looking for a database or a text file of English words with their different forms. For example:

run running ran ...
include including included ...
...

Thank you for your help or advice.


Answer 1:


You could download LanguageTool (Disclaimer: I'm the maintainer), which comes with a binary file english.dict. The LanguageTool Wiki describes how to dump that file as a text file:

java -jar morfologik-tools-1.6.0-standalone.jar fsa_dump -x -d english.dict

For run, the file will contain this:

ran run VBD
run run NN
run run VB
run run VBN
run run VBP
running run VBG
runs run NNS
runs run VBZ

The first column is the inflected form, the second is the base form, and the third is the part-of-speech tag according to the (slightly extended) Penn Treebank tagset.
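Since the question asks for something usable from Java, here is a minimal sketch of how the dumped text could be loaded and queried. It assumes the dump has been saved to a text file (passed as the first command-line argument), that the file is UTF-8, and that each line holds three whitespace-separated columns as in the sample above. The class name LemmaIndex and its method names are made up for this illustration and are not part of LanguageTool or morfologik.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of LanguageTool: indexes a dump whose lines
// look like "inflected base POS" (assumed to be whitespace-separated).
public class LemmaIndex {

    // Maps each inflected form to the set of base forms it can belong to.
    public static Map<String, Set<String>> formToLemmas(List<String> lines) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 2) {
                continue; // skip blank or malformed lines
            }
            index.computeIfAbsent(cols[0], k -> new TreeSet<>()).add(cols[1]);
        }
        return index;
    }

    // Maps each base form to all inflected forms listed for it,
    // which is the layout asked for in the question.
    public static Map<String, Set<String>> lemmaToForms(List<String> lines) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 2) {
                continue; // skip blank or malformed lines
            }
            index.computeIfAbsent(cols[1], k -> new TreeSet<>()).add(cols[0]);
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LemmaIndex <dump-file>");
            return;
        }
        // Assumption: the dump was redirected to a UTF-8 text file.
        List<String> lines =
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        // Print one line per base form, followed by its inflected forms.
        for (Map.Entry<String, Set<String>> e : lemmaToForms(lines).entrySet()) {
            System.out.println(e.getKey() + " " + String.join(" ", e.getValue()));
        }
    }
}
```

A form can map to more than one base form, so both maps store sets rather than single strings. Splitting on whitespace would mis-parse entries that contain spaces; if the real dump is tab-separated, splitting on "\t" instead is probably safer.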



Source: https://stackoverflow.com/questions/18366071/looking-for-a-database-or-text-file-of-english-words-with-their-different-forms
