Natural English language words

醉酒成梦 2021-01-31 19:20

I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet that has good enough q

6 answers
  • 2021-01-31 19:26

    Kevin's wordlists are the best I know of if you just want lists of words.

    WordNet is better if you also want to know parts of speech (noun, verb, etc.), synonyms, and so on.

  • 2021-01-31 19:33

    There aren't that many base words (171k according to Oxford, which matches what I remember being told in my CS program in college). But if you include all forms of the words, the count rises considerably.

    That said, why not make one yourself? Get a Wikipedia dump and parse it and create a set of all tokens you encounter.

    Expect misspellings though. Like all things crowd-sourced, there will be errors.
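
    A minimal sketch of that idea in Python, assuming the dump has already been converted to plain text (stripping the wiki markup is a separate step not shown here). The names `count_tokens` and `build_word_set` are made up for illustration, and the `min_count` threshold is just one crude way to drop rare misspellings:

```python
import re
from collections import Counter
from typing import Iterable, Set

# Alphabetic runs, optionally with one internal apostrophe (e.g. "don't").
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count lowercased word tokens across an iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))
    return counts


def build_word_set(lines: Iterable[str], min_count: int = 1) -> Set[str]:
    """Return every token seen at least min_count times.

    Raising min_count is an assumption-laden filter: it removes rare typos,
    but it also removes genuinely rare words.
    """
    counts = count_tokens(lines)
    return {word for word, n in counts.items() if n >= min_count}
```

    Since `build_word_set` accepts any iterable of strings, an open file handle over the extracted text can be passed in directly, so the whole dump never has to sit in memory at once.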

  • 2021-01-31 19:38

    Try DBpedia, which offers data extracted directly from Wikipedia: http://dbpedia.org

  • 2021-01-31 19:40

    `The "million word" hoax rolls along', I see ;-)

    How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.

  • 2021-01-31 19:44

    Who told you there were 1 million words? According to Wikipedia, the Oxford English Dictionary only has 600,000. And the OED tries to include all technical and slang terms that are in use.

  • 2021-01-31 19:47

    I did research for Purdue on controlled/natural English and language domain knowledge processing.

    I would take a look at the Attempto project: http://attempto.ifi.uzh.ch/site/description/ which is a project to help build a controlled natural English.

    You can download their entire word lexicon at: http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip. It has ~100,000 natural English words.

    You can also supply your own lexicon for domain-specific words, which is what we did in our research. They offer web services to parse and format natural English text.
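
    If you only want the bare words out of that lexicon, something like the sketch below might do. It rests on an assumption not stated above: that the lexicon file is a list of Prolog facts, one per line, with the surface form as the first argument, e.g. `noun_pl(men, man, masc).` Check the file you actually download before relying on it. The function name `surface_forms` is made up for this example.

```python
import re
from typing import Iterable, Set

# ASSUMED layout: one Prolog fact per line, e.g.  noun_sg(apple, apple, neutr).
# Group 1 is the predicate name, group 2 the first argument (the word form),
# which may be a bare atom or a single-quoted atom.
FACT_RE = re.compile(r"^\s*([a-z_]\w*)\(\s*('(?:[^'\\]|\\.)*'|[^,()']+?)\s*,")


def surface_forms(lines: Iterable[str]) -> Set[str]:
    """Collect the first argument of every fact-like line."""
    words: Set[str] = set()
    for line in lines:
        if line.lstrip().startswith("%"):  # skip Prolog comments
            continue
        match = FACT_RE.match(line)
        if match is None:  # directives, blank lines, anything unexpected
            continue
        word = match.group(2)
        if word.startswith("'") and word.endswith("'"):
            # Strip the quotes and unescape any embedded \' sequences.
            word = word[1:-1].replace("\\'", "'")
        words.add(word)
    return words
```

    Pass it an open file handle for the unpacked lexicon file. Lines that do not look like facts are silently skipped, so a format mismatch shows up as an unexpectedly small result rather than an error.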
