NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

温柔的废话 2021-01-13 03:41

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a

7 Answers
  •  挽巷 (OP)
    2021-01-13 04:10

    Wikipedia seems to be the best option. Yes, you would have to parse the output, but thanks to Wikipedia's categories you could easily collect different types of articles and vocabulary. For example, parsing all the science categories would give you lots of science words, while articles about places would be skewed towards geographic names, and so on.
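
    To make the category idea concrete, here is a minimal sketch of listing the articles in one category through the public MediaWiki API (`action=query` with `list=categorymembers`). The helper names, the batch size, and the User-Agent string are my own assumptions for illustration, and fetching and cleaning the article text itself is left out.

```python
import json
import urllib.parse
import urllib.request

# English Wikipedia's MediaWiki API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def category_query_url(category, limit=50):
    """Build a URL that lists article pages in one Wikipedia category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page",  # only articles, no subcategories or files
        "cmlimit": str(limit),
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def titles_from_response(payload):
    """Extract article titles from a decoded categorymembers response."""
    members = payload.get("query", {}).get("categorymembers", [])
    return [member["title"] for member in members]


def fetch_titles(category, limit=50):
    """Fetch one batch of article titles for a category (needs network)."""
    request = urllib.request.Request(
        category_query_url(category, limit),
        # Hypothetical User-Agent; replace with one identifying your project.
        headers={"User-Agent": "small-corpus-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return titles_from_response(json.load(response))
```

    Calling `fetch_titles("Physics", limit=10)` should return up to ten article titles from that category. Larger categories are paginated, so a fuller version would follow the `continue` values in the response, and subcategories would need their own requests.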
