Question
I am looking for a text corpus to run some trial full-text searches against: either something I can download or a system that generates it. Something fairly random would be better, e.g. 1,000,000 Wikipedia articles in a format that is easy to insert into a two-column database table (id, text).
Any ideas or suggestions?
Answer 1:
I'll throw this out there since I'm familiar with it: Prosper.com makes its member loan listings available for analysis through an XML export. The export should contain about 50,000 loan requests with descriptions and over 1,000,000 member profiles (although many of those are empty).
Answer 2:
Project Gutenberg has 32,000 books available.
Edit: As of 17.06.16 there are 52,284 free ebooks available to download as UTF-8 plain text files, covering a wide variety of topics (from science to religion). They are also available in EPUB, Kindle, and HTML formats. See the Project Gutenberg site.
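If you go the plain-text route, getting the books into an (id, text) table is only a few lines. Here is a rough sketch, assuming you have already downloaded a directory of UTF-8 `.txt` files and are happy with SQLite for the trial. The table name `docs` and the function name `load_text_files` are made up for illustration, and ids are simply assigned in filename order.

```python
import sqlite3
from pathlib import Path


def load_text_files(directory, conn):
    """Insert every .txt file in `directory` as one (id, text) row.

    Hypothetical helper: the `docs` table and sequential ids are
    assumptions for this sketch, not anything Project Gutenberg defines.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT)"
    )
    paths = sorted(Path(directory).glob("*.txt"))
    # Read lazily so only one book is held in memory at a time.
    rows = (
        (i, path.read_text(encoding="utf-8", errors="replace"))
        for i, path in enumerate(paths, start=1)
    )
    conn.executemany(
        "INSERT OR REPLACE INTO docs (id, text) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(paths)
```

Something like `load_text_files("books", sqlite3.connect("corpus.db"))` would then fill the table, where `books` is whatever directory you saved the files in.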
Answer 3:
Why not use a Wikipedia dump?
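A dump can be streamed straight into the two-column table from the question. The following is only a sketch: it assumes the usual MediaWiki XML export layout (each `<page>` has its own `<id>` before the `<revision>` element, and the article body sits in `<text>`), and the names `iter_pages`, `load_dump`, and the `docs` table are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, text) for each <page> element in the export."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_id = None
        text = ""
        # Document order: the first <id> seen is assumed to be the page id,
        # later ones belong to the revision and contributor.
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "id" and page_id is None:
                page_id = int(child.text)
            elif name == "text":
                text = child.text or ""
        yield page_id, text
        elem.clear()  # drop the page's children once it has been consumed


def load_dump(source, conn):
    """Load every page from `source` into docs(id, text); return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO docs (id, text) VALUES (?, ?)",
        iter_pages(source),
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
```

Since the dumps are normally distributed compressed, you could pass in a file object from `bz2.open(path, "rb")` rather than unpacking the whole thing first. The text column will contain raw wiki markup, which is probably fine for trial searches.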
Source: https://stackoverflow.com/questions/3095813/looking-for-dataset-to-test-fulltext-style-searches-on