NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

温柔的废话 2021-01-13 03:41

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a

7 Answers
  • 2021-01-13 03:47
    • Use the Wikipedia dumps
      • needs lots of cleanup
    • See if anything in nltk-data helps you
      • the corpora are usually quite small
    • the Wacky people have some free corpora
      • tagged
      • you can spider your own corpus using their toolkit
    • Europarl is free and the basis of pretty much every academic MT system
      • spoken language, translated
    • The Reuters Corpora are free of charge, but only available on CD
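If you do go the Wikipedia-dump route, most of the cleanup is stripping MediaWiki markup. A minimal regex-based sketch of that first pass (a rough approximation only; real dumps have nested templates and tables that need a dedicated parser):

```python
import re

def strip_wiki_markup(text: str) -> str:
    """Remove the most common MediaWiki markup from dump text.
    A rough first pass only; nested templates need a real parser."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                      # templates {{...}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # links [[a|b]] -> b
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quote runs
    text = re.sub(r"<[^>]+>", "", text)                             # leftover HTML tags
    return re.sub(r"\s+", " ", text).strip()                        # collapse whitespace

sample = "The '''[[corpus|corpora]]''' in [[English language|English]] are <b>useful</b>."
print(strip_wiki_markup(sample))  # -> The corpora in English are useful.
```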

    You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.

    If you do this commercially, the LDC might be a viable alternative.

  • 2021-01-13 03:53

    If you're willing to pay money, you should check out the data available at the Linguistic Data Consortium, such as the Penn Treebank.

  • 2021-01-13 04:03

Looking into the Wikipedia data, I noticed that they had done some analysis on bodies of TV and movie scripts. I thought that might be interesting text but not readily accessible; it turns out it is everywhere, and it is structured and predictable enough that it should be possible to clean it up. This site, helpfully titled "A bunch of movie scripts and screenplays in one location on the 'net", would probably be useful to anyone who stumbles on this thread with a similar question.

  • 2021-01-13 04:06

    You can get quotations content (in limited form) here: http://quotationsbook.com/services/

    This content also happens to be on Freebase.

  • 2021-01-13 04:07

You've covered the obvious ones. The only other areas I can think of to supplement:

    1) News articles / blogs.

    2) Magazines are posting a lot of free material online, and you can get a good cross section of topics.

  • 2021-01-13 04:10

    Wikipedia sounds like the way to go. There is an experimental Wikipedia API that might be of use, but I have no clue how it works. So far I've only scraped Wikipedia with custom spiders or even wget.

    Then you could search for pages that offer their full article text in RSS feeds. RSS, because no HTML tags get in your way.
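Pulling the text out of a full-text RSS feed is straightforward with the standard library. A minimal sketch (the sample feed is made up for illustration; the tag-stripping regex is the same crude pass you'd apply to any residual HTML in descriptions):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical inline feed standing in for a real RSS 2.0 download.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First post</title>
        <description>Plain text with a &lt;em&gt;tag&lt;/em&gt; inside.</description></item>
</channel></rss>"""

def feed_texts(rss_xml: str) -> list[str]:
    """Extract item descriptions from an RSS 2.0 feed,
    dropping any residual HTML tags."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        desc = item.findtext("description") or ""
        texts.append(re.sub(r"<[^>]+>", "", desc).strip())
    return texts

print(feed_texts(RSS_SAMPLE))  # -> ['Plain text with a tag inside.']
```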

Scraping mailing lists and/or Usenet has several disadvantages: you'll be getting AOLbonics and Techspeak, and that will tilt your corpus badly.

    The classical corpora are the Penn Treebank and the British National Corpus, but they are paid for. You can read the Corpora list archives, or even ask them about it. Perhaps you will find useful data using the Web as Corpus tools.

I actually have a small project under construction that allows linguistic processing of arbitrary web pages. It should be ready for use within the next few weeks, but so far it's not really meant to be a scraper. I could write a module for it, though; the functionality is already there.
