Ideas for Natural Language Processing project? [closed]


一个人的身影

9 Answers

忘了有多久

2021-01-30 05:25

I'm not sure that much of the curriculum for your class will be useful for either problem 1 or problem 2. Some of the better techniques for these kinds of problems do really simple linguistic processing (part-of-speech tagging, removing stop words, and looking at bigrams and trigrams) and pair it with a machine-learning text classification component that's not too sophisticated on its own. Standard techniques like Naive Bayes classifiers, Maximum Entropy classifiers, and Support Vector Machines can be treated pretty much as black boxes algorithm-wise, and they perform well. Have a look at these survey papers about topical text classification and authorship detection to get an idea of where you can get started.
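To give a feel for how little machinery this takes, here is a minimal sketch of a Naive Bayes text classifier with stop-word removal and bigram features. The stop-word list, the `features` helper, the `NaiveBayes` class, and the toy training sentences are all made up for illustration; a real project would use a proper corpus and probably an existing library.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "this"}


def features(text):
    """Lowercase, drop stop words, and return unigrams plus bigrams."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (hypothetical helper class)."""

    def __init__(self):
        self.class_counts = Counter()                # documents seen per label
        self.feature_counts = defaultdict(Counter)   # feature counts per label
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.class_counts[label] += 1
            for f in features(text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for f in features(text):
                if f in self.vocab:                  # ignore features never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
TRAIN = [
    ("the striker scored a late goal in the match", "sports"),
    ("the team won the league match", "sports"),
    ("the senate passed the budget vote", "politics"),
    ("the president lost the election vote", "politics"),
]

clf = NaiveBayes()
clf.train(TRAIN)
print(clf.predict("a late goal won the match"))
```

Working in log space avoids numeric underflow on longer documents, and the same feature extraction could feed a Maximum Entropy or SVM classifier instead.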

Something better suited to the curriculum you've described might be to construct a morphological analyzer for a foreign language that you're familiar with, or to construct a stemmer (a poor man's version of a morphological analyzer) that maps morphologically-related terms to the same entry in an index -- something that can be used by search engines.
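As a rough idea of what the stemmer option involves, here is a deliberately naive suffix-stripping sketch that maps related word forms to one index entry. The suffix list, the minimum stem length, and the `stem` and `build_index` names are assumptions for illustration only; this is not the Porter algorithm or any other published stemmer.

```python
from collections import defaultdict

# Suffixes are tried in order, longer ones before their shorter tails.
# An illustrative list, not a complete or linguistically careful one.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")
MIN_STEM_LEN = 3  # never strip a word down to fewer than this many letters


def stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain a form of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stem(word)].add(doc_id)
    return index


# Toy documents, invented for this example.
DOCS = {1: "indexing documents", 2: "indexed document", 3: "indexers"}
INDEX = build_index(DOCS)
print(sorted(INDEX["index"]))
```

A blunt rule list like this over-strips and under-strips all the time (it leaves "parse" alone but cuts "parsing" to "pars"), which is exactly the gap a real morphological analyzer is meant to close.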

If you don't need to come up with a new technique for your class (i.e. if you're an undergrad), then there are a large number of standard NLP tasks that you could implement in OCaml, for example a parser trained on the Penn Treebank, a parser for some other grammar formalism, a part-of-speech tagger, or literally dozens of other applications.

暗喜

2021-01-30 05:28

1a. Script that scans Facebook... How do you plan on defining 'obnoxious'? That sounds pretty difficult.

2a. Similarity. Syntax and semantics, and other 'linguisticy' stuff, sounds difficult. People have done this sort of thing with more numerical methods with great success though, for example using the singular value decomposition. I think this method has also been incorporated into software used to check for plagiarism. It is often referred to as latent semantic analysis or latent semantic mapping.

svdlibc: http://tedlab.mit.edu/~dr/svdlibc/
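For a sense of how the SVD approach works, here is a small sketch of latent semantic analysis using NumPy rather than svdlibc (that swap is my assumption, purely for brevity). The toy documents and the helper names `term_document_matrix`, `lsa_doc_vectors`, and `cosine` are invented for the example.

```python
import numpy as np

# Toy documents, invented for this example: two about pets, two about markets.
DOCS = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]


def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    row = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            matrix[row[w], j] += 1
    return matrix


def lsa_doc_vectors(matrix, k):
    """Project each document onto the top-k singular directions."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T  # one row per document


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vectors = lsa_doc_vectors(term_document_matrix(DOCS), k=2)
print(cosine(vectors[0], vectors[1]))  # the two pet documents
print(cosine(vectors[0], vectors[2]))  # a pet document vs. a market document
```

Truncating to a few singular directions is what lets documents with only partly overlapping vocabulary end up close together. Real systems usually weight the counts (for example with tf-idf) before the decomposition.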

1b. FSM stuff. I'm not sure what you mean by 'proving that a transducer is minimal'. Minimization is a pretty standard operation and is included in pretty much any toolkit you might encounter. If you are interested in FSMs, take a look at the

AT&T toolkit: http://www2.research.att.com/~fsmtools/fsm/

or OpenFST toolkit: http://www.openfst.org/

FSMs are growing in popularity as a principled, unified method for doing speech recognition. My graduate work focuses on this subject, and it is indeed very interesting.

What about building an HMM-based parser or chunker, or a simple Viterbi decoder? If you put together a decent training set (you'd have to tag it yourself to begin with), you could approximate a simple version of your 'obnoxious comments' tagger and use that, maybe with some sort of classifier to 'censor' or remove the obnoxious comments.
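A simple Viterbi decoder really is small. Here is a rough sketch over a toy HMM; the tag set, all of the probabilities, and the `viterbi` function signature are made up for illustration, and in a real project the tables would be estimated from your hand-tagged training set. It assumes a non-empty observation sequence.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for a non-empty observation list."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model, with numbers invented for this example.
STATES = ("DET", "NOUN", "VERB")
START_P = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}

print(viterbi(["the", "dog", "barks"], STATES, START_P, TRANS_P, EMIT_P))
```

For anything longer than a short sentence you would want to sum log probabilities instead of multiplying raw ones, and to smooth the emission table so unseen words don't zero out every path.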

北荒

2021-01-30 05:30

Just wanted to say that I love your second idea. It looks like it has everything to do with NLP, and it seems very interesting to me. IMO, I would go with this one, for sure. But you should also weigh which idea attracts you most, and not only the opinions of others.

As for the language to choose, I think you should choose the one that you are more comfortable with. I'm not familiar with libraries but I've seen many NLP things done in Python and Java.
