Binary Feature Extraction

Submitted by 孤者浪人 on 2019-12-24 13:52:40

Question


I am a beginner in feature extraction for natural language processing. I want to know how I can use a hash map to extract features from a text. If each feature is a key in the hash map and its value is the feature's value (all the features are binary, 0 or 1), does that mean I need n hash maps (where n is the number of words in the text), since I need to extract the features for each word?

Am I right?

Thanks in advance, Alice


Answer 1:


Yes, you can implement this with a hash map. However, depending on the number of features and your memory requirements, it may not be the best or fastest data structure; it really depends on your domain. Generally, representing features only as present or absent will yield poor results. A better method is to weight your features with TF-IDF.
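As a rough illustration of both ideas, here is a minimal Python sketch. It builds one dict (hash map) of binary features per token, which is the layout the question describes, and then computes a simple TF-IDF weight per term as an alternative to 0/1 values. The feature names, the helper names `token_features` and `tf_idf`, and the particular TF-IDF variant (relative term frequency times the log of inverse document frequency) are assumptions made for this example, not something prescribed above.

```python
import math
from collections import Counter

def token_features(tokens, i):
    """Binary (0/1) features for the token at position i, stored in a dict.
    The feature names here are made up for illustration."""
    word = tokens[i]
    return {
        "is_capitalized": int(word[:1].isupper()),
        "is_digit": int(word.isdigit()),
        "is_first": int(i == 0),
        "prev_is_the": int(i > 0 and tokens[i - 1].lower() == "the"),
    }

def tf_idf(docs):
    """docs is a list of token lists. Returns one {term: weight} dict per document.
    Uses tf = count / document length and idf = log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

tokens = "The cat sat on the mat".split()
per_token = [token_features(tokens, i) for i in range(len(tokens))]  # n dicts for n tokens

doc_weights = tf_idf([["cat", "sat"], ["cat", "ran"]])
```

With this layout you do end up with n small dicts for n tokens. If memory becomes a concern, one common alternative is to map each feature name to an integer index once and store only the indices of the features that are set.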

The approach you are talking about is the "bag-of-words" approach: you tokenize the document based on word boundaries and use the words as features. As a first pass you should remove stop words (e.g. "a", "and", "the") and then normalize your data (e.g. Now == now == nOw). You can then perform word stemming to further reduce your vector size.
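The steps above can be sketched in a few lines of Python. This is only a toy version under stated assumptions: the stop-word list is a tiny hand-picked set, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's. Lowercasing is done before the stop-word check here so that "The" and "the" are both filtered.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "is", "to"}

def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Return a dict mapping each stemmed, non-stop word to 1 (binary presence)."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())  # tokenize and normalize case
    return {naive_stem(t): 1 for t in tokens if t not in STOP_WORDS}

features = bag_of_words("Now the cats jumped and the dog is jumping")
# features == {"now": 1, "cat": 1, "jump": 1, "dog": 1}
```

Note that a single dict per document is enough for bag-of-words features; "jumped" and "jumping" collapse into one key after stemming, which is how stemming shrinks the feature vector.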

A good way to understand how to extract features is to take a look at MALLET. I also have a very simple implementation of Naive Bayes, with a parser for RCV-1, that you can look at as an example: Naive Bayes



Source: https://stackoverflow.com/questions/15257553/binary-feature-extraction
