Pybrain Text Classification: data and input

问题

I have 3 sets of sentences (varying in word counts), but I don't know how to extract features from the text such that the input dimension will remain the same.

For example, I've tried bag-of-words but, since the word-count variation causes input-dimension variation, I eventually get errors.

I would much appreciate it if you could show me an approach to preparing the string data for the neural network.

Thank you!

(Python 2.7 in Windows 7)

回答1:

How to format the input

This is an extraction from wikipedia.org

Here are two simple text documents:

John likes to watch movies. Mary likes too.
John also likes to watch football games.

Based on these two text documents, a dictionary is constructed as:

{
    "John": 1,
    "likes": 2,
    "to": 3,
    "watch": 4,
    "movies": 5,
    "also": 6,
    "football": 7,
    "games": 8,
    "Mary": 9,
    "too": 10
}

which has 10 distinct words. And using the indexes of the dictionary, each document is represented by a 10-entry vector:

[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

Your input will remain the same size, regardless of the length of your document. I hope this will help you.

来源：https://stackoverflow.com/questions/18070368/pybrain-text-classification-data-and-input

标签

python

machine-learning

neural-network

feature-extraction

pybrain

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!