Question
I have three sets of sentences with varying word counts, but I don't know how to extract features from the text so that the input dimension stays the same.
For example, I tried bag-of-words, but because the word count varies from sentence to sentence, the input dimension varies too, and I end up with errors.
I would greatly appreciate it if you could show me an approach to preparing the string data for the neural network.
Thank you!
(Python 2.7 on Windows 7)
Answer 1:
How to format the input
This is an excerpt from wikipedia.org:
Here are two simple text documents:
John likes to watch movies. Mary likes too.
John also likes to watch football games.
Based on these two text documents, a dictionary is constructed as:
{
"John": 1,
"likes": 2,
"to": 3,
"watch": 4,
"movies": 5,
"also": 6,
"football": 7,
"games": 8,
"Mary": 9,
"too": 10
}
which has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector, where each entry counts how many times the corresponding word occurs in the document:
[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
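Because the vector length equals the number of words in the dictionary, your input stays the same size regardless of the length of the document, as long as the dictionary is built once and then kept fixed. I hope this helps.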
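The counting step above could be sketched in plain Python roughly as follows. This is only a minimal illustration, not PyBrain code: the names `VOCABULARY`, `tokenize` and `vectorize` are made up for this example, the word list is copied from the dictionary above, and splitting on `\w+` is an assumed (very simple) way to tokenize.

```python
import re

# Word order copied from the dictionary above (position 0 here is index 1 there).
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

# Hypothetical helper: map each word to its column in the vector.
WORD_TO_COLUMN = dict((word, i) for i, word in enumerate(VOCABULARY))


def tokenize(text):
    """Split text into words, dropping punctuation (an assumed, simple rule)."""
    return re.findall(r"\w+", text)


def vectorize(text):
    """Return a count vector whose length always equals len(VOCABULARY)."""
    vector = [0] * len(VOCABULARY)
    for word in tokenize(text):
        column = WORD_TO_COLUMN.get(word)
        if column is not None:  # words outside the dictionary are ignored
            vector[column] += 1
    return vector


doc1 = vectorize("John likes to watch movies. Mary likes too.")
doc2 = vectorize("John also likes to watch football games.")
short = vectorize("Mary likes football.")

print(doc1)   # [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
print(doc2)   # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
print(short)  # [0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
```

All three vectors have 10 entries even though the sentences differ in length, so each one can be fed to a network with 10 input units. In practice the dictionary would be built from all of your training sentences rather than typed in by hand, and words that never appeared in training are simply dropped in this sketch.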
Source: https://stackoverflow.com/questions/18070368/pybrain-text-classification-data-and-input