问题
I'm having a text file with label and tweets .
positive,I love this car negative,I hate this book positive,Good product.
I need to convert each line into vector value.If i use seq2sparse
command means the whole document gets converted to vector,but i need to convert each line as vector not the whole document.
ex :
key : positive value : vectorvalue(tweet)
How can we achieve this in mahout?
/* Here is what i have done */
StringTokenizer str= new StringTokenizer(line,",");
String label=str.nextToken();
while (str.hasMoreTokens())
{
tweetline =str.nextToken();
System.out.println("Tweetline"+tweetline);
StringTokenizer words = new StringTokenizer(tweetline," ");
while(words.hasMoreTokens()){
featureList.add(words.nextToken());}
}
Vector unclassifiedInstanceVector = new RandomAccessSparseVector(tweetline.split(" ").length);
FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder(label);
vectorEncoder.setProbes(1);
System.out.println("Feature List: "+featureList);
for (Object feature: featureList) {
vectorEncoder.addToVector((String) feature, unclassifiedInstanceVector);
}
context.write(new Text("/"+label), new VectorWritable(unclassifiedInstanceVector));
Thanks in advance
回答1:
You can write it to app hdfs path with SequenceFile.Writer
FS = FileSystem.get(HBaseConfiguration.create());
String newPath = "/foo/mahouttest/part-r-00000";
Path newPathFile = new Path(newPath);
Text key = new Text();
VectorWritable value = new VectorWritable();
SequenceFile.Writer writer = SequenceFile.createWriter(FS, conf, newPathFile,
key.getClass(), value.getClass());
.....
key.set("c/"+label);
value.set(unclassifiedInstanceVector );
writer.append(key,value);
来源:https://stackoverflow.com/questions/15540387/how-to-vectorize-text-file-in-mahout