How to vectorize a text file in Mahout?

Submitted by 早过忘川 on 2019-12-12 11:43:10

Question


I have a text file containing labels and tweets:

    positive,I love this car
    negative,I hate this book
    positive,Good product.

I need to convert each line into a vector. If I use the seq2sparse command, the whole document is converted into a single vector, but I need one vector per line, not one for the whole document. For example: key: positive, value: vector(tweet). How can this be achieved in Mahout?


Here is what I have done:

    // The first comma-separated token is the label; the rest is the tweet text.
    StringTokenizer str = new StringTokenizer(line, ",");
    String label = str.nextToken();
    while (str.hasMoreTokens()) {
        tweetline = str.nextToken();
        System.out.println("Tweetline" + tweetline);
        // Collect the individual words of the tweet as features.
        StringTokenizer words = new StringTokenizer(tweetline, " ");
        while (words.hasMoreTokens()) {
            featureList.add(words.nextToken());
        }
    }
    // The vector size here is the number of words in the tweet.
    Vector unclassifiedInstanceVector = new RandomAccessSparseVector(tweetline.split(" ").length);
    FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder(label);
    vectorEncoder.setProbes(1);
    System.out.println("Feature List: " + featureList);
    // Hash every word into the vector.
    for (Object feature : featureList) {
        vectorEncoder.addToVector((String) feature, unclassifiedInstanceVector);
    }
    // Emit one (label, vector) pair per input line.
    context.write(new Text("/" + label), new VectorWritable(unclassifiedInstanceVector));

Thanks in advance


Answer 1:


You can write the vectors to an HDFS path with SequenceFile.Writer:

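    // conf is assumed to be the same Configuration used to open the file system.
    Configuration conf = HBaseConfiguration.create();
    FileSystem FS = FileSystem.get(conf);
    String newPath = "/foo/mahouttest/part-r-00000";
    Path newPathFile = new Path(newPath);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    SequenceFile.Writer writer = SequenceFile.createWriter(FS, conf, newPathFile,
            key.getClass(), value.getClass());
    .....
    // Append one (label, vector) record per line.
    key.set("c/" + label);
    value.set(unclassifiedInstanceVector);
    writer.append(key, value);
    // Close the writer once all records are appended so the file is flushed.
    writer.close();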
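
To illustrate the per-line idea without a Hadoop or Mahout setup, here is a minimal plain-Java sketch. It is not Mahout's API: the class `LineVectorizer`, the constant `NUM_FEATURES`, and both helper methods are hypothetical names introduced for this example, and the simple word hashing only stands in for what a `FeatureVectorEncoder` would do. It assumes each line has the form `label,tweet` and splits on the first comma only, so commas inside a tweet are kept.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: turns "label,tweet" lines into
// (label, sparse hashed vector) pairs using only the JDK.
class LineVectorizer {
    // Assumed fixed vector size, shared by every line.
    static final int NUM_FEATURES = 1000;

    // Splits on the first comma only, so commas inside the tweet survive.
    static String[] splitLabelAndText(String line) {
        String[] parts = line.split(",", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected 'label,text': " + line);
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    // Hashes each word to an index and accumulates a count at that index.
    static Map<Integer, Double> encode(String text) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            int index = Math.floorMod(word.hashCode(), NUM_FEATURES);
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] lines = {
            "positive,I love this car",
            "negative,I hate this book",
            "positive,Good product."
        };
        // One (label, vector) pair per input line.
        for (String line : lines) {
            String[] parts = splitLabelAndText(line);
            System.out.println(parts[0] + " -> " + encode(parts[1]));
        }
    }
}
```

In a real job, each pair would instead be appended to the sequence file as a `Text` key and a `VectorWritable` value, as in the snippet above. Using one fixed size for every vector keeps all lines in the same feature space, which a per-tweet word count would not.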


Source: https://stackoverflow.com/questions/15540387/how-to-vectorize-text-file-in-mahout
