Does Mahout provide a way to determine similarity between content (for content-based recommendations)?

孤街浪徒 提交于 2019-11-30 16:19:00

问题


Does Mahout provide a way to determine similarity between content?

I would like to produce content-based recommendations as part of a web application. I know Mahout is good at taking user-ratings matrices and producing recommendations based off of them, but I am not interested in collaborative (ratings-based) recommendations. I want to score how well two pieces of text match and then recommend items that match most closely to text that I store for users in their user profile...

I've read Mahout's documentation, and it looks like it facilitates mainly the collaborative (ratings-based) recommendations, but not content-based recommendations... Is this true?


回答1:


That is not entirely true. Mahout does not have content-based recommender, but it does have algorithms for computing similarities between items based on the content. One of the most popular one is TF-IDF and cosine similarity. However, the computation is not on the fly, but is done offline. You need hadoop to compute the pairwise similarities based on the content more faster. The steps I am going to write are for MAHOUT 0.8. I am not sure if they changed it in 0.9.

Step 1. You need to convert your text documents into seq files. I lost the command for this in MAHOUT-0.8, but in 0.9 is something like this (Please check it for your version of MAHOUT):

$MAHOUT_HOME/bin/mahout seqdirectory
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY>
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}>
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64>
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

Step 2. You need to convert your sequence files into sparse vectors like this:

$MAHOUT_HOME/bin/mahout seq2sparse \
   -i <SEQ INPUT DIR> \
   -o <VECTORS OUTPUT DIR> \
   -ow -chunk 100 \
   -wt tfidf \
   -x 90 \
   -seq \
   -ml 50 \
   -md 3 \
   -n 2 \
   -nv \
   -Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000

where:

  • chunk is the size of the of the file.
  • x the maximum number of the term should occur in to be considered a part of the dictionary file. If it occurs less than -x, then it is considered as a stop word.
  • wt is weighting scheme.
  • md The minimum number of documents the term should occur in to be considered a part of the dictionary file. Any term with lesser frequency is ignored.
  • n The normalization value to use in the Lp space. A detailed explanation of normaliza- tion is given in section 8.4. The default scheme is to not normalize the weights. 2 is good for cosine distance, which we are using in clustering and for similarity
  • nv to get named vectors making further data files easier to inspect.

Step 3. Create a matrix from the vectors:

$MAHOUT_HOME/bin/mahout rowid -i <VECTORS OUTPUT DIR>/tfidf-vectors/part-r-00000 -o <MATRIX OUTPUT DIR>

Step 4. Create a collection of similar docs for each row of the matrix above. This will generate the 50 most similar docs to each doc in the collection.

 $MAHOUT_HOME/bin/mahout rowsimilarity -i <MATRIX OUTPUT DIR>/matrix -o <SIMILARITY OUTPUT DIR> -r <NUM OF COLUMNS FROM THE OUTPUT IN STEP 3> --similarityClassname SIMILARITY_COSINE -m 50 -ess -Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000

This will produce a file with similarities between each item with the top 50 files based on the content.

Now, to use it in your recommendation process you need to read the file or load it into database, depending of how much resources you have. I loaded into main memory using Collection<GenericItemSimilarity.ItemItemSimilarity>. Here are two simple functions that did the job for me:

public static Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix(final File folder, TIntLongHashMap docIndex) throws IOException{
        Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = 
                new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();

        ItemItemSimilarity itemItemCorrelation = null;

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        int n=0;
        for (final File fileEntry : folder.listFiles()) {
            if (fileEntry.isFile()) {
                if(fileEntry.getName().startsWith("part-r")){

                    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(fileEntry.getAbsolutePath()), conf);

                    IntWritable key = new IntWritable();
                    VectorWritable value = new VectorWritable();
                    while (reader.next(key, value)) {

                        long itemID1 = docIndex.get(Integer.parseInt(key.toString()));

                        Iterator<Element> it = value.get().nonZeroes().iterator();

                        while(it.hasNext()){
                            Element next = it.next();
                            long itemID2 =  docIndex.get(next.index());
                            double similarity =  next.get();
                            //System.out.println(itemID1+ " : "+itemID2+" : "+similarity);

                            if (similarity < -1.0) {
                                similarity = -1.0;
                            } else if (similarity > 1.0) {
                                similarity = 1.0;
                            }


                            itemItemCorrelation = new GenericItemSimilarity.ItemItemSimilarity(itemID1, itemID2, similarity);

                            corrMatrix.add(itemItemCorrelation);
                        }
                    }
                    reader.close();
                    n++;
                    logger.info("File "+fileEntry.getName()+" readed ("+n+"/"+folder.listFiles().length+")");
                }
            }
        }

        return corrMatrix;
    }


public static TIntLongHashMap getDocIndex(String docIndex) throws IOException{
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        TIntLongHashMap map = new TIntLongHashMap();
        SequenceFile.Reader docIndexReader = new SequenceFile.Reader(fs, new Path(docIndex), conf);

        IntWritable key = new IntWritable();
        Text value = new Text();
        while (docIndexReader.next(key, value)) {
            map.put(key.get(), Long.parseLong(value.toString()));
        }

        return map;
    }

At the end, in your recommendation class you call this:

TIntLongHashMap docIndex = ItemPairwiseSimilarityUtil.getDocIndex(filename);
TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix = ItemPairwiseSimilarityUtil.correlatedItems(folder, docIndex);

Where filename is your docIndex filename, and folder is the folder of the item-similarity files. At the end, this is nothing more than item-item based recommendation.

Hope this can help you



来源:https://stackoverflow.com/questions/22789986/does-mahout-provide-a-way-to-determine-similarity-between-content-for-content-b

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!