Mahout precomputed Item-item similarity - slow recommendation

Submitted 2020-01-03 03:45:23

Question


I am having performance issues with precomputed item-item similarities in Mahout.

I have 4 million users with roughly the same amount of items, with around 100M user-item preferences. I want to do content-based recommendation based on the Cosine similarity of the TF-IDF vectors of the documents. Since computing this on the fly is slow, I precomputed the pairwise similarity of the top 50 most similar documents as follows:

  1. I used seq2sparse to produce the TF-IDF vectors.
  2. I used mahout rowId to produce a Mahout matrix.
  3. I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to produce the top 50 most similar documents.

I used Hadoop to precompute all of this. For 4 million items, the output was only 2.5GB.

Then I loaded the content of the files produced by the reducers into Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ..., using the docIndex to decode the IDs of the documents. They were already integers, but rowId re-indexed them starting from 1, so I had to map them back.
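The ID translation step above can be sketched as follows. This is a minimal plain-Java illustration, assuming the docIndex has already been read into an in-memory Map; DocIdRemapper and toOriginalId are illustrative names, not Mahout API:

```java
import java.util.Map;

// Sketch of mapping rowId's dense matrix indices back to the original
// document IDs via the docIndex. rowId assigns its own consecutive row
// numbers, so every similarity entry must be translated back before it
// can be used as an item ID in the data model.
public class DocIdRemapper {

    /** Translate a rowId matrix index back to the original document ID. */
    public static long toOriginalId(int rowIndex, Map<Integer, Long> docIndex) {
        Long id = docIndex.get(rowIndex);
        if (id == null) {
            throw new IllegalArgumentException("unknown row index: " + rowIndex);
        }
        return id;
    }
}
```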

For recommendation I use the following code:

ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);

CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(),  model.getNumItems());
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(),  model.getNumItems());

Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
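Under the hood, an item-based recommender scores each candidate item as a similarity-weighted average of the user's existing preferences. A minimal plain-Java sketch of that estimate (illustrative names and stand-in maps, not Mahout's actual internals):

```java
import java.util.Map;

// Plain-Java sketch of the weighted-average estimate an item-based
// recommender computes per candidate item: each preference the user has
// expressed is weighted by that item's similarity to the candidate.
public class ItemBasedScore {

    /**
     * Estimate a preference for candidateItem from the user's known
     * preferences and an item-item similarity lookup.
     * Returns Double.NaN when no rated item is similar to the candidate.
     */
    public static double estimate(long candidateItem,
                                  Map<Long, Double> userPrefs,
                                  Map<Long, Map<Long, Double>> similarity) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Long, Double> pref : userPrefs.entrySet()) {
            Map<Long, Double> row = similarity.get(pref.getKey());
            Double sim = (row == null) ? null : row.get(candidateItem);
            if (sim != null) {
                weightedSum += sim * pref.getValue();
                totalWeight += Math.abs(sim);
            }
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }
}
```

This also shows why a sparse similarity matrix can produce no recommendations: if none of the user's rated items have precomputed neighbours among the candidates, every estimate is NaN.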

I am trying it with a limited data model (1.6M items), but I loaded all the item-item pairwise similarities into memory. I managed to load everything into main memory using 40GB.

When I want to do a recommendation for one user:

Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);

The elapsed time for the recommendation process is 554.938583083 seconds, and on top of that it did not produce any recommendations. Right now I am really concerned about the performance of the recommendation. I played with the numbers for CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but I didn't get any improvement in performance.

Isn't the whole idea of precomputing everything supposed to speed up the recommendation process? Could someone please help me and tell me where and what I am doing wrong? Also, why does loading the pairwise similarities into main memory blow up so badly? 2.5GB of files gets loaded into 40GB of main memory as a Collection<GenericItemSimilarity.ItemItemSimilarity>. I know that the files are serialized as IntWritable, VectorWritable key-value pairs, and that the key has to repeat for every vector value in the ItemItemSimilarity matrix, but this is a little too much, don't you think?

Thank you in advance.


Answer 1:


I stand corrected about the time needed for computing the recommendation using a Collection of precomputed values. Apparently I had put long startTime = System.nanoTime(); at the top of my code, not right before List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);. So the measurement also counted the time needed to load the dataset and the precomputed item-item similarities into main memory.
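For reference, the timing mistake is easy to avoid by wrapping only the call under measurement. A minimal sketch (the helper name is illustrative; in the real code the Runnable would wrap the recommend(...) call):

```java
// Minimal timing sketch: start the clock immediately before the work
// being measured, not at the top of main(), so that data loading and
// similarity loading are excluded from the measurement.
public class TimedCall {

    /** Runs the given work and returns the elapsed wall-clock seconds. */
    public static double timeSeconds(Runnable work) {
        long start = System.nanoTime(); // start right before the work
        work.run();
        return (System.nanoTime() - start) / 1e9;
    }
}
```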

However, I stand behind the memory consumption concern. I did improve it by using a custom ItemSimilarity and loading a HashMap<Long, HashMap<Long, Double>> of the precomputed similarities. I used the Trove library in order to reduce the space requirements.

Here is the detailed code. The custom ItemSimilarity:

import java.util.Collection;

// Trove 3 package names; Trove 2 used gnu.trove.* instead
import gnu.trove.map.hash.TLongDoubleHashMap;
import gnu.trove.map.hash.TLongObjectHashMap;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class TextItemSimilarity implements ItemSimilarity {

    private final TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix;

    public TextItemSimilarity(TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix) {
        this.correlationMatrix = correlationMatrix;
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        TLongDoubleHashMap similarToItemId1 = correlationMatrix.get(itemID1);
        if (similarToItemId1 != null && !similarToItemId1.isEmpty() && similarToItemId1.containsKey(itemID2)) {
            return similarToItemId1.get(itemID2);
        }
        return 0;
    }

    @Override
    public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
            result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
    }

    @Override
    public long[] allSimilarItemIDs(long itemID) throws TasteException {
        TLongDoubleHashMap similar = correlationMatrix.get(itemID);
        // Guard against items that have no precomputed neighbours
        return similar == null ? new long[0] : similar.keys();
    }
}

The total memory consumption together with my dataset using Collection<GenericItemSimilarity.ItemItemSimilarity> is 30GB; using TLongObjectHashMap<TLongDoubleHashMap> and the custom TextItemSimilarity, the space requirement is 17GB. The time performance is 0.05 sec using Collection<GenericItemSimilarity.ItemItemSimilarity> and 0.07 sec using TLongObjectHashMap<TLongDoubleHashMap>. I also believe that the choice of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy plays a big role in the performance.
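As a rough back-of-envelope for the blow-up from 2.5GB on disk, here is a sketch using assumed 64-bit HotSpot object sizes (not measured numbers, and actual sizes vary with JVM flags such as compressed oops). Each pair becomes a full GenericItemSimilarity.ItemItemSimilarity object plus a collection reference, and GenericItemSimilarity additionally indexes the pairs internally, multiplying the footprint further:

```java
// Back-of-envelope estimate of the heap cost of holding item-item pairs
// as objects in a Collection. Assumed sizes: 16-byte object header,
// 8-byte reference; each pair carries two long IDs and one double.
public class SimilarityMemoryEstimate {

    static final long OBJECT_HEADER = 16;
    static final long REFERENCE = 8;

    /** Rough heap bytes for n pairs held as objects in a collection. */
    public static long collectionOfPairsBytes(long pairs) {
        long perPair = OBJECT_HEADER + 8 + 8 + 8; // two longs + one double
        return pairs * (perPair + REFERENCE);     // plus the collection's reference
    }

    public static void main(String[] args) {
        long pairs = 4_000_000L * 50; // 4M items, top-50 neighbours each
        System.out.printf("~%.1f GB just for the pair objects%n",
                collectionOfPairsBytes(pairs) / 1e9);
    }
}
```

Against this, the serialized file pays roughly one int plus one double per entry, which is why the on-disk size is several times smaller than the heap footprint.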

I guess if you want to save some space, use the Trove hash maps, and if you want just slightly better lookup performance, you can use Collection<GenericItemSimilarity.ItemItemSimilarity>.



Source: https://stackoverflow.com/questions/18587938/mahout-precomputed-item-item-similarity-slow-recommendation
