Question
I am having performance issues with precomputed item-item similarities in Mahout.
I have 4 million users and roughly the same number of items, with around 100M user-item preferences. I want to do content-based recommendation based on the cosine similarity of the TF-IDF vectors of the documents. Since computing this on the fly is slow, I precomputed the pairwise similarities for the top 50 most similar documents as follows:
- I used seq2sparse to produce the TF-IDF vectors.
- I used mahout rowId to produce a Mahout matrix.
- I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to produce the top 50 most similar documents.
I used Hadoop to precompute all of this. For 4 million items, the output was only 2.5GB.
Then I loaded the content of the files produced by the reducers into Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ..., using the docIndex to decode the ids of the documents. They were already integers, but rowId had re-indexed them starting from 1, so I have to map them back.
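That loading step looks roughly like the following sketch (an illustration rather than my exact code: the part-file path, the loadDocIndex helper, and holding docIndex as a Map<Integer, Long> are assumptions; nonZeroes() is the Mahout 0.9+ iteration API, older versions use iterateNonZero()):

// uses org.apache.hadoop.io.IntWritable/SequenceFile, org.apache.mahout.math.Vector/VectorWritable
// and org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path part = new Path("OUTPUT/part-r-00000");           // one reducer output file (assumed path)
Map<Integer, Long> docIndex = loadDocIndex(fs, conf);  // hypothetical helper that reads the rowId docIndex

Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();

SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
IntWritable rowIndex = new IntWritable();
VectorWritable rowVector = new VectorWritable();
try {
    while (reader.next(rowIndex, rowVector)) {
        long itemID1 = docIndex.get(rowIndex.get());             // map the rowId index back to the original doc id
        for (Vector.Element e : rowVector.get().nonZeroes()) {
            long itemID2 = docIndex.get(e.index());
            corrMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(itemID1, itemID2, e.get()));
        }
    }
} finally {
    reader.close();
}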
For recommendation I use the following code:
ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
I am testing it with a limited data model (1.6M items), but I loaded all the item-item pairwise similarities into memory; I managed to load everything into main memory using 40GB.
When I want to produce recommendations for one user:
Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);
The elapsed time for the recommendation process is 554.938583083 seconds, and on top of that it did not produce any recommendations. Right now I am really concerned about the performance of the recommendation. I played with the parameters of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but I didn't get any improvement in performance.
Isn't the idea of precomputing everything supposed to speed up the recommendation process?
Could someone please help me and tell me where I am going wrong and what I am doing wrong?
Also, why does loading the pairwise similarities into main memory blow up so much? 2.5GB of files ends up taking 40GB of main memory in a Collection<GenericItemSimilarity.ItemItemSimilarity>. I know that the files are serialized as IntWritable, VectorWritable key-value pairs, and that the key has to be repeated for every vector value in the ItemItemSimilarity matrix, but this is a little too much, don't you think?
Thank you in advance.
Answer 1:
I stand corrected about the time needed for computing the recommendation using a Collection of precomputed values. Apparently I had put long startTime = System.nanoTime(); at the top of my code, not just before List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);, so it also counted the time needed to load the dataset and the precomputed item-item similarities into main memory.
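Measured correctly, the timer wraps only the recommend() call, something like:

long startTime = System.nanoTime();
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);
double elapsedSeconds = (System.nanoTime() - startTime) / 1e9;  // only the recommendation step
System.out.println("recommend() took " + elapsedSeconds + " s and returned " + recommendations.size() + " items");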
However, I stand behind the memory consumption numbers. I improved them by using a custom ItemSimilarity and loading a HashMap<Long, HashMap<Long, Double>> of the precomputed similarities, and I used the Trove library in order to reduce the space requirements.
Here is the code in detail. The custom ItemSimilarity:
import java.util.Collection;

import gnu.trove.map.hash.TLongDoubleHashMap;
import gnu.trove.map.hash.TLongObjectHashMap;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class TextItemSimilarity implements ItemSimilarity {

    // item id -> (similar item id -> cosine similarity), stored in Trove primitive maps
    private final TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix;

    public TextItemSimilarity(TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix) {
        this.correlationMatrix = correlationMatrix;
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
        // nothing to refresh: the similarity matrix is precomputed and static
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        TLongDoubleHashMap similarToItemID1 = correlationMatrix.get(itemID1);
        if (similarToItemID1 != null && !similarToItemID1.isEmpty() && similarToItemID1.containsKey(itemID2)) {
            return similarToItemID1.get(itemID2);
        }
        return 0;
    }

    @Override
    public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
            result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
    }

    @Override
    public long[] allSimilarItemIDs(long itemID) throws TasteException {
        TLongDoubleHashMap similarToItem = correlationMatrix.get(itemID);
        return similarToItem != null ? similarToItem.keys() : new long[0];
    }
}
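For completeness, here is a rough sketch of how such a map can be filled and plugged into the recommender (the PrecomputedPair type and the precomputedPairs iterable are made up for illustration; they stand in for whatever iterates over the (itemID1, itemID2, similarity) triples read from the reducer output):

TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix = new TLongObjectHashMap<TLongDoubleHashMap>();
for (PrecomputedPair pair : precomputedPairs) {
    TLongDoubleHashMap row = correlationMatrix.get(pair.getItemID1());
    if (row == null) {
        row = new TLongDoubleHashMap();
        correlationMatrix.put(pair.getItemID1(), row);
    }
    row.put(pair.getItemID2(), pair.getSimilarity());
}

ItemSimilarity similarity = new TextItemSimilarity(correlationMatrix);
Recommender recommender = new GenericItemBasedRecommender(model, similarity,
        candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);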
The total memory consumption together with my data set is 30GB when using Collection<GenericItemSimilarity.ItemItemSimilarity>, and 17GB when using TLongObjectHashMap<TLongDoubleHashMap> with the custom TextItemSimilarity.
The time performance is 0.05 sec using Collection<GenericItemSimilarity.ItemItemSimilarity>, and 0.07 sec using TLongObjectHashMap<TLongDoubleHashMap>. I also believe that the choice of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy plays a big role in the performance.
I guess if you want to save some space, use the Trove hash maps, and if you want slightly better performance, you can use Collection<GenericItemSimilarity.ItemItemSimilarity>.
Source: https://stackoverflow.com/questions/18587938/mahout-precomputed-item-item-similarity-slow-recommendation