Does Mahout provide a way to determine similarity between content?
I would like to produce content-based recommendations as part of a web application. I know Mahout
That is not entirely true. Mahout does not have a content-based recommender, but it does have algorithms for computing similarities between items based on their content. One of the most popular approaches is TF-IDF weighting combined with cosine similarity. However, the computation is not done on the fly; it is done offline. You need Hadoop to compute the pairwise content similarities faster. The steps below are for Mahout 0.8; I am not sure whether they changed in 0.9.
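To make the idea concrete before getting into the Mahout commands, here is a minimal in-memory sketch of TF-IDF weighting followed by cosine similarity. It is plain Java, not Mahout code: the class name TfIdfCosineSketch, the plain log(N/df) IDF formula, and the toy documents are my own assumptions, and Mahout's weighting differs in the details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Toy in-memory TF-IDF + cosine similarity; illustrative only, not Mahout's implementation. */
final class TfIdfCosineSketch {

    /** Builds one sparse TF-IDF vector (term -> weight) per tokenized document. */
    static List<Map<String, Double>> tfidf(List<List<String>> docs) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Double> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1.0, Double::sum);
            }
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Double> e : tf.entrySet()) {
                // Assumed IDF variant: log(N / df). Terms present in every document get weight 0.
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("mahout", "recommender", "hadoop"),
            Arrays.asList("mahout", "recommender", "similarity"),
            Arrays.asList("cooking", "pasta", "recipe"));
        List<Map<String, Double>> v = tfidf(docs);
        System.out.println("doc0 vs doc1: " + cosine(v.get(0), v.get(1)));
        System.out.println("doc0 vs doc2: " + cosine(v.get(0), v.get(2)));
    }
}
```

Documents that share weighted terms get a positive score, and documents with no terms in common get 0. The Mahout jobs below do the same kind of computation, but distributed over Hadoop and for all pairs at once.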
Step 1. You need to convert your text documents into sequence files. I lost the command I used for this in Mahout 0.8, but in 0.9 it is something like this (please check it for your version of Mahout):
$MAHOUT_HOME/bin/mahout seqdirectory
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY>
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}>
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64>
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
Step 2. You need to convert your sequence files into sparse vectors like this:
$MAHOUT_HOME/bin/mahout seq2sparse \
-i <SEQ INPUT DIR> \
-o <VECTORS OUTPUT DIR> \
-ow -chunk 100 \
-wt tfidf \
-x 90 \
-seq \
-ml 50 \
-md 3 \
-n 2 \
-nv \
-Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
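where:
-ow overwrites the output directory if it already exists
-chunk is the chunk size in megabytes
-wt is the weighting scheme (tf or tfidf)
-x is the maximum document frequency, as a percentage of all documents; terms above it are dropped
-seq writes the output as sequential access vectors
-ml is the minimum log-likelihood ratio (only relevant when generating n-grams)
-md is the minimum number of documents a term must occur in to be kept
-n is the norm used to normalize the vectors (2 means the L2 norm)
-nv writes the output as named vectors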
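As a rough illustration of what the -md, -x and -n options in the command above are meant to do, here is a small plain-Java sketch of document-frequency pruning and L2 normalization. The class VectorPruningSketch and its method names are hypothetical, and Mahout's exact handling of the boundary values may differ from what I assume here.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: dictionary pruning by document frequency and L2 normalization. */
final class VectorPruningSketch {

    /**
     * Keeps the terms that occur in at least minDf documents and in at most
     * maxDfPercent percent of all documents (assumed inclusive bounds).
     */
    static Map<String, Integer> pruneByDf(Map<String, Integer> df, int numDocs,
                                          int minDf, int maxDfPercent) {
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            int count = e.getValue();
            double percent = 100.0 * count / numDocs;
            if (count >= minDf && percent <= maxDfPercent) {
                kept.put(e.getKey(), count);
            }
        }
        return kept;
    }

    /** Scales a sparse vector to unit Euclidean length; an all-zero vector is returned unchanged. */
    static Map<String, Double> l2Normalize(Map<String, Double> vec) {
        double sumOfSquares = 0.0;
        for (double v : vec.values()) {
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        Map<String, Double> unit = new HashMap<>();
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            unit.put(e.getKey(), norm == 0.0 ? e.getValue() : e.getValue() / norm);
        }
        return unit;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 100);    // occurs in every document: too frequent
        df.put("mahout", 10);
        df.put("typo", 1);     // occurs only once: too rare
        System.out.println(pruneByDf(df, 100, 3, 90).keySet());
    }
}
```

Once the vectors have unit length, the cosine similarity of two documents reduces to their dot product, which is why normalizing in this step is convenient for the similarity job later on.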
Step 3. Create a matrix from the vectors:
$MAHOUT_HOME/bin/mahout rowid -i <VECTORS OUTPUT DIR>/tfidf-vectors/part-r-00000 -o <MATRIX OUTPUT DIR>
Step 4. Create a collection of similar docs for each row of the matrix above. This will generate the 50 most similar docs to each doc in the collection.
$MAHOUT_HOME/bin/mahout rowsimilarity -i <MATRIX OUTPUT DIR>/matrix -o <SIMILARITY OUTPUT DIR> -r <NUM OF COLUMNS FROM THE OUTPUT IN STEP 3> --similarityClassname SIMILARITY_COSINE -m 50 -ess -Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
This will produce a file that contains, for each item, its similarities to the 50 most similar items based on content.
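To show what "the top 50 for each row" means, here is a small sketch that keeps only the m best-scoring columns of one similarity row. It is plain Java with a hypothetical class name (TopSimilarSketch) and a dense array for simplicity; it is not how the rowsimilarity job is implemented internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: pick the m most similar columns for one row of a similarity matrix. */
final class TopSimilarSketch {

    /** Returns the indices of the m highest similarities in the row, best first, skipping the row itself. */
    static List<Integer> topM(final double[] row, int self, int m) {
        if (m <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on similarity: the weakest candidate kept so far sits on top and is evicted first.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> row[i]));
        for (int i = 0; i < row.length; i++) {
            if (i == self) {
                continue; // a document is trivially similar to itself
            }
            heap.add(i);
            if (heap.size() > m) {
                heap.poll();
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort((a, b) -> Double.compare(row[b], row[a]));
        return result;
    }

    public static void main(String[] args) {
        double[] similaritiesOfDoc0 = {1.0, 0.2, 0.9, 0.5};
        System.out.println(topM(similaritiesOfDoc0, 0, 2));
    }
}
```

Keeping only the top m entries per row is what keeps the output small enough to load into memory later.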
Now, to use it in your recommendation process, you need to read the file or load it into a database, depending on how many resources you have. I loaded it into main memory using Collection<GenericItemSimilarity.ItemItemSimilarity>. Here are two simple functions that did the job for me:
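import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;

import gnu.trove.map.hash.TIntLongHashMap; // Trove 3.x package; older Trove versions use gnu.trove.TIntLongHashMap
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Reads every part-r-* file of the rowsimilarity output and turns each
// (row, column, similarity) entry into an ItemItemSimilarity.
public static Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix(final File folder, TIntLongHashMap docIndex) throws IOException {
    Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    Configuration conf = new Configuration();
    // Local file system unless the Hadoop configuration on the classpath sets another default.
    FileSystem fs = FileSystem.get(conf);
    File[] files = folder.listFiles();
    if (files == null) {
        throw new IOException("Not a readable directory: " + folder);
    }
    int n = 0;
    for (final File fileEntry : files) {
        if (fileEntry.isFile() && fileEntry.getName().startsWith("part-r")) {
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(fileEntry.getAbsolutePath()), conf);
            try {
                IntWritable key = new IntWritable();
                VectorWritable value = new VectorWritable();
                while (reader.next(key, value)) {
                    // Map the matrix row index back to the original document ID.
                    long itemID1 = docIndex.get(key.get());
                    for (Vector.Element next : value.get().nonZeroes()) {
                        long itemID2 = docIndex.get(next.index());
                        // ItemItemSimilarity only accepts values in [-1, 1], so clamp rounding errors.
                        double similarity = Math.max(-1.0, Math.min(1.0, next.get()));
                        corrMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(itemID1, itemID2, similarity));
                    }
                }
            } finally {
                reader.close();
            }
            n++;
            logger.info("File " + fileEntry.getName() + " read (" + n + "/" + files.length + ")");
        }
    }
    return corrMatrix;
}

// Reads the docIndex file written by rowid: matrix row index -> original document key.
public static TIntLongHashMap getDocIndex(String docIndex) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    TIntLongHashMap map = new TIntLongHashMap();
    SequenceFile.Reader docIndexReader = new SequenceFile.Reader(fs, new Path(docIndex), conf);
    try {
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (docIndexReader.next(key, value)) {
            // Assumes the document keys are numeric IDs.
            map.put(key.get(), Long.parseLong(value.toString()));
        }
    } finally {
        docIndexReader.close();
    }
    return map;
}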
Finally, in your recommendation class, you call this:
TIntLongHashMap docIndex = ItemPairwiseSimilarityUtil.getDocIndex(filename);
Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix = ItemPairwiseSimilarityUtil.correlationMatrix(folder, docIndex);
Where filename is your docIndex filename, and folder is the folder containing the item-similarity files. In the end, this is nothing more than item-item based recommendation.
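To illustrate what item-item based recommendation does with such precomputed similarities, here is a minimal sketch in plain Java. It does not use Mahout's recommender classes: the class ItemBasedScoringSketch, the nested-map layout, and the "sum of similarities" scoring rule are my own assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: rank unseen items by their summed similarity to the items a user liked. */
final class ItemBasedScoringSketch {

    /**
     * similar maps an item ID to its precomputed neighbours (neighbour ID -> similarity).
     * Returns up to howMany item IDs the user has not liked yet, best first.
     */
    static List<Long> recommend(Map<Long, Map<Long, Double>> similar, Set<Long> liked, int howMany) {
        final Map<Long, Double> scores = new HashMap<>();
        for (Long item : liked) {
            Map<Long, Double> neighbours =
                similar.getOrDefault(item, Collections.<Long, Double>emptyMap());
            for (Map.Entry<Long, Double> e : neighbours.entrySet()) {
                if (!liked.contains(e.getKey())) {
                    // Each liked item votes for its neighbours with its similarity value.
                    scores.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(Math.max(howMany, 0), ranked.size())));
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> similar = new HashMap<>();
        Map<Long, Double> neighboursOf1 = new HashMap<>();
        neighboursOf1.put(2L, 0.9);
        neighboursOf1.put(3L, 0.1);
        similar.put(1L, neighboursOf1);
        Set<Long> liked = new HashSet<>(Arrays.asList(1L));
        System.out.println(recommend(similar, liked, 10));
    }
}
```

Because the similarities come from the documents' content rather than from user ratings, new items can be recommended as soon as they have been run through the offline pipeline.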
Hope this can help you