mahout

Mahout Clustering in Practice: Canopy + K-means

旧巷老猫 posted on 2019-12-04 18:57:09
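Mahout is a top-level Apache open-source project. It grew out of Lucene, is built on Hadoop, and provides efficient implementations of classic machine-learning algorithms for large-scale data. For the classic clustering algorithms it offers both single-machine implementations and Hadoop-based distributed implementations, all of which make good study material. Cluster analysis: Clustering can be understood simply as dividing data objects into multiple clusters, where all data objects within a cluster share a certain similarity, so that a cluster can be treated as a single unit, which improves the quality of the computation or reduces its cost. There are many classic algorithms for measuring the similarity between data objects, but they all rely on essentially the same data structure, the vector. Common ones include Euclidean distance, cosine distance, and the Pearson correlation coefficient. Mahout provides implementations of all of these, and when you implement your own clustering you can switch between distance measures through an interface. Data model: During Mahout's cluster analysis, data objects are converted into vectors (Vector) for the computation. The Mahout interface is org.apache.mahout.math.Vector, in which every field is represented by a floating-point number (double). You can implement your own vector model by extending a Mahout base class such as AbstractVector, or use one of the implementations it already provides, as follows: 1. DenseVector, which is implemented as an array of floating-point numbers and stores every field of the vector, making it suitable for dense vectors. 2.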
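As a rough illustration of the distance measures named above, the following plain-Java sketch computes Euclidean distance and cosine distance over double arrays. It deliberately does not use Mahout's own classes; the class name DistanceSketch and its method names are hypothetical and are not part of Mahout's API.

```java
// Plain-Java sketch of two distance measures over double[] vectors.
// Assumption: DistanceSketch and its methods are illustrative names only,
// not Mahout classes.
final class DistanceSketch {

    /** Euclidean distance: square root of the summed squared differences. */
    static double euclidean(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Cosine distance: 1 minus the cosine of the angle between a and b. */
    static double cosineDistance(double[] a, double[] b) {
        checkSameLength(a, b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println("euclidean(p, q) = " + euclidean(p, q));

        double[] x = {1.0, 0.0};
        double[] y = {0.0, 1.0};
        System.out.println("cosineDistance(x, y) = " + cosineDistance(x, y));
    }
}
```

Keeping both measures behind the same (double[], double[]) signature mirrors the idea of swapping distance measures through a common interface.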

Mahout In Action

两盒软妹~` posted on 2019-12-04 18:56:54
Mahout In Action. Author: Jack Zhang, from 开拓者部落 (QQ group: 248087140; you are welcome to join us!). This article may be reposted; please credit the source: http://my.oschina.net/u/1866370/blog/287907 i. Java and IDE (omitted) ii. Maven (omitted) iii. Setting up the Mahout development environment 1. Mahout website: http://mahout.apache.org/ 2. The page on the Mahout website about Mahout dependencies: http://mahout.apache.org/general/downloads.html Page 2 shows Mahout's Maven coordinates: <dependency> <groupId>org.apache.mahout</groupId> <artifactId>mahout-core</artifactId> <version>${mahout.version}</version> </dependency> Installation steps: Maven makes setting up a Mahout environment simple and convenient; you only need to add the following to the pom. <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <mahout.version>0.6</mahout

Cassandra based Mahout user friend recommendations

。_饼干妹妹 posted on 2019-12-04 18:18:56
I want to recommend to a user a list of users whom the current user can add as friends. I am using Cassandra and Mahout. There is already an implementation of CassandraDataModel in the Mahout integration package, and I want to use this class. My recommender class looks as follows: public class UserFriendsRecommender { @Inject private CassandraDataModel dataModel; public List<RecommendedItem> recommend(Long userId, int number) throws TasteException{ UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel); // Optional: userSimilarity.setPreferenceInferrer(new

What are the steps needed to use the Mahout Naive Bayes Classifier Algorithm?

孤街浪徒 posted on 2019-12-04 16:40:30
I am trying to use the Naive Bayes classifier to detect fraudulent transactions. I have about 5000 sample records in an Excel sheet, which I will use to train the classifier, and about 1000 test records on which I will run the trained classifier. My problem is that I don't know how to train the classifier. Do I need to transform my training data into some specific format before passing it to the trainer? How will the trainer know which column is my target value and which are its features? Can someone please help me? In order to test your data, you need to

Utilizing multiple, weighted data models for a Mahout recommender

十年热恋 posted on 2019-12-04 13:57:13
I have a boolean preference recommender based on user similarity. My data set essentially contains relations where ItemId are articles the user has decided to read. I'd like to add a second data model containing where ItemId is a subscription to a particular topic. The only way I can imagine doing this is by merging the two together, offsetting the subscription IDs so that they don't collide with the article IDs. For weighting I considered dropping the boolean preference setup and introducing preference scores, where the articles subset has a preference score of 1 (for example) and the
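The ID-offset idea described in this question could be sketched roughly as follows in plain Java. This is only an illustration under stated assumptions: the class MergedPreferencesSketch, the Pref type, the offset value, and the two score parameters are all hypothetical, the sketch does not use Mahout's DataModel API, and it assumes every article ID is smaller than the chosen offset.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: merge article reads and topic subscriptions into one preference
// list, shifting subscription IDs so they cannot collide with article IDs.
// All names and values here are hypothetical, not taken from Mahout.
final class MergedPreferencesSketch {

    // Assumption: every real article ID is smaller than this offset.
    static final long SUBSCRIPTION_ID_OFFSET = 1_000_000_000L;

    /** One (user, item, score) triple in the merged model. */
    static final class Pref {
        final long userId;
        final long itemId;
        final float score;

        Pref(long userId, long itemId, float score) {
            this.userId = userId;
            this.itemId = itemId;
            this.score = score;
        }
    }

    /**
     * Each input row is {userId, itemId}. Article rows keep their item ID;
     * subscription rows get the offset added to their item ID.
     */
    static List<Pref> merge(long[][] reads, long[][] subscriptions,
                            float articleScore, float subscriptionScore) {
        List<Pref> merged = new ArrayList<>();
        for (long[] r : reads) {
            merged.add(new Pref(r[0], r[1], articleScore));
        }
        for (long[] s : subscriptions) {
            merged.add(new Pref(s[0], s[1] + SUBSCRIPTION_ID_OFFSET, subscriptionScore));
        }
        return merged;
    }

    public static void main(String[] args) {
        long[][] reads = {{1L, 10L}, {2L, 11L}};
        long[][] subscriptions = {{1L, 10L}}; // same raw ID as an article
        for (Pref p : merge(reads, subscriptions, 1.0f, 2.0f)) {
            System.out.println(p.userId + "," + p.itemId + "," + p.score);
        }
    }
}
```

The two score parameters stand in for the weighting the question considers; which values are appropriate is left open here.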

Interpreting output from mahout clusterdumper

最后都变了- posted on 2019-12-04 12:35:52
Question: I ran a clustering test on crawled pages (more than 25K docs; personal data set). I then ran clusterdump: $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt The cluster dumper output shows 25 elements of the form "VL-xxxxx {}": VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]} ..

Mahout runs out of heap space

限于喜欢 posted on 2019-12-04 08:36:07
I am running NaiveBayes on a set of tweets using Mahout. There are two files, one 100 MB and one 300 MB. I changed JAVA_HEAP_MAX to JAVA_HEAP_MAX=-Xmx2000m (earlier it was 1000), but even then Mahout ran for a few hours (2, to be precise) before it failed with a heap space error. What should I do to resolve this? Some more info, if it helps: I am running on a single node, my laptop in fact, and it has only 3 GB of RAM. Thanks. EDIT: I ran it a third time with less than half of the data I used the first time (the first time I used 5.5 million tweets, the second time 2 million) and I still got a heap space problem.

Mahout rowSimilarity

岁酱吖の posted on 2019-12-04 04:47:52
Question: I am trying to compute row similarity between Wikipedia documents. I have the tf-idf vectors in the format Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable . I am following the quick tour of text analysis from here: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line I created a Mahout matrix as follows: mahout rowid \ -i wikipedia-vectors/tfidf-vectors/part-r-00000 -o wikipedia-matrix I

Why is Maven trying to compile my code as -source 1.3?

六月ゝ 毕业季﹏ posted on 2019-12-04 04:19:32
I get this error from mvn -e package on Ubuntu 12.04: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project HadoopSkeleton: Compilation failure: Compilation failure: [ERROR] /home/jesvin/dev/hadoop/HadoopMahoutSkeleton-master/src/main/java/HadoopSkeleton/App.java:[22,8] error: generics are not supported in -source 1.3 [ERROR] [ERROR] (use -source 5 or higher to enable generics) [ERROR] /home/jesvin/dev/hadoop/HadoopMahoutSkeleton-master/src/main/java/HadoopSkeleton/App.java:[53,28] error: for-each loops are not supported in -source

Mahout: adjusted cosine similarity for item based recommender

六月ゝ 毕业季﹏ posted on 2019-12-03 20:29:33
For an assignment I'm supposed to test different types of recommenders, which I have to implement first. I've been looking around for a good library to do that (I had thought about Weka at first) and stumbled upon Mahout. I should therefore point out that: a) I'm completely new to Mahout, b) I do not have a strong background in recommenders or their algorithms (otherwise I wouldn't be taking this class...), and c) sorry, but I'm far from being the best developer in the world ==> I'd appreciate it if you could use layman's terms (as far as possible...) :) I've been following some tutorials (e.g. this