mahout

Using Mahout in Eclipse WITHOUT USING MAVEN

大城市里の小女人 submitted on 2019-12-07 08:49:16
Question: I really don't want to use Maven because it seems like a massive hassle. Is there any way to just download Mahout and use it in my Eclipse project? All I get from using Maven is build path errors and millions of warnings. I have searched for a way to do this, but people seem pretty set on using Maven all the time.
Answer 1: I'm not set on it. I hate Maven. The problem you'll have with Mahout is that they've decided to use it. If that's the case, you're stuck with it, too.
Answer 2: Actually, I don't

Py4J has bigger overhead than Jython and JPype

泪湿孤枕 submitted on 2019-12-07 05:30:17
Question: After searching for a way to run Java code from a Django application (Python), I found that Py4J is the best option for me. I tried Jython, JPype and a Python subprocess, and each of them has certain limitations. Jython: my app runs in Python. JPype: it is buggy; you can start the JVM only once, and after that it fails to start again. Python subprocess: Java objects cannot be passed between Python and Java, because it is a regular console call. The Py4J web site says: In terms of performance, Py4J has a

Calculate TF-IDF of documents using HBase as the datasource

試著忘記壹切 submitted on 2019-12-06 14:46:09
Question: I want to calculate the TF (term frequency) and the IDF (inverse document frequency) of documents that are stored in HBase. I also want to save the calculated TF in one HBase table and the calculated IDF in another HBase table. Can you guide me through it? I have looked at BayesTfIdfDriver from Mahout 0.4, but I cannot get started.
Answer 1: The outline of a solution is pretty straightforward: do a word count over your HBase tables, storing both term frequency and document frequency for
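The counting step outlined in the answer can be sketched in plain Java. This is a minimal in-memory illustration, not the answerer's code: the class name `TfIdfSketch`, the tokenizer, and the `ln(N / df)` form of IDF are all assumptions, and reading from or writing to HBase is left out entirely.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory sketch; HBase reads and writes are intentionally omitted.
final class TfIdfSketch {

    /** Term frequency: raw count of each term, per document ID. */
    static Map<String, Map<String, Integer>> termFrequencies(Map<String, String> docs) {
        Map<String, Map<String, Integer>> tf = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            // Assumed tokenizer: lower-case, split on non-word characters.
            for (String token : doc.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
            tf.put(doc.getKey(), counts);
        }
        return tf;
    }

    /** Document frequency: number of documents that contain each term. */
    static Map<String, Integer> documentFrequencies(Map<String, Map<String, Integer>> tf) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> counts : tf.values()) {
            for (String term : counts.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** IDF as ln(N / df). This is one common variant, chosen here as an assumption. */
    static double idf(int totalDocs, int docFrequency) {
        return Math.log((double) totalDocs / docFrequency);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("row1", "hbase stores rows");
        docs.put("row2", "mahout reads rows from hbase");

        Map<String, Map<String, Integer>> tf = termFrequencies(docs);
        Map<String, Integer> df = documentFrequencies(tf);
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            System.out.println(e.getKey() + " df=" + e.getValue()
                    + " idf=" + idf(docs.size(), e.getValue()));
        }
    }
}
```

In a real job the two maps would correspond to the two output tables the question asks for, one keyed by document and term, the other keyed by term.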

Cassandra based Mahout user friend recommendations

若如初见. submitted on 2019-12-06 12:20:32
Question: I want to recommend to a user a list of users whom the current user can add as friends. I am using Cassandra and Mahout. There is already an implementation of CassandraDataModel in the Mahout integration package, and I want to use this class. So my recommender class looks as follows: public class UserFriendsRecommender { @Inject private CassandraDataModel dataModel; public List<RecommendedItem> recommend(Long userId, int number) throws TasteException { UserSimilarity userSimilarity = new

What are the steps needed to use the Mahout Naive Bayes Classifier Algorithm?

Deadly submitted on 2019-12-06 11:23:37
Question: I am trying to use the Naive Bayes classifier to detect fraudulent transactions. I have sample data of around 5000 records in an Excel sheet, which I will use to train the classifier, and test data of around 1000 records on which I will test the trained classifier. My problem is that I don't know how to train the classifier. Do I need to transform my training data into some specific format before passing it to the training step? How will the training step know which is my target

Dumping clustering result with vectors names

回眸只為那壹抹淺笑 submitted on 2019-12-06 11:14:08
I have created my vectors as described in this question and have run mahout kmeans on the data. Since I'm using Mahout 0.7, the clusterdump command didn't work as described in Mahout in Action, but I got it to work like this:
    export HADOOP_CLASSPATH=/path/to/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar:/path/to/mahout-distribution-0.7/integration/target/mahout-integration-0.7.jar
    hadoop jar core/target/mahout-core-0.7-job.jar org.apache.mahout.utils.clustering.ClusterDumper -i /clustering/out/clusters-20-final -o textout -of TEXT
and I am getting lines like this one: VL-1383471

Similarity function for Mahout boolean user-based recommender

孤人 submitted on 2019-12-06 06:43:30
I am using Mahout to build a user-based recommendation system that operates on boolean data. I use GenericBooleanPrefUserBasedRecommender and NearestNUserNeighborhood, and am now trying to decide on the most suitable user similarity function. It was suggested that I use either LogLikelihoodSimilarity or TanimotoCoefficientSimilarity. I tried both and am getting [subjectively evaluated] meaningful results in both cases. However, the RMSE rating for the same data set is better with LogLikelihoodSimilarity. The number of "no recommendation" results is similar in both cases. Can anyone recommend which of these similarity
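For orientation, the Tanimoto (Jaccard) coefficient on boolean data is the size of the intersection of two users' item sets divided by the size of their union. The sketch below is a plain-Java illustration of that definition only, assuming each user is represented as a set of item IDs; the class name `TanimotoSketch` is hypothetical and this is not Mahout's implementation. The log-likelihood measure is not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the textbook Tanimoto coefficient.
final class TanimotoSketch {

    /**
     * |A ∩ B| / |A ∪ B| for two users' sets of preferred item IDs.
     * Returns NaN when both sets are empty (assumed convention for "undefined").
     */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN;
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> alice = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> bob = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        // Two shared items out of four distinct items.
        System.out.println(tanimoto(alice, bob)); // prints 0.5
    }
}
```

Because the measure only looks at overlap, two users with many items each and a few in common score low, which is one reason the two similarity functions can rank neighbors differently on the same data.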

How to read a CSV file from Hdfs?

浪子不回头ぞ submitted on 2019-12-06 05:46:06
I have my data in a CSV file that is stored in HDFS, and I want to read it. Can anyone help me with the code? I'm new to Hadoop. Thanks in advance.
The classes required for this are FileSystem, FSDataInputStream and Path. The client should be something like this:
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
        conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(conf);

Running Mahout from the command line (CLASSPATH)

假装没事ソ submitted on 2019-12-06 03:40:45
I compiled Mahout successfully under Windows using Maven. I'm trying to run one of the examples from the command line and I don't understand what I am doing wrong. It seems like a CLASSPATH problem. Let's say I want to run the GroupLensRecommenderEvaluatorRunner example. I go to the folder with the GroupLensRecommenderEvaluatorRunner.class file in it and execute:
    java -cp C:/mahout/core/target/classes;. org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner
It gives me a NoClassDefFoundError for the GroupLensRecommenderEvaluatorRunner class. Is the path for -cp wrong?

An Analysis of Collaborative Filtering Implementations in Recommender Systems

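烈酒焚心 submitted on 2019-12-05 21:38:33
Original blog post; reposting is welcome, but please credit the source: http://my.oschina.net/BreathL/blog/62519
I have been studying Mahout quite a lot recently, especially its collaborative filtering algorithms, so I wrote up the implementation approach and data flow of collaborative filtering. That way, when I later optimize the system, I will have a clear picture of how to optimize it while keeping the data correct.
A brief description of collaborative filtering in recommendation:
First, analyze users' preference behavior to mine the associations between items, or between people.
Second, run certain computations over these associations to estimate how much a person likes an item, i.e. the recommendation score.
Finally, push the items with high recommendation scores to the specific person, which completes one recommendation.
This is only a rough introduction to make the rest easier to follow. An IBM blog post, "A Deep Dive into Recommendation Engine Algorithms: Collaborative Filtering", explains the principles in an accessible yet detailed way, so I will not go into them here.
Collaborative filtering algorithms fall roughly into two categories, item-based and user-based. The distinction is simple: following the logic above, if the relationships you mine are between items, it is item-based collaborative filtering; if they are between users, it is user-based collaborative filtering. Because their implementations differ, I cover them separately. Let's look at the item-based implementation first. I drew a diagram of it myself: the numbers mark the direction in which the data changes (from small to large), and below I analyze the function and implementation of each step.
First, the two major data sources. User preference data: UserID, ItemID, Preference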
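The three steps described above (mine item-to-item associations, turn them into recommendation scores, return the highest-scoring items) can be sketched in plain Java. This is a minimal illustration under stated assumptions: it uses simple co-occurrence counts as the item-to-item association, which the post does not confirm as its method, and the class name `ItemCfSketch` and its methods are hypothetical. The input mirrors the UserID, ItemID, Preference triples as a nested map.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of item-based collaborative filtering; not Mahout code.
final class ItemCfSketch {

    /** Step 1: count how often two items appear together in one user's preferences. */
    static Map<Long, Map<Long, Integer>> cooccurrence(Map<Long, Map<Long, Double>> prefs) {
        Map<Long, Map<Long, Integer>> co = new HashMap<>();
        for (Map<Long, Double> items : prefs.values()) {
            for (Long a : items.keySet()) {
                for (Long b : items.keySet()) {
                    if (!a.equals(b)) {
                        co.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return co;
    }

    /** Step 2: score each unseen item as the sum of (co-occurrence count * user's preference). */
    static Map<Long, Double> scores(long userId,
                                    Map<Long, Map<Long, Double>> prefs,
                                    Map<Long, Map<Long, Integer>> co) {
        Map<Long, Double> seen = prefs.getOrDefault(userId, new HashMap<>());
        Map<Long, Double> result = new HashMap<>();
        for (Map.Entry<Long, Double> pref : seen.entrySet()) {
            Map<Long, Integer> related = co.getOrDefault(pref.getKey(), new HashMap<>());
            for (Map.Entry<Long, Integer> r : related.entrySet()) {
                if (!seen.containsKey(r.getKey())) {
                    result.merge(r.getKey(), r.getValue() * pref.getValue(), Double::sum);
                }
            }
        }
        return result;
    }

    /** Step 3: return up to n item IDs, highest score first. */
    static List<Long> recommend(long userId, int n, Map<Long, Map<Long, Double>> prefs) {
        Map<Long, Double> s = scores(userId, prefs, cooccurrence(prefs));
        List<Long> items = new ArrayList<>(s.keySet());
        items.sort(Comparator.comparingDouble((Long i) -> s.get(i)).reversed());
        return new ArrayList<>(items.subList(0, Math.min(n, items.size())));
    }

    public static void main(String[] args) {
        // UserID -> (ItemID -> Preference); the values are made-up sample data.
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        prefs.computeIfAbsent(1L, k -> new HashMap<>()).put(101L, 5.0);
        prefs.get(1L).put(102L, 3.0);
        prefs.computeIfAbsent(2L, k -> new HashMap<>()).put(101L, 4.0);
        prefs.get(2L).put(102L, 2.0);
        prefs.get(2L).put(103L, 5.0);
        System.out.println(recommend(1L, 3, prefs));
    }
}
```

Co-occurrence is only one possible association measure; a similarity such as Tanimoto or log-likelihood could be substituted in step 1 without changing steps 2 and 3.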