Question
First, I have some usage history of users' apps.
For example:
user1, app1, 3 (launch times)
user2, app2, 2 (launch times)
user3, app1, 1 (launch times)
I basically have two requirements:
- Recommend apps for every user.
- Recommend similar apps for every app.
So I use the implicit ALS of MLlib on Spark to implement it. At first I just used the original data to train the model, and the result was terrible. I think it may be caused by the range of the launch times, which goes from 1 to the thousands. So I processed the original data into a score, which I think reflects the true situation better and normalizes the values:
score = lt / uMlt + lt / aMlt
score is the processed value used to train the model.
lt is the launch times in the original data.
uMlt is the user's mean launch times in the original data: uMlt = (sum of all launch times of a user) / (number of apps this user ever launched).
aMlt is the app's mean launch times in the original data: aMlt = (sum of all launch times of an app) / (number of users who ever launched this app).
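For reference, this is roughly how the score preprocessing can be written in Spark. It is only a sketch: the input RDD of (userId, appId, launchTimes) triples and the name toScores are illustrative, not part of my real job.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

def toScores(raw: RDD[(Int, Int, Double)]): RDD[Rating] = {
  // uMlt: sum of a user's launch times / number of apps this user launched
  val uMlt = raw.map { case (u, _, lt) => (u, (lt, 1L)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }
  // aMlt: sum of an app's launch times / number of users who ever launched it
  val aMlt = raw.map { case (_, a, lt) => (a, (lt, 1L)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }
  // score = lt / uMlt + lt / aMlt
  raw.map { case (u, a, lt) => (u, (a, lt)) }.join(uMlt)
    .map { case (u, ((a, lt), um)) => (a, (u, lt, um)) }.join(aMlt)
    .map { case (a, ((u, lt, um), am)) => Rating(u, a, lt / um + lt / am) }
}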
Here are some examples of the data after processing:
Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)
After doing this, and aggregating apps that have different package names but are really the same app, the result seems better, but it is still not good enough.
I find that the feature values of the users and products are very small, and most of them are negative.
Here are 3 example lines of the product features, with 10 dimensions per line:
((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,-4.255948056197667E-7,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.dncfcjaobhegbjccdhandkba.huojia)),(-4.769295992446132E-5,-1.7072002810891718E-4,2.1351299074012786E-4,1.6345139010809362E-4,-1.4456869394052774E-4,2.3657752899453044E-4,-4.508546771830879E-5,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(com.tern.rest.pron)),(-1.219763362314552E-5,-2.8371430744300596E-5,2.9869115678593516E-5,2.0747662347275764E-5,-2.0555471564875916E-5,2.632938776514493E-5,2.934047643066151E-6,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))
Here are 3 example lines of the user features, with 10 dimensions per line:
(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,0.0014421759406104684,-9.77902309386991E-5,0.0010355513077229261,-0.0017878251383081079))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,8.947265450842679E-4,0.0014328152174130082,-5.738904583267868E-4))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,7.250450435094535E-4,0.0027060359716415405,-0.0017731878906488419))
So you can imagine how small the values become when I take the dot product of the feature vectors to compute the entries of the user-item matrix.
My questions are:
- Is there any other way to improve the recommendation result?
- Do my features look right, or is something going wrong?
- Is my way of processing the original launch times (converting them to scores) right?
I put some code here. This is definitely a programming question, but it probably can't be solved with just a few lines of code.
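// Train an implicit-feedback ALS model on the preprocessed scores.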
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
print("recommendForAllUser")
val userTopKRdd = recommendForAllUser(model, topN).join(userData.map(x => (x._2._1, x._1))).map {
  case (uid, (appArray, mac)) => {
    (mac, appArray.map {
      case (appId, rating) => {
        val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
        (packageName, rating)
      }
    })
  }
}
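// Write each user's recommendation list to HBase as a comma-separated string of packageName=rating pairs.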
HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val mac = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
})
print("recommendSimilarApp")
println("productFeatures ******")
model.productFeatures.take(1000).map{
case (appId, features) => {
val packageNameList = appIdPackageNameListDict.value.get(appId)
val packageNameListStr = if (packageNameList.isDefined) {
packageNameList.mkString("(", ",", ")")
} else {
"Unknow List"
}
(packageNameListStr, features.mkString("(", ",", ")"))
}
}.foreach(println)
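// Debug output: print the first 1000 user feature vectors.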
println("productFeatures ******")
model.userFeatures.take(1000).map{
case (userId, features) => {
(userId, features.mkString("(", ",", ")"))
}
}.foreach(println)
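// For every app, get the top-N most similar apps and fan the result out to each package name grouped under that app.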
val similarAppRdd = recommendSimilarApp(model, topN).flatMap {
  case (appId, similarAppArray) => {
    val groupedAppList = appIdPackageNameListDict.value.get(appId)
    if (groupedAppList.isDefined) {
      val similarPackageList = similarAppArray.map {
        case (destAppId, rating) => (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
      }
      groupedAppList.get.map(packageName => {
        (packageName, similarPackageList)
      })
    } else {
      Seq.empty
    }
  }
}
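// Write each app's similar-app list to HBase, keyed by package name.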
HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val packageName = x._1
  val products = x._2.map {
    case (pkg, rating) => pkg + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
})
UPDATE:
I found something new about my data after reading the paper "Collaborative Filtering for Implicit Feedback Datasets". My data is far sparser than the IPTV data set described in the paper.
Paper: 300,000 (users), 17,000 (products), 32,000,000 (data points)
Mine: 300,000 (users), 31,000 (products), 700,000 (data points)
So the user-item matrix of the paper's data set is filled to a density of 0.00627 = 32,000,000 / (300,000 * 17,000). My data set's density is only about 0.000075 = 700,000 / (300,000 * 31,000), which means my user-item matrix is roughly 80 times sparser than the paper's.
Could this lead to a bad result? And is there any way to improve it?
Answer 1:
There are two things you should try:
- Standardise your data so that it has zero mean and unit variance per user vector. This is a common step in lots of machine learning. It helps to reduce the effect of outliers, which cause the close-to-zero values you are seeing.
- Remove all users that have only a single app. The only thing you will learn from these users is a slightly better "mean" value for the app scores. They will not help you learn any meaningful relationships, though, which is what you really want. A rough sketch of both steps follows this list.
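As an untested sketch of both steps (assuming ratings is the RDD[Rating] you pass to ALS.trainImplicit; the simple groupBy is just for illustration):
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

def standardisePerUser(ratings: RDD[Rating]): RDD[Rating] = {
  ratings
    .groupBy(_.user)
    .filter { case (_, rs) => rs.size > 1 }  // drop users that only have a single app
    .flatMap { case (_, rs) =>
      val n = rs.size
      val mean = rs.map(_.rating).sum / n
      val stddev = math.sqrt(rs.map(r => math.pow(r.rating - mean, 2)).sum / n)
      // standardise each user's scores to zero mean and unit variance
      rs.map(r => Rating(r.user, r.product, if (stddev > 0) (r.rating - mean) / stddev else 0.0))
    }
}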
Having removed a user from the model, you will lose the ability to get a recommendation for that user directly from the model by providing the user ID. However, they only have a single app rating anyway, so you can instead run a KNN search over the product matrix to find the apps most similar to that user's app, and use those as the recommendations.
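The KNN part could look roughly like this; the helper name and the choice of cosine similarity are mine, not the only option:
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def appsSimilarTo(model: MatrixFactorizationModel, appId: Int, k: Int): Array[(Int, Double)] = {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm > 0) dot / norm else 0.0
  }
  // latent feature vector of the user's single app
  val target = model.productFeatures.lookup(appId).head
  // rank every other app by similarity to that vector
  model.productFeatures
    .filter { case (id, _) => id != appId }
    .map { case (id, features) => (id, cosine(target, features)) }
    .top(k)(Ordering.by[(Int, Double), Double](_._2))
}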
Source: https://stackoverflow.com/questions/35603789/how-to-improve-my-recommendation-result-i-am-using-spark-als-implicit