Spark ALS predictAll returns empty

后端 未结 1 807
余生分开走
余生分开走 2020-12-03 15:45

I have the following Python test code (the arguments to ALS.train are defined elsewhere):

 r1 = (2, 1)
 r2 = (3, 1)
 test = sc.parallelize([r1,          


        
相关标签:
1条回答
  • 2020-12-03 16:36

    There are two basic conditions under which MatrixFactorizationMode.predictAll may return a RDD with lower number of items than the input:

    • user is missing in the training set.
    • product is missing in the training set.

    You can easily reproduce this behavior and check that it is is not dependent on the way how RDD has been created. First lets use example data to build a model:

    from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
    
    def parse(s):
        x, y, z  = s.split(",")
        return Rating(int(x), int(y), float(z))
    
    ratings = (sc.textFile("data/mllib/als/test.data")
      .map(parse)
      .union(sc.parallelize([Rating(1, 5, 4.0)])))
    
    model = ALS.train(ratings, 10, 10)
    

    Next lets see which products and users are present in the training data:

    set(ratings.map(lambda r: r.product).collect())
    ## {1, 2, 3, 4, 5}
    
    set(ratings.map(lambda r: r.user).collect())
    ## {1, 2, 3, 4}
    

    Now lets create test data and check predictions:

    valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
    valid_test
    ## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423
    
    model.predictAll(valid_test).count()
    ## 3
    

    So far so good. Next lets map it using the same logic as in your code:

    valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
    valid_test_
    ## PythonRDD[497] at RDD at PythonRDD.scala:43
    
    model.predictAll(valid_test_).count()
    ## 3
    

    Still fine. Next lets create invalid data and repeat experiment:

    invalid_test = sc.parallelize([
      (2, 6), # No product in the training data
      (6, 1)  # No user in the training data
    ])
    invalid_test 
    ## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423
    
    model.predictAll(invalid_test).count()
    ## 0 
    
    invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
    model.predictAll(invalid_test_).count()
    ## 0
    

    As expected there are no predictions for invalid input.

    Finally you can confirm this is really the case by using ML model which is completely independent in training / prediction from Python code:

    from pyspark.ml.recommendation import ALS as MLALS
    
    model_ml = MLALS(rank=10, maxIter=10).fit(
        ratings.toDF(["user", "item", "rating"])
    )
    model_ml.transform((valid_test + invalid_test).toDF(["user", "item"])).show()
    
    ## +----+----+----------+
    ## |user|item|prediction|
    ## +----+----+----------+
    ## |   6|   1|       NaN|
    ## |   1|   4| 1.0184212|
    ## |   2|   5| 4.0041084|
    ## |   3|   5|0.40498763|
    ## |   2|   6|       NaN|
    ## +----+----+----------+
    

    As you can see no corresponding user / item in the training data means no prediction.

    0 讨论(0)
提交回复
热议问题