spark.ml StringIndexer throws 'Unseen label' on fit()

2020-11-27 22:09

I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, ipython notebook.

2 Answers
  • 2020-11-27 22:10

    Unseen label is a generic message which doesn't correspond to a specific column. The most likely problem is with the following stage:

    StringIndexer(inputCol='lang', outputCol='lang_idx')
    

    with pl-PL present in test("lang") but not present in train("lang"), i.e. a label that shows up at transform time but was never seen when the indexer was fit.

    You can correct it using setHandleInvalid with skip:

    from pyspark.ml.feature import StringIndexer
    
    train = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"])
    test = sc.parallelize([(3, "foo"), (4, "foobar")]).toDF(["k", "v"])
    
    indexer = StringIndexer(inputCol="v", outputCol="vi")
    indexer.fit(train).transform(test).show()
    
    ## Py4JJavaError: An error occurred while calling o112.showString.
    ## : org.apache.spark.SparkException: Job aborted due to stage failure: 
    ##   ...
    ##   org.apache.spark.SparkException: Unseen label: foobar.
    
    indexer.setHandleInvalid("skip").fit(train).transform(test).show()
    
    ## +---+---+---+
    ## |  k|  v| vi|
    ## +---+---+---+
    ## |  3|foo|1.0|
    ## +---+---+---+
    

    or, in more recent versions (Spark 2.2+), keep:

    indexer.setHandleInvalid("keep").fit(train).transform(test).show()
    
    ## +---+------+---+
    ## |  k|     v| vi|
    ## +---+------+---+
    ## |  3|   foo|0.0|
    ## |  4|foobar|2.0|
    ## +---+------+---+
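
    The same handleInvalid option can also be set at construction time and used inside a Pipeline. A minimal sketch, assuming Spark 2.2+ (where "keep" is available) and train/test DataFrames that actually contain the lang column from the stage quoted above:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer
    
    # Sketch only: assumes Spark 2.2+ and train/test DataFrames with a 'lang' column.
    lang_indexer = StringIndexer(inputCol="lang", outputCol="lang_idx",
                                 handleInvalid="keep")
    pipeline = Pipeline(stages=[lang_indexer])
    
    # fit() learns the label-to-index mapping from train only; with "keep",
    # labels unseen during fit (e.g. pl-PL) get an extra index instead of an error.
    model = pipeline.fit(train)
    model.transform(test).show()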
    
  • 2020-11-27 22:32

    Okay I think I got this. At least I got this working.

    Caching the DataFrame (including the train/test parts) solves the problem (see the sketch below). That's what I found in this JIRA issue: https://issues.apache.org/jira/browse/SPARK-12590.

    So it's not a bug, just the fact that randomSplit may yield a different result on the same, but differently partitioned, dataset. And apparently some of my munging functions (or the Pipeline) involve a repartition, so recomputing the train set from its definition can diverge from the original.

    What still interests me is the reproducibility: it's always the 'pl-PL' row that ends up in the wrong part of the dataset, i.e. it's not a random repartition. It's deterministic, just inconsistent. I wonder how exactly it happens.
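
    Here's a minimal sketch of the fix; 'df', the split weights and the seed are placeholders, not my actual pipeline:

    from pyspark.ml.feature import StringIndexer
    
    # Sketch only: 'df', the split weights and the seed are placeholders.
    # Caching pins the data, so both splits stay consistent across recomputations.
    train, test = df.cache().randomSplit([0.8, 0.2], seed=42)
    
    # Without cache(), an action may recompute 'df' from its lineage; if that
    # lineage repartitions the data, randomSplit can route rows (e.g. the pl-PL
    # row) to different splits, so the indexer fit on train later meets a label
    # it never saw.
    indexer = StringIndexer(inputCol="lang", outputCol="lang_idx")
    model = indexer.fit(train)
    model.transform(test).show()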
