spark.ml StringIndexer throws 'Unseen label' on fit()

前端未结


I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, ipython notebook.


2 answers

遇见更好的自我

2020-11-27 22:10

Unseen label is a generic message that doesn't name a specific column. The most likely culprit is the following stage:

StringIndexer(inputCol='lang', outputCol='lang_idx')

with pl-PL present in train("lang") and not present in test("lang").

You can work around it by calling setHandleInvalid with "skip", which drops rows whose label was not seen during fit:

from pyspark.ml.feature import StringIndexer

train = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"])
test = sc.parallelize([(3, "foo"), (4, "foobar")]).toDF(["k", "v"])

indexer = StringIndexer(inputCol="v", outputCol="vi")
indexer.fit(train).transform(test).show()

## Py4JJavaError: An error occurred while calling o112.showString.
## : org.apache.spark.SparkException: Job aborted due to stage failure: 
##   ...
##   org.apache.spark.SparkException: Unseen label: foobar.

indexer.setHandleInvalid("skip").fit(train).transform(test).show()

## +---+---+---+
## |  k|  v| vi|
## +---+---+---+
## |  3|foo|1.0|
## +---+---+---+

or, in Spark 2.2 and later, "keep", which puts unseen labels into an extra index after the known ones:

indexer.setHandleInvalid("keep").fit(train).transform(test).show()

## +---+------+---+
## |  k|     v| vi|
## +---+------+---+
## |  3|   foo|0.0|
## |  4|foobar|2.0|
## +---+------+---+
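
To make the three modes easier to compare, here is a minimal pure-Python sketch of the behaviour described above. It is only a model, not Spark's implementation: fit_labels and transform are hypothetical helper names, and breaking frequency ties alphabetically is an assumption (the actual tie order can differ between Spark versions, which would explain the different indices for foo in the two outputs above).

```python
from collections import Counter


def fit_labels(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here; that is an assumption of this
    sketch, not a guarantee made by Spark.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(values, index, handle_invalid="error"):
    """Apply a fitted index to new values under one of the three modes."""
    out = []
    for value in values:
        if value in index:
            out.append((value, index[value]))
        elif handle_invalid == "skip":
            continue  # drop the row with the unseen label
        elif handle_invalid == "keep":
            out.append((value, float(len(index))))  # extra bucket after known labels
        else:
            raise ValueError("Unseen label: %s." % value)
    return out


index = fit_labels(["foo", "bar"])                      # "train" column
skipped = transform(["foo", "foobar"], index, "skip")   # "test" column
kept = transform(["foo", "foobar"], index, "keep")
```

With the default mode, transform(["foo", "foobar"], index) raises, which mirrors the Unseen label failure shown earlier.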


长发绾君心

2020-11-27 22:32

Okay I think I got this. At least I got this working.

Caching the dataframe (including the train/test parts) solves the problem. That's what I found in this JIRA issue: https://issues.apache.org/jira/browse/SPARK-12590.

So it's not a bug, just the fact that randomSplit might yield a different result on the same, but differently partitioned, dataset. Apparently some of my munging functions (or the Pipeline) involve a repartition, so recomputing the train set from its definition can produce different rows than the first time.

What still interests me is the reproducibility: it's always the 'pl-PL' row that ends up in the wrong part of the dataset, i.e. the repartition isn't random. It's deterministic, just inconsistent. I wonder how exactly it happens.
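
One way to picture "deterministic, just inconsistent" is the toy model below. It assumes the split in question is DataFrame.randomSplit and that each partition is sampled with its own generator seeded from the base seed plus the partition index. The random_split helper and the sample rows are hypothetical, and this is a pure-Python sketch rather than Spark's actual sampler.

```python
import random


def random_split(partitions, weight, seed):
    """Toy split: every partition gets its own RNG seeded by seed + index."""
    train, test = [], []
    for i, rows in enumerate(partitions):
        rng = random.Random(seed + i)
        for row in rows:
            # The draw a row receives depends on its partition and its
            # position inside that partition, not on the row itself.
            (train if rng.random() < weight else test).append(row)
    return sorted(train), sorted(test)


rows = ["en-US", "de-DE", "pl-PL", "fr-FR", "es-ES", "it-IT"]
two_parts = [rows[:3], rows[3:]]
three_parts = [rows[:2], rows[2:4], rows[4:]]

a = random_split(two_parts, 0.5, seed=42)    # first computation
b = random_split(two_parts, 0.5, seed=42)    # same layout: identical split
c = random_split(three_parts, 0.5, seed=42)  # new layout: split may differ
```

Under this model the same seed and layout always give the same split, but a different layout can move a given row to the other side, and it moves the same row every time. Caching before splitting, for example with DataFrame.cache(), pins one materialization so that later stages see the same rows.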
