I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, ipython notebook.
Unseen label is a generic message which doesn't correspond to a specific column. The most likely problem is with the following stage:
StringIndexer(inputCol='lang', outputCol='lang_idx')
with pl-PL present in train("lang") and not present in test("lang").
You can correct it using setHandleInvalid with skip:
from pyspark.ml.feature import StringIndexer
train = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"])
test = sc.parallelize([(3, "foo"), (4, "foobar")]).toDF(["k", "v"])
indexer = StringIndexer(inputCol="v", outputCol="vi")
indexer.fit(train).transform(test).show()
## Py4JJavaError: An error occurred while calling o112.showString.
## : org.apache.spark.SparkException: Job aborted due to stage failure:
## ...
## org.apache.spark.SparkException: Unseen label: foobar.
indexer.setHandleInvalid("skip").fit(train).transform(test).show()
## +---+---+---+
## | k| v| vi|
## +---+---+---+
## | 3|foo|1.0|
## +---+---+---+
or, in Spark 2.2 and later, keep, which assigns unseen labels an extra index instead of dropping the rows:
indexer.setHandleInvalid("keep").fit(train).transform(test).show()
## +---+------+---+
## | k| v| vi|
## +---+------+---+
## | 3| foo|0.0|
## | 4|foobar|2.0|
## +---+------+---+
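To confirm which labels actually trigger the error before choosing between skip and keep, one option is to diff the indexed column between the two DataFrames. This is only a minimal sketch: `unseen_labels` is a hypothetical helper name, not part of Spark, and it assumes `train` and `test` are DataFrames like the ones defined above.

```python
def unseen_labels(train, test, col):
    """Return a DataFrame with the values of `col` found in test but not in train.

    Hypothetical helper (not a Spark API). It relies only on
    DataFrame.select and DataFrame.subtract, which behaves like SQL EXCEPT.
    """
    return test.select(col).subtract(train.select(col))

# Usage with the DataFrames defined above (needs a running SparkContext):
# unseen_labels(train, test, "v").show()
```

Any value this returns would raise Unseen label under the default handleInvalid setting.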
Okay, I think I got this. At least I got it working.
Caching the DataFrame (including the train/test parts) solves the problem. That's what I found in this JIRA issue: https://issues.apache.org/jira/browse/SPARK-12590.
So it's not a bug, just the fact that randomSplit might yield a different result on the same, but differently partitioned, dataset. Apparently some of my munging functions (or the Pipeline) involve a repartition, so recomputing the train set from its definition might give a different result.
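The caching workaround might look roughly like the sketch below. `cached_split` is a hypothetical helper name, and the weights and seed are placeholder values. It uses only `DataFrame.cache`, `DataFrame.count` and `DataFrame.randomSplit`.

```python
def cached_split(df, weights=(0.8, 0.2), seed=42):
    """Cache df, materialize it, then split it (hypothetical helper).

    The idea is that both halves of the split are then derived from the
    same cached partitions instead of from a recomputed lineage.
    """
    df = df.cache()   # mark the DataFrame for caching; returns the same DataFrame
    df.count()        # run an action so the cache is actually populated
    train, test = df.randomSplit(list(weights), seed=seed)
    return train, test

# Usage (assumes `df` is the munged DataFrame, just before the split):
# train, test = cached_split(df)
```

Caching is best effort. If cached partitions are evicted and recomputed, the original inconsistency could presumably reappear, so writing the split DataFrames out and reading them back is a more conservative alternative.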
What still interests me is the reproducibility: it's always the 'pl-PL' row that ends up in the wrong part of the dataset, i.e. the repartitioning isn't random. It's deterministic, just inconsistent. I wonder how exactly it happens.
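One plausible way to get "deterministic, just inconsistent" behaviour is sketched below in plain Python. It is a toy model, not Spark's actual implementation. It assumes each partition draws from its own random generator, seeded from the split seed and the partition index, so a row's fate depends on where it sits. The function name `toy_split` and the two layouts are made up for illustration.

```python
import random

def toy_split(partitions, train_fraction=0.8, seed=42):
    """Toy model of a per-partition seeded split; NOT Spark's actual code."""
    train, test = [], []
    for index, rows in enumerate(partitions):
        rng = random.Random(seed + index)  # one deterministic generator per partition
        for row in rows:
            # Each row consumes the next draw of its partition's generator.
            (train if rng.random() < train_fraction else test).append(row)
    return sorted(train), sorted(test)

rows = list(range(20))
layout_a = [rows[:10], rows[10:]]    # two partitions, contiguous halves
layout_b = [rows[0::2], rows[1::2]]  # same rows, different partitioning

train_a, test_a = toy_split(layout_a)
train_b, test_b = toy_split(layout_b)

print(toy_split(layout_a) == (train_a, test_a))  # True: same layout and seed repeat exactly
print(test_a, test_b)  # the two layouts can put different rows in the test part
```

Under this model every run over a given layout gives the same split, but if the train set is recomputed over one layout and the test set over another, the same row can land in both parts or in neither, every time.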