I am trying to compute accuracy using 5-fold cross-validation with a Random Forest classifier model in Scala, but I am getting the following error while running it:
RandomForestClassifier, like many other ML algorithms, requires specific metadata to be set on the label column, and the label values to be integral values from [0, 1, 2, ..., #classes) represented as doubles. Typically this is handled by an upstream Transformer such as StringIndexer. Since you convert the labels manually, the metadata fields are not set and the classifier cannot confirm that these requirements are satisfied.
val df = Seq(
  (0.0, Vectors.dense(1, 0, 0, 0)),
  (1.0, Vectors.dense(0, 1, 0, 0)),
  (2.0, Vectors.dense(0, 0, 1, 0)),
  (2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")
val rf = new RandomForestClassifier()
// Fitting on a label column without metadata triggers the error:
rf.fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...
You can either re-encode the label column using StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx") // output column name is illustrative
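or set the required metadata manually:
import org.apache.spark.ml.attribute.NominalAttribute
val meta = NominalAttribute.defaultAttr
  .withName("label")
  .withValues("0.0", "1.0", "2.0")
  .toMetadata
df.withColumn("label_meta", $"label".as("", meta))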
Note that labels created using StringIndexer depend on the frequency, not the value:
indexer.fit(df).labels
// Array[String] = Array(2.0, 0.0, 1.0)
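To make the frequency-based ordering concrete, here is a small plain-Python emulation of it. This is only a sketch, not Spark's implementation: the function names are hypothetical, and breaking ties by string order is an assumption made here so that the result is deterministic.

```python
from collections import Counter


def frequency_ordered_labels(values):
    """Return the distinct labels as strings, most frequent first.

    Ties are broken by string order (an assumption of this sketch).
    """
    counts = Counter(str(v) for v in values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_labels(values):
    """Map each value to the position of its label in the frequency ordering."""
    ordering = frequency_ordered_labels(values)
    lookup = {label: float(i) for i, label in enumerate(ordering)}
    return [lookup[str(v)] for v in values]


labels = [0.0, 1.0, 2.0, 2.0]  # the label column from the example above
print(frequency_ordered_labels(labels))  # ['2.0', '0.0', '1.0']
print(index_labels(labels))              # [1.0, 2.0, 0.0, 0.0]
```

The most frequent value, 2.0, receives index 0.0, so the indexed labels no longer match the original numeric values. Keep that in mind when interpreting predictions.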
In Python, metadata fields can be set directly on the schema:
from pyspark.sql.types import StructField, DoubleType
StructField(
    "label", DoubleType(), False,
    {"ml_attr": {
        "name": "label",
        "type": "nominal",
        "vals": ["0.0", "1.0", "2.0"]
    }}
)
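If the number of classes varies, the same metadata dictionary can be generated rather than written by hand. The helper below is a hypothetical sketch (the name `nominal_label_metadata` is not part of PySpark). It only builds a plain dictionary in the shape shown above, on the assumption that the labels are exactly 0.0, 1.0, ..., num_classes - 1.

```python
def nominal_label_metadata(name, num_classes):
    """Build an "ml_attr" metadata dict for labels 0.0 .. num_classes - 1."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return {
        "ml_attr": {
            "name": name,
            "type": "nominal",
            # Labels are integral values represented as doubles.
            "vals": [str(float(i)) for i in range(num_classes)],
        }
    }


print(nominal_label_metadata("label", 3))
```

The resulting dictionary can then be passed as the metadata argument of StructField, as in the snippet above.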