Question
I have a dataset containing 42 features and 1 label. I want to apply the chi-square selection method (ChiSqSelector) from the Spark ML library before running a decision tree for anomaly detection, but I hit this error when applying the chi-square selector:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 11.
Here is my source code:
from pyspark.ml.feature import ChiSqSelector

# Select the single most relevant feature according to the chi-squared test
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="features2", labelCol="label")
result = selector.fit(dfa1).transform(dfa1)
result.show()
Answer 1:
As you can see in the error message, your features column contains a feature with more than 10000 distinct values, and those values look continuous rather than categorical. The chi-squared test can only handle up to 10k categories per feature, and you can't increase this limit:
/**
* Max number of categories when indexing labels and features
*/
private[spark] val maxCategories: Int = 10000
In this case you can use VectorIndexer with a .setMaxCategories() value below 10k to prepare your data. You can try other ways of preparing the data, but ChiSqSelector will keep failing as long as any feature in the vector has more than 10k distinct values. A sketch of the VectorIndexer approach follows below.
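A minimal sketch of that approach in PySpark, assuming (as in the question) that dfa1 has a "features" vector column and a "label" column; the "indexedFeatures" name and the maxCategories value of 20 are illustrative. Note that VectorIndexer only indexes features with at most maxCategories distinct values and passes higher-cardinality ones through as continuous, so a genuinely continuous column may first need binning, e.g. with QuantileDiscretizer:

from pyspark.ml import Pipeline
from pyspark.ml.feature import ChiSqSelector, VectorIndexer

# Index features with few distinct values as categorical; features with
# more than maxCategories distinct values are left as continuous.
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                        maxCategories=20)

# Run the chi-squared selector on the indexed vector instead of the raw one
selector = ChiSqSelector(numTopFeatures=1, featuresCol="indexedFeatures",
                         outputCol="features2", labelCol="label")

pipeline = Pipeline(stages=[indexer, selector])
result = pipeline.fit(dfa1).transform(dfa1)
result.show()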
Source: https://stackoverflow.com/questions/58608274/sparkexception-chi-square-test-expect-factors