How do I run the Spark decision tree with a categorical feature set using Scala?

北荒 2021-02-20 18:53

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class

3 answers
  •  自闭症患者
    2021-02-20 19:32

    You can first transform categories to numbers, then load data as if all features are numerical.

    When you build a decision tree model in Spark, you only need to tell Spark which features are categorical and what each one's arity is (the number of distinct categories of that feature). You do this by passing a Map[Int, Int] from feature index to arity.

    For example, suppose your data looks like this:

    1,a,add
    2,b,more
    1,c,thinking
    3,a,to
    1,c,me
    

    You can first transform the data into numerical format, replacing each category with an index starting at 0:

    1,0,0
    2,1,1
    1,2,2
    3,0,3
    1,2,4
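
    The encoding above can be sketched in plain Scala. This is only an illustration: CategoricalEncoding and encode are hypothetical names, not part of Spark, and the sketch assumes the data fits in memory and that categories are numbered in order of first appearance.

```scala
// Hypothetical helper (not a Spark API): replaces the values in the chosen
// columns with category indices and reports each column's arity.
object CategoricalEncoding {

  /** Returns (encoded rows, map from column index to arity). */
  def encode(
      rows: Seq[Array[String]],
      categoricalCols: Seq[Int]
  ): (Seq[Array[Double]], Map[Int, Int]) = {
    // For each categorical column, number its distinct values 0, 1, 2, ...
    // in order of first appearance.
    val indexers: Map[Int, Map[String, Int]] = categoricalCols.map { col =>
      col -> rows.map(_(col)).distinct.zipWithIndex.toMap
    }.toMap

    val encoded = rows.map { row =>
      row.zipWithIndex.map { case (value, col) =>
        indexers.get(col) match {
          case Some(index) => index(value).toDouble // categorical: use its index
          case None        => value.toDouble        // numeric: parse as-is
        }
      }
    }

    // Arity = number of distinct categories seen in that column.
    val arities = indexers.map { case (col, index) => col -> index.size }
    (encoded, arities)
  }

  def main(args: Array[String]): Unit = {
    val raw = Seq("1,a,add", "2,b,more", "1,c,thinking", "3,a,to", "1,c,me")
      .map(_.split(","))
    val (encoded, arities) = encode(raw, Seq(1, 2))
    encoded.foreach(row => println(row.mkString(",")))
    println(arities)
  }
}
```

    For the five sample rows, with columns 1 and 2 marked as categorical, this yields the same numeric rows as listed above and the arities 3 and 5.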
    

    In that format you can load the data into Spark. Then, to tell Spark that the second and third columns (feature indices 1 and 2, counting from 0) are categorical, create a map:

    val categoricalFeaturesInfo = Map[Int, Int](1 -> 3, 2 -> 5)
    

    The map says that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. Spark treats these features as categorical when you pass the map to the training function, and any feature not listed in the map is treated as continuous:

    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
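
    Putting the pieces together, a minimal end-to-end sketch with the RDD-based MLlib API might look like the following. The labels, the local master, and the parameter values (numClasses = 2, "gini", maxDepth = 3, maxBins = 32) are assumptions chosen for illustration, since the sample data has no label column; CategoricalTreeExample is a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

object CategoricalTreeExample {

  def train(sc: SparkContext): DecisionTreeModel = {
    // The five encoded rows from the answer. The labels (first argument)
    // are made up for illustration; the original data has none.
    val trainingData = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(2.0, 1.0, 1.0)),
      LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 2.0)),
      LabeledPoint(1.0, Vectors.dense(3.0, 0.0, 3.0)),
      LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 4.0))
    ))

    // Feature 1 has 3 categories, feature 2 has 5; feature 0 is continuous.
    val categoricalFeaturesInfo = Map[Int, Int](1 -> 3, 2 -> 5)

    val numClasses = 2    // assumed: labels are 0.0 and 1.0
    val impurity = "gini" // assumed impurity measure
    val maxDepth = 3      // assumed depth limit
    val maxBins = 32      // must be at least the largest arity (5 here)

    DecisionTree.trainClassifier(
      trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)
  }

  def main(args: Array[String]): Unit = {
    // Assumed: a local Spark master, for trying the example on one machine.
    val conf = new SparkConf()
      .setAppName("CategoricalTreeExample")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val model = train(sc)
      println(model.toDebugString) // prints the learned splits
    } finally {
      sc.stop()
    }
  }
}
```

    Each categorical value must already be an index between 0 and arity - 1, which is why the encoding step comes first.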
    
