How do I run the Spark decision tree with a categorical feature set using Scala?

你的背包 asked on 2021-02-20 18:37

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work with it.

3 Answers
  •  终归单人心
    2021-02-20 19:28

    LabeledPoint does not support strings. One way to get your data into a LabeledPoint is to split it into multiple columns, assuming your strings are categorical.

    So for example, if you have the following dataset:

    id,String,Intvalue
    1,"a",123
    2,"b",456
    3,"c",789
    4,"a",887
    

    Then you could split the string column, turning each distinct string value into a new column:

    a -> 1,0,0
    b -> 0,1,0
    c -> 0,0,1
    

    As you have 3 distinct string values, you convert the string column into 3 new columns, and each original value is represented by an indicator in these new columns.

    Now your dataset will be

    id,a,b,c,Intvalue
    1,1,0,0,123
    2,0,1,0,456
    3,0,0,1,789
    4,1,0,0,887
    

    You can then cast these values to Double and build your LabeledPoints from them.
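    The splitting step above can be sketched in plain Scala. This is an illustrative sketch: the name `oneHot` and the assumption that the distinct categories are known up front are mine, not from the answer.

    ```scala
    // The distinct categorical values, assumed known in advance.
    val categories = Seq("a", "b", "c")

    // Map one category to its indicator (one-hot) columns as Doubles.
    def oneHot(value: String): Seq[Double] =
      categories.map(c => if (c == value) 1.0 else 0.0)

    // The example dataset from the answer: (id, String, Intvalue).
    val rows = Seq((1, "a", 123), (2, "b", 456), (3, "c", 789), (4, "a", 887))

    // Each row becomes the one-hot columns followed by the numeric column,
    // all as Doubles, ready to feed into a feature vector.
    val features = rows.map { case (_, s, n) => oneHot(s) :+ n.toDouble }
    // features.head == Seq(1.0, 0.0, 0.0, 123.0)
    ```

    From here, each `Seq[Double]` can be wrapped in a dense vector and paired with a label.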

    Another way to convert your strings for a LabeledPoint is to build a list of the distinct values in each column and replace each string with its index in that list. This is generally not recommended for algorithms that treat features as ordered, because in this example dataset it would give:

    a = 0
    b = 1
    c = 2
    

    But in this case such algorithms will treat a as closer to b than to c, an ordering that does not actually exist in the data.
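    That said, since the question asks about `categoricalFeaturesInfo: Map[Int,Int]`: MLlib's `DecisionTree` can consume index-encoded categories directly when you declare them in that map, because the tree then treats the feature as unordered rather than numeric. A minimal sketch, assuming a `SparkContext` named `sc` (e.g. in `spark-shell`) and an illustrative binary label:

    ```scala
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    // Index-encode the string column (a=0, b=1, c=2); the labels here
    // are made up for illustration.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 123.0)), // "a"
      LabeledPoint(1.0, Vectors.dense(1.0, 456.0)), // "b"
      LabeledPoint(1.0, Vectors.dense(2.0, 789.0)), // "c"
      LabeledPoint(0.0, Vectors.dense(0.0, 887.0))  // "a"
    ))

    // Feature 0 is categorical with 3 values; feature 1 (the Int column)
    // is continuous, so it is simply omitted from the map.
    val categoricalFeaturesInfo = Map(0 -> 3)

    val model = DecisionTree.trainClassifier(
      data, numClasses = 2, categoricalFeaturesInfo,
      impurity = "gini", maxDepth = 5, maxBins = 32)
    ```

    With this declaration the tree splits on subsets of {a, b, c} rather than on thresholds over the indices, so the spurious ordering problem described above does not arise.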
