I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class
You can first transform categories to numbers, then load data as if all features are numerical.
When you build a decision tree model in Spark, you just need to tell Spark which features are categorical and what each feature's arity is (the number of distinct categories of that feature) by specifying a Map[Int, Int] from feature index to arity.
For example, if you have this data:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
You can first transform the data into a numerical format:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
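The transformation above can be sketched in plain Scala, without Spark. This is only one possible approach: it assumes indices are assigned in order of first appearance, and the names IndexEncodingSketch, raw, categoricalCols, indexers and encoded are made up for illustration. It also counts the distinct values per column, which is the arity mentioned earlier.

```scala
object IndexEncodingSketch {
  // Raw rows from the example above: one numeric column, two string columns.
  val raw: Seq[Seq[String]] = Seq(
    Seq("1", "a", "add"),
    Seq("2", "b", "more"),
    Seq("1", "c", "thinking"),
    Seq("3", "a", "to"),
    Seq("1", "c", "me")
  )

  // Assumption for this sketch: columns 1 and 2 hold categories.
  val categoricalCols: Seq[Int] = Seq(1, 2)

  // For each categorical column, map every distinct value to an index,
  // assigned in order of first appearance.
  val indexers: Map[Int, Map[String, Int]] =
    categoricalCols.map { c =>
      c -> raw.map(_(c)).distinct.zipWithIndex.toMap
    }.toMap

  // Replace category strings with their index; parse other columns as Double.
  val encoded: Seq[Array[Double]] = raw.map { row =>
    row.zipWithIndex.map { case (value, c) =>
      indexers.get(c).map(_(value).toDouble).getOrElse(value.toDouble)
    }.toArray
  }

  // Arity of each categorical feature = number of distinct values seen.
  val categoricalFeaturesInfo: Map[Int, Int] =
    indexers.map { case (c, index) => c -> index.size }
}
```

Here encoded reproduces the numeric rows shown above, and categoricalFeaturesInfo comes out as a map from column 1 to 3 and column 2 to 5.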
In that format you can load the data into Spark. Then, if you want to tell Spark that the second and third columns are categorical, you should create a map:
val categoricalFeaturesInfo = Map[Int, Int]((1,3),(2,5))
The map tells Spark that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. They will be treated as categorical when you build a decision tree model by passing that map as a parameter to the training function:
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
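Putting the pieces together, a minimal end-to-end sketch might look like the following. It needs Spark MLlib on the classpath and runs with a local master. The labels (0.0 / 1.0) are hypothetical, since the example data has no label column, and the values chosen for numClasses, impurity, maxDepth and maxBins are illustrative assumptions rather than recommendations. Note that maxBins has to be at least as large as the largest arity in the map.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object TrainWithCategoricalFeatures {
  // Trains a small tree and returns its node count.
  def run(): Int = {
    val sc = new SparkContext(
      new SparkConf().setAppName("categorical-tree").setMaster("local[*]"))
    try {
      // Numeric rows from the example above; the labels are made up.
      val trainingData = sc.parallelize(Seq(
        LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
        LabeledPoint(1.0, Vectors.dense(2.0, 1.0, 1.0)),
        LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 2.0)),
        LabeledPoint(1.0, Vectors.dense(3.0, 0.0, 3.0)),
        LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 4.0))
      ))

      // Feature 1 has 3 categories, feature 2 has 5; feature 0 stays numeric.
      val categoricalFeaturesInfo = Map[Int, Int](1 -> 3, 2 -> 5)

      // Illustrative settings (assumptions for this sketch).
      val numClasses = 2
      val impurity = "gini"
      val maxDepth = 5
      val maxBins = 32

      val model = DecisionTree.trainClassifier(
        trainingData, numClasses, categoricalFeaturesInfo,
        impurity, maxDepth, maxBins)

      println(model.toDebugString) // print the learned splits
      model.numNodes
    } finally {
      sc.stop()
    }
  }
}
```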
Strings are not supported by LabeledPoint. One way to get them into a LabeledPoint is to split your data into multiple columns, assuming your strings are categorical.
So for example, if you have the following dataset:
id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887
Then you could split your string data, turning each distinct string value into a new column:
a -> 1,0,0
b -> 0,1,0
c -> 0,0,1
As you have 3 distinct string values, you will convert your string column into 3 new columns, and each string will be represented by a 1 in its own column and 0 in the others.
Now your dataset will be:
id,a,b,c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887
You can now convert these values to Double and use them in your LabeledPoint.
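The column-splitting step described above could be sketched in plain Scala as follows. The names OneHotSketch, rows, categories, oneHot and encoded are hypothetical, and the sketch assumes the indicator columns are ordered by sorting the distinct strings.

```scala
object OneHotSketch {
  // (id, string, int value) rows from the example dataset above.
  val rows: Seq[(Int, String, Int)] = Seq(
    (1, "a", 123), (2, "b", 456), (3, "c", 789), (4, "a", 887))

  // Sorted distinct strings fix the order of the indicator columns.
  val categories: Seq[String] = rows.map(_._2).distinct.sorted

  // One indicator per category: 1.0 where it matches the value, else 0.0.
  def oneHot(value: String): Seq[Double] =
    categories.map(c => if (c == value) 1.0 else 0.0)

  // Keep the id aside and emit the remaining columns as Double features.
  val encoded: Seq[(Int, Array[Double])] = rows.map { case (id, s, n) =>
    id -> (oneHot(s) :+ n.toDouble).toArray
  }
}
```

Each Array[Double] in encoded can then be wrapped with Vectors.dense and paired with a label to form a LabeledPoint.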
Another way to convert your strings for a LabeledPoint is to create a list of the distinct values in each column and replace each string with its index in that list. This is not recommended, because for this example dataset the mapping would be:
a = 0
b = 1
c = 2
In this case, though, the algorithm will treat a as closer to b than to c, a relationship the data gives no basis for.
You need to check the element type of array x. According to the error log, the items in array x are strings, which are not supported here. Spark Vectors can currently only hold Double values.
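If x really does hold numeric strings, one way to get Double values is to parse them before building the vector. This is a minimal sketch: the sample contents of x and the helper toDoubles are assumptions, since the original array is not shown. Non-numeric strings make it return None instead of throwing.

```scala
import scala.util.Try

object ParseFeaturesSketch {
  // Hypothetical stand-in for the array x from the question.
  val x: Array[String] = Array("1", "2.5", "3")

  // Parse every element as Double; None if any element is not a number.
  def toDoubles(values: Array[String]): Option[Array[Double]] =
    Try(values.map(_.trim.toDouble)).toOption
}
```

A successful result can be passed to Vectors.dense. A None result means some column is a real string and needs one of the categorical encodings discussed above.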