I'm trying to use Apache Spark for document classification.
For example, I have two classes (C and J).
The training data is:
C, Chinese Beijing Chi
There are many classification methods (logistic regression, SVMs, neural networks, LDA, QDA...). You can either implement your own or use MLlib's classification methods (currently, logistic regression and SVM are implemented in MLlib).
What you need to do is transform each document's features into a numeric vector (here, word counts) and encode the labels as doubles.
For example, your dataset will look like:
1, (2,1,0,0,0,0)
1, (2,0,1,0,0,0)
0, (1,0,0,1,0,0)
0, (1,0,0,0,1,1)
And your test vector:
(3,0,0,0,1,1)
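Here is a rough sketch of how that transformation could be done in plain Python, before handing the data to Spark. Everything named here is an assumption for illustration: the vocabulary order in `VOCAB`, the label encoding in `LABELS` (C as 1.0, J as 0.0), the helper names, and the two sample sentences, which were picked only because they produce vectors like the ones above.

```python
from collections import Counter

# Assumed vocabulary order; each vector position counts one of these words.
VOCAB = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]

# Assumed label encoding: class C -> 1.0, class J -> 0.0.
LABELS = {"C": 1.0, "J": 0.0}


def to_count_vector(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    counts = Counter(text.split())
    # Counter returns 0 for words that never occur, so no key check is needed.
    return [float(counts[word]) for word in vocab]


def to_labeled_pair(label, text):
    """Return (label as double, count vector) for one training document."""
    return LABELS[label], to_count_vector(text)


# Hypothetical documents, used only to illustrate the encoding.
test_vector = to_count_vector("Chinese Chinese Chinese Tokyo Japan")
pair = to_labeled_pair("J", "Tokyo Japan Chinese")

print(test_vector)
print(pair)
```

Each (label, vector) pair can then be wrapped in an MLlib `LabeledPoint` and used to train one of the classifiers mentioned above.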
Hope this helps