I am new to Apache Spark and am trying to use its machine learning library (MLlib) to predict some data. My dataset currently has only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569 "364","3",51517.886,5946290 "363","2",55059.838,6097388 "362","1",43780.977,5304694 "361","7",46447.196,5471836 "360","6",50656.121,5849862 "359","5",44494.476,5460289
Here's my code:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # columns are quoted strings; the second-to-last one is the label
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def sanitize(value):
    # strip the surrounding double quotes and convert to float
    return float(value.strip('"'))

# textFile is the RDD of raw CSV lines loaded earlier, e.g. with sc.textFile(...)
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
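For reference, here is what parsePoint produces for the first sample line above (a quick trace in a plain Python 2 shell, assuming the same sanitize definition; the variable names are only for illustration):

line = '"365","4",41401.387,5330569'
split = map(sanitize, line.split(','))  # [365.0, 4.0, 41401.387, 5330569.0]
rev = split.pop(-2)                     # 41401.387 becomes the label
# split is now [365.0, 4.0, 5330569.0], so the resulting point is
# LabeledPoint(41401.387, [365.0, 4.0, 5330569.0])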
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?