StandardScaler returns NaN

Submitted by 时间秒杀一切 on 2021-01-29 17:50:06

Question


env:

spark-1.6.0 with scala-2.10.4

usage:

import org.apache.spark.ml.feature.StandardScaler

// Each row of df: DataFrame = (String, String, Double, Vector) as (id1, id2, label, feature)
val df = sqlContext.read.parquet("data/Labeled.parquet")

// Fit a scaler that divides by the per-feature standard deviation (no mean centering)
val SC = new StandardScaler()
  .setInputCol("feature").setOutputCol("scaled")
  .setWithMean(false).setWithStd(true)
  .fit(df)

// Replace the raw feature column with the scaled one
val scaled = SC.transform(df)
  .drop("feature").withColumnRenamed("scaled", "feature")

The code follows the StandardScaler example here: http://spark.apache.org/docs/latest/ml-features.html#standardscaler

NaN values appear in scaled, as well as in SC.mean and SC.std.

I don't understand why StandardScaler would produce NaN even for the mean, or how to handle this situation. Any advice is appreciated.
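
One quick way to narrow this down is to check which feature dimensions actually come back as NaN in the fitted statistics. A minimal diagnostic sketch, assuming SC is the fitted StandardScalerModel from the code above:

// List the feature indices whose fitted mean/std are NaN.
val nanMeanIdx = SC.mean.toArray.zipWithIndex.collect { case (v, i) if v.isNaN => i }
val nanStdIdx  = SC.std.toArray.zipWithIndex.collect { case (v, i) if v.isNaN => i }
println(s"NaN mean at indices: ${nanMeanIdx.mkString(", ")}")
println(s"NaN std at indices:  ${nanStdIdx.mkString(", ")}")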

The data is 1.6 GiB as Parquet; if anyone needs it, just let me know.

UPDATE:

After going through the code of StandardScaler, this looks like a Double precision problem when MultivariateOnlineSummarizer aggregates the statistics.
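
If the suspicion is that the aggregation blows up on extreme values, one way to check is to compute the per-dimension maximum of the raw feature vectors before scaling. A minimal sketch, assuming the feature column holds org.apache.spark.mllib.linalg.Vector values (as ml pipelines do in Spark 1.6):

import org.apache.spark.mllib.linalg.Vector

// Per-dimension maximum of the raw feature vectors; entries close to Double.MaxValue
// would explain an overflow when the summarizer accumulates sums and sums of squares.
val colMax = df.select("feature").rdd
  .map(_.getAs[Vector](0).toArray)
  .reduce((a, b) => a.zip(b).map { case (x, y) => math.max(x, y) })

println(colMax.zipWithIndex.filter(_._1 > 1e300).mkString(", "))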


Answer 1:


There is a value equal to Double.MaxValue, and when StandardScaler sums the columns, the result overflows.

Simply casting those columns to scala.math.BigDecimal works.

Reference: http://www.scala-lang.org/api/current/index.html#scala.math.BigDecimal
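
If converting to BigDecimal is not convenient, a hedged alternative (not the approach the answer above describes, but a common workaround under the same diagnosis) is to drop or repair the rows whose feature vector contains Double.MaxValue before fitting, so the summation cannot overflow. A sketch, assuming the same df and feature column as in the question; the names hasMaxValue, cleaned, and scalerModel are introduced only for illustration:

import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

// Flag rows whose feature vector contains the sentinel value Double.MaxValue.
val hasMaxValue = udf { v: Vector => v.toArray.exists(_ == Double.MaxValue) }

// Fit the scaler on the cleaned data only.
val cleaned = df.filter(!hasMaxValue(df("feature")))
val scalerModel = new StandardScaler()
  .setInputCol("feature").setOutputCol("scaled")
  .setWithMean(false).setWithStd(true)
  .fit(cleaned)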



Source: https://stackoverflow.com/questions/35573681/standardscaler-returns-nan
