Replace missing values with mean - Spark Dataframe

青春惊慌失措 2020-11-27 21:55

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark.

3 Answers
  • 2020-11-27 22:10

    For imputing the median (instead of the mean) in PySpark < 2.2:

    # Filter the numeric columns
    num_cols = [c for c, t in df.dtypes if t in {"bigint", "double", "int"}]
    # Compute a dict mapping column name -> median value
    median_dict = dict()
    for c in num_cols:
        median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
    

    Then apply na.fill:

    df_imputed = df.na.fill(median_dict)
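
    As a quick sanity check, here is a toy run (the data and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Toy DataFrame with one missing value per column
    df = spark.createDataFrame(
        [(1.0, 4.0), (2.0, None), (3.0, 6.0), (None, 8.0)], ["a", "b"]
    )

    num_cols = [c for c, t in df.dtypes if t in {"bigint", "double", "int"}]
    median_dict = {c: df.stat.approxQuantile(c, [0.5], 0.001)[0] for c in num_cols}
    df.na.fill(median_dict).show()
    # The null in "a" is filled with 2.0, the null in "b" with 6.0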
    
  • 2020-11-27 22:17

    Spark >= 2.2

    You can use org.apache.spark.ml.feature.Imputer, which supports both mean and median strategies.

    Scala:

    import org.apache.spark.ml.feature.Imputer
    
    val imputer = new Imputer()
      .setInputCols(df.columns)
      .setOutputCols(df.columns.map(c => s"${c}_imputed"))
      .setStrategy("mean")
    
    imputer.fit(df).transform(df)
    

    Python:

    from pyspark.ml.feature import Imputer
    
    imputer = Imputer(
        inputCols=df.columns, 
        outputCols=["{}_imputed".format(c) for c in df.columns]
    )
    imputer.fit(df).transform(df)
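
    The strategy defaults to the mean; to impute the median instead, set it explicitly (note that Imputer operates only on numeric columns):

    imputer = Imputer(
        inputCols=df.columns,
        outputCols=["{}_imputed".format(c) for c in df.columns],
        strategy="median"
    )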
    

    Spark < 2.2

    Here you are:

    import org.apache.spark.sql.functions.mean
    
    df.na.fill(df.columns.zip(
      df.select(df.columns.map(mean(_)): _*).first.toSeq
    ).toMap)
    

    where

    df.columns.map(mean(_)): Array[Column] 
    

    computes an average for each column,

    df.select(_: _*).first.toSeq: Seq[Any]
    

    collects the aggregated values and converts the row to Seq[Any] (I know it is suboptimal, but this is the API we have to work with),

    df.columns.zip(_).toMap: Map[String,Any] 
    

    creates a Map[String, Any] that maps each column name to its average, and finally:

    df.na.fill(_): DataFrame
    

    fills the missing values using:

    fill: Map[String, Any] => DataFrame 
    

    from DataFrameNaFunctions.
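
    The same one-pass mean fill in PySpark, as a sketch (it assumes all columns are numeric and each has at least one non-null value):

    from pyspark.sql.functions import mean

    # Compute every column's mean in a single pass, zip the results with
    # the column names, and feed the resulting dict to na.fill
    means = df.select([mean(c) for c in df.columns]).first()
    df_filled = df.na.fill(dict(zip(df.columns, means)))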

    To ignore NaN entries you can replace:

    df.select(df.columns.map(mean(_)): _*).first.toSeq
    

    with:

    import org.apache.spark.sql.functions.{col, isnan, when}

    df.select(df.columns.map(
      c => mean(when(!isnan(col(c)), col(c)))
    ): _*).first.toSeq
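
    And the NaN-aware variant in PySpark (again a sketch; isnan applies only to float/double columns):

    from pyspark.sql.functions import col, isnan, mean, when

    # Treat NaN as missing: average only the non-NaN entries of each column
    means = df.select([
        mean(when(~isnan(col(c)), col(c))) for c in df.columns
    ]).first()
    df_filled = df.na.fill(dict(zip(df.columns, means)))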
    
  • 2020-11-27 22:29

    For PySpark, this is the code I used:

    mean_dict = {col: 'mean' for col in df.columns}
    col_avgs = df.agg(mean_dict).collect()[0].asDict()
    col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
    df.fillna(col_avgs).show()
    

    The four steps are:

    1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
    2. Calculate the mean for each column, and save it as the dictionary col_avgs
    3. The column names in col_avgs have the form avg(col1); strip the avg( prefix and the trailing ) so the keys match the original column names
    4. Fill the columns of the dataframe with the averages using col_avgs (a toy run follows below)
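
    A minimal end-to-end run of these steps (toy data, hypothetical column names):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0, None), (3.0, 4.0), (None, 8.0)], ["x", "y"])

    mean_dict = {c: 'mean' for c in df.columns}
    col_avgs = df.agg(mean_dict).collect()[0].asDict()    # keys look like 'avg(x)'
    col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}  # strip 'avg(' and ')'
    df.fillna(col_avgs).show()
    # x's null is filled with 2.0, y's null with 6.0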