Spark Dataframe select based on column index

面向向阳花 2021-02-06 03:53

How do I select all the columns of a dataframe that have certain indexes in Scala?

For example, if a dataframe has 100 columns, how do I extract only columns (10, 12, 13)?

3 Answers
  •  挽巷 2021-02-06 04:15

    Example: grab the first 14 columns of a Spark DataFrame by index using Scala.

    import org.apache.spark.sql.functions.col
    
    // Take the first 14 column names by index
    val sliceCols = df.columns.slice(0, 14)
    // Map each name to a Column and select those columns in the dataframe
    val subset_df = df.select(sliceCols.map(name => col(name)): _*)
    

    You cannot simply do this (as I tried and failed):

    // Gives array of names by index (first 14 cols for example)
    val sliceCols = df.columns.slice(0, 14)
    // Does not compile: select has no overload that accepts Array[String]
    val subset_df = df.select(sliceCols)
    

    The reason is that select expects Column arguments, so you have to map your Array[String] of names to Array[org.apache.spark.sql.Column] before expanding it with : _*.
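
    As a side note, select also has a varargs overload that takes plain column-name strings (select(col: String, cols: String*)), so a minimal sketch that avoids the Column mapping entirely could look like this, assuming sliceCols is non-empty:

    // Uses Dataset.select(col: String, cols: String*):
    // pass the first name, then splat the rest as varargs
    val sliceCols = df.columns.slice(0, 14)
    val subset_df = df.select(sliceCols.head, sliceCols.tail: _*)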

    Or wrap it in a function using currying (high five to my colleague for this):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    
    // Subsets the dataframe to the columns in the [beg_val, end_val) index range.
    // Currying the DataFrame parameter lets this be passed to Dataset.transform.
    def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
      val sliceCols = df.columns.slice(beg_val, end_val)
      df.select(sliceCols.map(name => col(name)): _*)
    }
    
    // Get the first 25 columns as a subsetted dataframe
    val subset_df: DataFrame = df.transform(subset_frame(0, 25))
    
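    To answer the original question of picking arbitrary, non-contiguous indexes such as 10, 12, and 13, the same name-to-Column mapping works; here is a small sketch along the same lines, where the index list Seq(10, 12, 13) and the name picked_df are hypothetical, and each index is assumed to be within the bounds of df.columns:

    import org.apache.spark.sql.functions.col
    
    // Look up each wanted index in df.columns, then select those columns
    val wantedIndexes = Seq(10, 12, 13)  // hypothetical example indexes
    val picked_df = df.select(wantedIndexes.map(i => col(df.columns(i))): _*)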
