Spark Dataframe select based on column index

面向向阳花 2021-02-06 03:53

How do I select the columns of a dataframe at certain indexes in Scala?

For example, if a dataframe has 100 columns and I want to extract only columns (10,12,13

3 Answers
  • 2021-02-06 04:15

    @user6910411's answer works like a charm, and its number of tasks and logical plan are similar to those of my approach below, but I found my approach to be a bit faster.
    I would suggest going with column names rather than column numbers. Column names are much safer and much lighter than numbers. You can use the following solution:

    val colNames = Seq("col1", "col2" ...... "col99", "col100")
    
    val selectColNames = Seq("col1", "col3", .... selected column names ... )
    
    val selectCols = selectColNames.map(name => df.col(name))
    
    val selectedDf = df.select(selectCols: _*)
    

    If you would rather not write out all 100 column names, there is a shortcut too:

    val colNames = df.schema.fieldNames
    
  • 2021-02-06 04:15

    Example: Grab first 14 columns of Spark Dataframe by Index using Scala.

    import org.apache.spark.sql.functions.col
    
    // Gives array of names by index (first 14 cols for example)
    val sliceCols = df.columns.slice(0, 14)
    // Maps names & selects columns in dataframe
    val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
    

    You cannot simply do this (as I tried and failed):

    // Gives array of names by index (first 14 cols for example)
    val sliceCols = df.columns.slice(0, 14)
    // Does not compile: select has no overload that accepts an Array[String]
    val subset_df = df.select(sliceCols)
    

    The reason is that select does not accept an Array[String]: you have to convert it to an Array[org.apache.spark.sql.Column] and expand it with `: _*` in order for the select to work.
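
    As an alternative to converting every name to a Column, select also has a String-based overload, select(col: String, cols: String*), which can be fed by splitting the array into head and tail. The following is only a sketch: the object name, the helper selectSlice, the local SparkSession and the six-column sample dataframe are all assumptions made for illustration.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object SelectByNameSlice {
      // Hypothetical helper: selects the columns in positions [from, until)
      // through the String-based overload select(col: String, cols: String*).
      def selectSlice(df: DataFrame, from: Int, until: Int): DataFrame = {
        val names = df.columns.slice(from, until)
        // The String overload needs at least one name, so reject an empty slice.
        require(names.nonEmpty, s"no columns in range [$from, $until)")
        df.select(names.head, names.tail: _*)
      }

      def main(args: Array[String]): Unit = {
        // Assumed local session, for demonstration only.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("select-slice")
          .getOrCreate()
        import spark.implicits._

        // Sample dataframe; tuple columns are named _1 .. _6 by default.
        val df = Seq((1, 2, 3, 4, 5, 6)).toDF()

        val subset = selectSlice(df, 0, 3)
        println(subset.columns.mkString(", "))

        spark.stop()
      }
    }
    ```

    Both overloads select the same columns, so the choice is mostly a matter of taste; the Column-based form is easier to extend if you later want aliases or expressions.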

    Or wrap it in a curried function (high five to my colleague for this):

    import org.apache.spark.sql.DataFrame

    // Subsets a dataframe to the columns in positions [beg_val, end_val).
    def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
      val sliceCols = df.columns.slice(beg_val, end_val)
      df.select(sliceCols.map(name => col(name)): _*)
    }
    
    // Get first 25 columns as subsetted dataframe
    val subset_df: DataFrame = df.transform(subset_frame(0, 25))
    
  • 2021-02-06 04:22

    You can map over columns:

    import org.apache.spark.sql.functions.col
    
    df.select(colNos map df.columns map col: _*)
    

    or:

    df.select(colNos map (df.columns andThen col): _*)
    

    or:

    df.select(colNos map (col _ compose df.columns): _*)
    

    All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:

    colNos map df.columns 
    

    is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:

    val df = Seq((1, 2, 3, 4, 5, 6)).toDF
    
    val colNos = Seq(0, 3, 5)
    
    df.select(colNos map df.columns map col: _*).explain
    
    == Physical Plan ==
    LocalTableScan [_1#46, _4#49, _6#51]
    
    df.select("_1", "_4", "_6").explain
    
    == Physical Plan ==
    LocalTableScan [_1#46, _4#49, _6#51]
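
    Because the index-to-name mapping is a plain local Array access, an index outside the column range fails on the driver with an ArrayIndexOutOfBoundsException before Spark plans anything. If the indexes come from user input, it may be worth validating them first. A minimal sketch in plain Scala; the object ColumnIndexCheck and the helper namesAt are hypothetical names, and the array below only stands in for df.columns:

    ```scala
    object ColumnIndexCheck {
      // Hypothetical helper: maps zero-based positions to column names.
      // Returns Left with a message listing the offending indexes if any
      // position falls outside the array, otherwise Right with the names.
      def namesAt(columns: Array[String], indices: Seq[Int]): Either[String, Seq[String]] = {
        val bad = indices.filterNot(i => i >= 0 && i < columns.length)
        if (bad.nonEmpty)
          Left(s"indices out of range for ${columns.length} columns: ${bad.mkString(", ")}")
        else
          Right(indices.map(i => columns(i)))
      }

      def main(args: Array[String]): Unit = {
        // Stand-in for df.columns of the six-column example dataframe.
        val columns = Array("_1", "_2", "_3", "_4", "_5", "_6")

        println(namesAt(columns, Seq(0, 3, 5))) // all indexes valid
        println(namesAt(columns, Seq(0, 9)))    // 9 is out of range
      }
    }
    ```

    With a real dataframe, the Right case can be passed on as df.select(names.map(col): _*), which is the same select used in the examples above.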
    