Spark Dataframe select based on column index

面向向阳花 · 2021-02-06 03:53

How do I select all the columns of a dataframe that have certain indexes, in Scala?

For example, if a dataframe has 100 columns and I want to extract only columns (10, 12, 13, …)?

3 Answers
  •  小鲜肉 (OP)
     2021-02-06 04:22

    You can map over columns:

    import org.apache.spark.sql.functions.col

    // colNos is a Seq[Int] holding the positions of the columns to keep
    df.select(colNos map df.columns map col: _*)
    

    or:

    df.select(colNos map (df.columns andThen col): _*)
    

    or:

    df.select(colNos map (col _ compose df.columns): _*)
    

    All the methods shown above are equivalent and impose no performance penalty. (The andThen and compose variants type-check because the implicit conversion from Array[String] to Seq[String] lets df.columns act as a function from index to column name.) The following mapping:

    colNos map df.columns 
    

    is just a local Array lookup (constant-time access for each index); in the example below it evaluates to Seq("_1", "_4", "_6"). Choosing between the String-based and Column-based variants of select doesn't affect the execution plan:

    val df = Seq((1, 2, 3, 4, 5, 6)).toDF
    
    val colNos = Seq(0, 3, 5)
    
    df.select(colNos map df.columns map col: _*).explain
    
    == Physical Plan ==
    LocalTableScan [_1#46, _4#49, _6#51]
    
    df.select("_1", "_4", "_6").explain
    
    == Physical Plan ==
    LocalTableScan [_1#46, _4#49, _6#51]
    
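    If you select by position often, the pattern is easy to wrap in a small helper. A minimal sketch; selectByIndex is a hypothetical name, not part of Spark's API:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col

    // Hypothetical helper: select a DataFrame's columns by position.
    def selectByIndex(df: DataFrame, indexes: Seq[Int]): DataFrame = {
      // Each index is resolved locally against df.columns before select runs,
      // so the plan is identical to selecting the names directly.
      val cols: Seq[Column] = indexes.map(i => col(df.columns(i)))
      df.select(cols: _*)
    }

    // Usage, reusing the example above:
    // selectByIndex(df, Seq(0, 3, 5)).explain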
