Spark Dataframe select based on column index


面向向阳花 2021-02-06 03:53

How do I select the columns of a dataframe at certain indexes in Scala?

For example, if a dataframe has 100 columns and I want to extract only columns (10,12,13

3 Answers

挽巷 (original poster)

2021-02-06 04:15

Example: grab the first 14 columns of a Spark DataFrame by index using Scala.

import org.apache.spark.sql.functions.col

// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)

You cannot simply do this (as I tried and failed):

// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Does not compile: select has no overload that takes an Array[String]
val subset_df = df.select(sliceCols)

The reason is that select does not accept an Array[String]. You have to convert the Array[String] to an Array[org.apache.spark.sql.Column] and expand it with :_* for the select to compile.
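The type mismatch can be sketched as follows. This is only an illustration: the object name SelectOverloads and both helper names are made up for this example and are not part of Spark or the original answer. It relies on the two select overloads I am aware of, one taking Column varargs and one taking a first column name plus String varargs.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helpers, named for illustration only.
object SelectOverloads {
  // Uses select(cols: Column*): map each name to a Column, then expand with : _*
  def selectAsColumns(df: DataFrame, names: Array[String]): DataFrame =
    df.select(names.map(name => col(name)): _*)

  // Uses select(col: String, cols: String*): pass the head and expand the tail.
  // Assumes names is non-empty; names.head throws on an empty array.
  def selectAsNames(df: DataFrame, names: Array[String]): DataFrame =
    df.select(names.head, names.tail: _*)
}
```

Either helper could be called as SelectOverloads.selectAsColumns(df, df.columns.slice(0, 14)). The head/tail form avoids the col import but needs at least one column name.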

Or wrap it in a curried function (high five to my colleague for this):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Subsets a DataFrame to the columns at indexes beg_val (inclusive) to end_val (exclusive).
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}

// Get the first 25 columns as a subsetted dataframe
val subset_df: DataFrame = df.transform(subset_frame(0, 25))
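
The question asks about arbitrary positions such as 10, 12 and 13 rather than a contiguous range, which slice does not cover. A minimal sketch of that case, assuming zero-based indexes; the object name SelectByIndex and the helper selectByIndexes are hypothetical and not from the original answer:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, named for illustration only.
object SelectByIndex {
  // Keeps only the columns at the given zero-based positions, in the order given.
  // An index outside the column range throws ArrayIndexOutOfBoundsException.
  def selectByIndexes(df: DataFrame, indexes: Seq[Int]): DataFrame = {
    val allCols = df.columns                       // Array[String] of all column names
    val picked = indexes.map(i => col(allCols(i))) // look up each name, wrap as Column
    df.select(picked: _*)
  }
}
```

With a 100-column dataframe this might be called as SelectByIndex.selectByIndexes(df, Seq(10, 12, 13)).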
