I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
<
I wrote a function that does that. Read the comments to see how it works.
/**
* Given a sequence of prefixes, select suitable columns from [[DataFrame]]
* @param columnPrefixes Sequence of prefixes
* @param dF Incoming [[DataFrame]]
* @return [[DataFrame]] with prefixed columns selected
*/
def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
// Find out if given column name matches any of the provided prefixes
def colNameStartsWith: String => Boolean = (colName: String) =>
columnsPrefix.map(prefix => colName.startsWith(prefix)).reduce(_ || _)
// Filter columns list by checking against given prefixes sequence
val columns = dF.columns.filter(colNameStartsWith)
// Select filtered columns list
dF.select(columns.head, columns.tail:_*)
}