How to select all columns that start with a common label

不思量自难忘° 2021-02-13 18:36

I have a DataFrame in Spark 1.6 and want to select only some of its columns. The column names look like this:

colA, colB, colC, colD, colE, colF-0, colF-1, colF-2


        
2 Answers
  • 2021-02-13 19:38

    First grab the column names with df.columns, then filter down to just the names you want with .filter(_.startsWith("colF")). This gives you an Array of Strings. The String overload of select has the signature select(String, String*), so it needs the first name passed separately. Luckily there is also a select(Column*) overload, so convert the Strings into Columns with .map(df(_)), and finally pass the Array of Columns as a vararg with : _*.

    df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
    

    The filter can be made more complex (much as in Pandas), although the result is a rather ugly solution (IMO):

    df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 
    

    If the list of other columns is fixed, you could also concatenate a fixed array of column names with the filtered array:

    df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
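
    The name-selection step in the snippets above is plain Scala, so it can be tried without Spark. A minimal sketch, assuming a hypothetical helper pickColumns (the object and function names are illustrative, not part of Spark):

    ```scala
    object ColumnPicker {
      // Hypothetical helper: keeps the fixed names first, then appends every
      // name from `all` that starts with one of `prefixes`; duplicates are dropped.
      def pickColumns(all: Seq[String], fixed: Seq[String], prefixes: Seq[String]): Seq[String] = {
        val prefixed = all.filter(name => prefixes.exists(p => name.startsWith(p)))
        (fixed ++ prefixed).distinct
      }
    }
    ```

    With a DataFrame df, the result could then be turned into Columns as before, for example df.select(ColumnPicker.pickColumns(df.columns.toSeq, Seq("colA", "colB"), Seq("colF")).map(df(_)) : _*).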
    
  • 2021-02-13 19:38

    I wrote a function that does that. Read the comments to see how it works.

      import org.apache.spark.sql.DataFrame

      /**
        * Given a sequence of prefixes, select the matching columns from a [[DataFrame]]
        * @param columnPrefixes Sequence of prefixes
        * @param dF Incoming [[DataFrame]]
        * @return [[DataFrame]] containing only the columns whose names start with one of the prefixes
        */
      def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
        // True if the given column name starts with any of the provided prefixes
        // (exists returns false when no prefixes are given)
        def colNameStartsWith(colName: String): Boolean =
          columnPrefixes.exists(prefix => colName.startsWith(prefix))
        // Filter the column names against the prefixes
        val columns = dF.columns.filter(colNameStartsWith)
        // Select the filtered columns; mapping to Column also works when nothing matched
        dF.select(columns.map(dF(_)): _*)
      }
    