Get min and max from a specific column of a Scala Spark DataFrame

梦谈多话 2021-02-01 04:37

I would like to access the min and max of a specific column of my DataFrame, but I don't have the column's header, just its number. How should I do this using Scala?

7 Answers
  • 2021-02-01 04:53

    How about getting the column name from the metadata:

    import org.apache.spark.sql.functions.{min, max}

    val selectedColumnName = df.columns(q) // pull the (q + 1)th column from the columns array
    df.agg(min(selectedColumnName), max(selectedColumnName))
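
    If you need the values themselves rather than a one-row DataFrame, you can collect the result (a minimal sketch, assuming the column holds Doubles; the variable names are just illustrative):

    val minMaxRow = df.agg(min(selectedColumnName), max(selectedColumnName)).head()
    val minValue = minMaxRow.getAs[Double](0) // value of min(selectedColumnName)
    val maxValue = minMaxRow.getAs[Double](1) // value of max(selectedColumnName)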
    
  • 2021-02-01 04:56

    You can use pattern matching when assigning the variables:

    import org.apache.spark.sql.functions.{min, max}
    import org.apache.spark.sql.Row
    
    val Row(minValue: Double, maxValue: Double) = df.agg(min(q), max(q)).head
    

    Here q is either a Column or a column name (String), and the pattern assumes the data type is Double.
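
    Note that min and max evaluate to null on an empty DataFrame, in which case the pattern match above fails with a MatchError (as it does if the column isn't actually Double). A defensive variant (a sketch, wrapping the possibly-null values in Option):

    val row = df.agg(min(q), max(q)).head
    val minValue = Option(row.get(0)) // None if the DataFrame was empty
    val maxValue = Option(row.get(1))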

  • 2021-02-01 04:56

    You can use the column number to look up the column name first (by indexing df.columns), then aggregate using that name:

    import org.apache.spark.sql.functions.{min, max}

    val df = Seq((2.0, 2.1), (1.2, 1.4)).toDF("A", "B")
    // df: org.apache.spark.sql.DataFrame = [A: double, B: double]

    df.agg(max(df(df.columns(1))), min(df(df.columns(1)))).show
    +------+------+
    |max(B)|min(B)|
    +------+------+
    |   2.1|   1.4|
    +------+------+
    
  • 2021-02-01 04:58

    Here is a direct way to get the min and max from a DataFrame when you have the column names:

    import org.apache.spark.sql.functions.{min, max}

    val df = Seq((1, 2), (3, 4), (5, 6)).toDF("A", "B")
    
    df.show()
    /*
    +---+---+
    |  A|  B|
    +---+---+
    |  1|  2|
    |  3|  4|
    |  5|  6|
    +---+---+
    */
    
    df.agg(min("A"), max("A")).show()
    /*
    +------+------+
    |min(A)|max(A)|
    +------+------+
    |     1|     5|
    +------+------+
    */
    

    If you want the min and max values as separate variables, take the first Row of the agg() result with head() and read its columns with Row.getInt(index):

    val min_max = df.agg(min("A"), max("A")).head()
    // min_max: org.apache.spark.sql.Row = [1,5]
    
    val col_min = min_max.getInt(0)
    // col_min: Int = 1
    
    val col_max = min_max.getInt(1)
    // col_max: Int = 5
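
    If you prefer to read the values by name instead of position, you can alias the aggregates first (a small variation on the above; the alias names minA and maxA are just illustrative):

    val row = df.agg(min("A").as("minA"), max("A").as("maxA")).head()
    val colMin = row.getAs[Int]("minA") // 1
    val colMax = row.getAs[Int]("maxA") // 5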
    
  • 2021-02-01 04:58

    Hope this helps:

    import org.apache.spark.sql.functions.{min, max, sum}

    // assumes spark-shell, with spark / sc and the implicits for toDF in scope
    val sales = sc.parallelize(List(
       ("West",  "Apple",  2.0, 10),
       ("West",  "Apple",  3.0, 15),
       ("West",  "Orange", 5.0, 15),
       ("South", "Orange", 3.0, 9),
       ("South", "Orange", 6.0, 18),
       ("East",  "Milk",   5.0, 5)))

    val salesDf = sales.toDF("store", "product", "amount", "quantity")

    salesDf.createOrReplaceTempView("sales") // registerTempTable is deprecated since Spark 2.0

    val result = spark.sql("SELECT store, product, SUM(amount), MIN(amount), MAX(amount), SUM(quantity) FROM sales GROUP BY store, product")

    // OR

    salesDf.groupBy("store", "product").agg(min("amount"), max("amount"), sum("amount"), sum("quantity")).show
    
    
    //output
        +-----+-------+-----------+-----------+-----------+-------------+
        |store|product|min(amount)|max(amount)|sum(amount)|sum(quantity)|
        +-----+-------+-----------+-----------+-----------+-------------+
        |South| Orange|        3.0|        6.0|        9.0|           27|
        | West| Orange|        5.0|        5.0|        5.0|           15|
        | East|   Milk|        5.0|        5.0|        5.0|            5|
        | West|  Apple|        2.0|        3.0|        5.0|           25|
        +-----+-------+-----------+-----------+-----------+-------------+
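
    Since the original question was about a single column, the same temp view answers it directly as well (a sketch reusing the "sales" view registered above):

    spark.sql("SELECT MIN(amount), MAX(amount) FROM sales").show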
    
  • 2021-02-01 05:12

    In Java, you have to reference org.apache.spark.sql.functions explicitly, since it provides the implementations of min and max:

    import org.apache.spark.sql.functions;

    datasetFreq.agg(functions.min("Frequency"), functions.max("Frequency")).show();
    