I would like to access the min and max of a specific column from my DataFrame, but I don't have the header of the column, just its number. How should I do this using Scala?
How about getting the column name from the metadata:
import org.apache.spark.sql.functions.{min, max}

val selectedColumnName = df.columns(q) // pull the (q + 1)th column from the columns array
df.agg(min(selectedColumnName), max(selectedColumnName))
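For example, with an illustrative DataFrame (the sample data and the index q below are made up for the sketch):
val df = Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)).toDF("A", "B")
val q = 1 // 0-based index of the target column
val selectedColumnName = df.columns(q) // "B"
df.agg(min(selectedColumnName), max(selectedColumnName)).show
// +------+------+
// |min(B)|max(B)|
// +------+------+
// |  10.0|  30.0|
// +------+------+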
You can use pattern matching while assigning a variable:
import org.apache.spark.sql.functions.{min, max}
import org.apache.spark.sql.Row
val Row(minValue: Double, maxValue: Double) = df.agg(min(q), max(q)).head
where q is either a Column or the name of a column (String), assuming your data type is Double.
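For example (a minimal sketch with made-up data; the pattern match throws a MatchError if the runtime types are not Double):
val df = Seq(1.0, 3.5, 2.2).toDF("value")
val Row(minValue: Double, maxValue: Double) = df.agg(min("value"), max("value")).head
// minValue: Double = 1.0
// maxValue: Double = 3.5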
You can use the column number to extract the column name first (by indexing df.columns), then aggregate using that name:
import org.apache.spark.sql.functions.{min, max}

val df = Seq((2.0, 2.1), (1.2, 1.4)).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: double, B: double]
df.agg(max(df(df.columns(1))), min(df(df.columns(1)))).show
+------+------+
|max(B)|min(B)|
+------+------+
| 2.1| 1.4|
+------+------+
Here is a direct way to get the min and max from a DataFrame using column names:
import org.apache.spark.sql.functions.{min, max}

val df = Seq((1, 2), (3, 4), (5, 6)).toDF("A", "B")
df.show()
/*
+---+---+
| A| B|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
+---+---+
*/
df.agg(min("A"), max("A")).show()
/*
+------+------+
|min(A)|max(A)|
+------+------+
| 1| 5|
+------+------+
*/
If you want to get the min and max values as separate variables, you can convert the result of agg() above into a Row and use Row.getInt(index) to get the column values of the Row.
val min_max = df.agg(min("A"), max("A")).head()
// min_max: org.apache.spark.sql.Row = [1,5]
val col_min = min_max.getInt(0)
// col_min: Int = 1
val col_max = min_max.getInt(1)
// col_max: Int = 5
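Note that the typed getter has to match the column's actual type; on a Double column, use Row.getDouble instead (a small sketch, assuming a hypothetical DataFrame dblDf whose column "A" is Double):
val min_max = dblDf.agg(min("A"), max("A")).head()
val col_min = min_max.getDouble(0) // getInt here would throw a ClassCastException
val col_max = min_max.getDouble(1)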
Hope this will help. You can compute grouped min and max either through Spark SQL or through the DataFrame API:
import org.apache.spark.sql.functions.{min, max, sum}

val sales = sc.parallelize(List(
  ("West",  "Apple",  2.0, 10),
  ("West",  "Apple",  3.0, 15),
  ("West",  "Orange", 5.0, 15),
  ("South", "Orange", 3.0, 9),
  ("South", "Orange", 6.0, 18),
  ("East",  "Milk",   5.0, 5)))

val salesDf = sales.toDF("store", "product", "amount", "quantity")
salesDf.createOrReplaceTempView("sales") // registerTempTable is deprecated since Spark 2.0

val result = spark.sql("SELECT store, product, SUM(amount), MIN(amount), MAX(amount), SUM(quantity) FROM sales GROUP BY store, product")
// OR
salesDf.groupBy("store", "product").agg(min("amount"), max("amount"), sum("amount"), sum("quantity")).show
// output
+-----+-------+-----------+-----------+-----------+-------------+
|store|product|min(amount)|max(amount)|sum(amount)|sum(quantity)|
+-----+-------+-----------+-----------+-----------+-------------+
|South| Orange| 3.0| 6.0| 9.0| 27|
| West| Orange| 5.0| 5.0| 5.0| 15|
| East| Milk| 5.0| 5.0| 5.0| 5|
| West| Apple| 2.0| 3.0| 5.0| 25|
+-----+-------+-----------+-----------+-----------+-------------+
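To tie this back to the original question, the same index-based lookup works here too (a short sketch; index 2 refers to the "amount" column of salesDf above):
val amountCol = salesDf.columns(2) // "amount", resolved from its 0-based position
salesDf.agg(min(amountCol), max(amountCol)).show
// +-----------+-----------+
// |min(amount)|max(amount)|
// +-----------+-----------+
// |        2.0|        6.0|
// +-----------+-----------+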
In Java, we have to explicitly reference org.apache.spark.sql.functions, which provides the implementations of min and max:
import org.apache.spark.sql.functions;

datasetFreq.agg(functions.min("Frequency"), functions.max("Frequency")).show();