I'm trying to figure out the best way to get the largest value in a Spark dataframe column.
Consider the following example:
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
The example below shows how to get the max value in a Spark dataframe column.
from pyspark.sql.functions import max
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()
+---+---+
| A| B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+
result = df.select([max("A")])
result.show()
+------+
|max(A)|
+------+
| 3.0|
+------+
print(result.collect()[0]['max(A)'])
3.0
Similarly, min, mean, etc. can be calculated as shown below:
from pyspark.sql.functions import mean, min, max
result = df.select([mean("A"), min("A"), max("A")])
result.show()
+------+------+------+
|avg(A)|min(A)|max(A)|
+------+------+------+
| 2.0| 1.0| 3.0|
+------+------+------+
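If you need the three values as plain Python numbers rather than a displayed dataframe, a minimal sketch (reusing the result dataframe built just above) is:
# first() returns the single Row produced by the aggregation;
# its fields can be unpacked by position or accessed by name, e.g. row['max(A)']
row = result.first()
mean_a, min_a, max_a = row[0], row[1], row[2]
print(mean_a, min_a, max_a)  # 2.0 1.0 3.0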
The max value for a particular column of a dataframe can be obtained with:
your_max_value = df.agg({"your-column": "max"}).collect()[0][0]
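For example, with the sample dataframe from the question (columns "A" and "B"), this dictionary-style aggregation would look roughly like:
# Aggregate column "A" with the built-in "max" function and pull out the single scalar
your_max_value = df.agg({"A": "max"}).collect()[0][0]
print(your_max_value)  # 3.0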
First add the import line:
from pyspark.sql.functions import min, max
df.agg(min("age")).show()
+--------+
|min(age)|
+--------+
| 29|
+--------+
df.agg(max("age")).show()
+--------+
|max(age)|
+--------+
| 77|
+--------+
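If you want both statistics in a single pass over the data, the two aggregations can be combined; a small sketch, assuming the same dataframe with an "age" column as above:
# Compute min and max of "age" in one aggregation and unpack the resulting Row
row = df.agg(min("age"), max("age")).first()
min_age, max_age = row[0], row[1]
print(min_age, max_age)  # 29 77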
In PySpark you can do this:
# Note: this collects every value in the column to the driver before taking the max,
# so the aggregation-based approaches above are preferable for large dataframes
max(df.select('ColumnName').rdd.flatMap(lambda x: x).collect())
To just get the value, use any of these equivalent forms; here they are shown for 'min':
from pyspark.sql.functions import min, max
df1.agg(min("id")).collect()[0][0]
df1.agg(min("id")).head()[0]
df1.agg(min("id")).first()[0]
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF and .as[...]; assumes a SparkSession named spark is in scope
val testDataFrame = Seq(
(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)
).toDF("A", "B")
val (maxA, maxB) = testDataFrame.select(max("A"), max("B"))
.as[(Double, Double)]
.first()
println(maxA, maxB)
And the result is (3.0,6.0), which is the same as testDataFrame.agg(max($"A"), max($"B")).collect()(0). However, testDataFrame.agg(max($"A"), max($"B")).collect()(0) returns a Row, which prints as [3.0,6.0].