Best way to get the max value in a Spark dataframe column

一整个雨季 2020-12-07 10:27

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

13 answers
  • 2020-12-07 10:32

    The below example shows how to get the max value in a Spark dataframe column.

    from pyspark.sql.functions import max
    
    df = sql_context.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
    df.show()
    +---+---+
    |  A|  B|
    +---+---+
    |1.0|4.0|
    |2.0|5.0|
    |3.0|6.0|
    +---+---+
    
    result = df.select([max("A")])
    result.show()
    +------+
    |max(A)|
    +------+
    |   3.0|
    +------+
    
    print(result.collect()[0]['max(A)'])
    3.0
    

    Similarly, min, mean, etc. can be calculated as shown below:

    from pyspark.sql.functions import mean, min, max
    
    result = df.select([mean("A"), min("A"), max("A")])
    result.show()
    +------+------+------+
    |avg(A)|min(A)|max(A)|
    +------+------+------+
    |   2.0|   1.0|   3.0|
    +------+------+------+
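
    To avoid hard-coding the generated column name such as 'max(A)', the aggregate can be aliased. A small sketch, assuming the same df as above (max_ is just a local name chosen here to avoid shadowing Python's built-in max):

    from pyspark.sql.functions import max as max_

    result = df.select(max_("A").alias("max_A"))
    print(result.collect()[0]["max_A"])  # 3.0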
    
  • 2020-12-07 10:33

    The max value for a particular column of a dataframe can be obtained with:

    your_max_value = df.agg({"your-column": "max"}).collect()[0][0]
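
    For example, with a dataframe like the one in the question (a minimal sketch; the column name "A" and the sample data are assumptions):

    df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
    your_max_value = df.agg({"A": "max"}).collect()[0][0]
    print(your_max_value)  # 3.0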

  • 2020-12-07 10:33

    First add the import line:

    from pyspark.sql.functions import min, max

    To find the min value of age in the dataframe:

    df.agg(min("age")).show()
    
    +--------+
    |min(age)|
    +--------+
    |      29|
    +--------+
    

    To find the max value of age in the dataframe:

    df.agg(max("age")).show()
    
    +--------+
    |max(age)|
    +--------+
    |      77|
    +--------+
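
    To get the number itself rather than just displaying it, collect the single-row result. A minimal sketch, assuming the same dataframe with an age column:

    max_age = df.agg(max("age")).collect()[0][0]
    min_age = df.agg(min("age")).first()[0]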
    
  • 2020-12-07 10:33

    In PySpark you can also collect the column and use Python's built-in max. Note that this pulls every value of the column to the driver, so it is only practical for small datasets:

    max(df.select('ColumnName').rdd.flatMap(lambda x: x).collect())
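
    An alternative that keeps the aggregation distributed instead of collecting the whole column (a sketch, assuming the same 'ColumnName' column):

    from pyspark.sql.functions import max as max_
    df.select(max_('ColumnName')).first()[0]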
    
  • 2020-12-07 10:33

    To just get the value, use any of these:

    1. df1.agg({"x": "max"}).collect()[0][0]
    2. df1.agg({"x": "max"}).head()[0]
    3. df1.agg({"x": "max"}).first()[0]

    Alternatively, the same can be done for min:

    from pyspark.sql.functions import min, max
    df1.agg(min("id")).collect()[0][0]
    df1.agg(min("id")).head()[0]
    df1.agg(min("id")).first()[0]
    
  • 2020-12-07 10:35
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    
    val testDataFrame = Seq(
      (1.0, 4.0), (2.0, 5.0), (3.0, 6.0)
    ).toDF("A", "B")
    
    // select the max of each column and read the single result row as a typed tuple
    val (maxA, maxB) = testDataFrame.select(max("A"), max("B"))
      .as[(Double, Double)]
      .first()
    println(maxA, maxB)
    

    The result is (3.0,6.0), the same as testDataFrame.agg(max($"A"), max($"B")).collect()(0). However, collect() returns a Row, which prints as [3.0,6.0].
