Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

df = spark.createDataFra         

    Here is a lazy way of doing this, by just computing table statistics:

    Query = "ANALYZE TABLE sampleStats COMPUTE STATISTICS FOR COLUMNS " + ','.join(df.columns)
    spark.sql(Query)

    spark.sql("Select * from sampleStats").describe('ColName').show()

    or you can open a Hive shell and run

    describe formatted table sampleStats;

    You will see the statistics in the properties - min, max, distinct, nulls, etc.

    Remark: Spark is intended to work on Big Data (distributed computing). The example DataFrame is very small, so the ranking on real-life data may differ from the ranking on this small example.

    Slowest: Method_1, because .describe("A") calculates min, max, mean, stddev, and count (5 calculations over the whole column).

    Medium: Method_4, because .rdd (the DataFrame-to-RDD transformation) slows down the process.

    Faster: Method_3 ~ Method_2 ~ Method_5, because the logic is very similar, so Spark's Catalyst optimizer produces very similar plans with a minimal number of operations (get the max of a particular column, collect a single-value dataframe). .asDict() adds a little extra time to Methods 2 and 3 compared with Method 5.

    import time
    import numpy as np
    import pandas as pd

    # Assumes an existing SparkSession named `spark`
    time_dict = {}
    dfff = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
    #--  For a bigger/more realistic dataframe, just uncomment the following 3 lines
    #lst = list(np.random.normal(0.0, 100.0, 100000))
    #pdf = pd.DataFrame({'A': lst, 'B': lst, 'C': lst, 'D': lst})
    #dfff = spark.createDataFrame(pdf)
    # Register the dataframe under the name that Method 2 queries
    dfff.createOrReplaceTempView("df_table")
    tic1 = int(round(time.time() * 1000))
    # Method 1: Use describe()
    max_val = float(dfff.describe("A").filter("summary = 'max'").select("A").collect()[0].asDict()['A'])
    tac1 = int(round(time.time() * 1000))
    time_dict['m1'] = tac1 - tic1
    print(max_val)
    tic2 = int(round(time.time() * 1000))
    # Method 2: Use SQL
    max_val = spark.sql("SELECT MAX(A) as maxval FROM df_table").collect()[0].asDict()['maxval']
    tac2 = int(round(time.time() * 1000))
    time_dict['m2'] = tac2 - tic2
    print(max_val)
    tic3 = int(round(time.time() * 1000))
    # Method 3: Use groupby()
    max_val = dfff.groupby().max('A').collect()[0].asDict()['max(A)']
    tac3 = int(round(time.time() * 1000))
    time_dict['m3'] = tac3 - tic3
    print(max_val)
    tic4 = int(round(time.time() * 1000))
    # Method 4: Convert to RDD
    max_val = dfff.select("A").rdd.max()[0]
    tac4 = int(round(time.time() * 1000))
    time_dict['m4'] = tac4 - tic4
    print(max_val)
    tic5 = int(round(time.time() * 1000))
    # Method 5: Use agg()
    max_val = dfff.agg({"A": "max"}).collect()[0][0]
    tac5 = int(round(time.time() * 1000))
    time_dict['m5'] = tac5 - tic5
    print(max_val)
    print(time_dict)

    Result on an edge-node of a cluster in milliseconds (ms):

    small DF (ms) : {'m1': 7096, 'm2': 205, 'm3': 165, 'm4': 211, 'm5': 180}

    bigger DF (ms): {'m1': 10260, 'm2': 452, 'm3': 465, 'm4': 916, 'm5': 373}
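
The tic/tac pattern above repeats for every method, so it could be folded into a small helper. The following is only a sketch: `time_call` and `benchmark` are hypothetical names that do not appear in the answer, and the workload here is a plain Python list so that the snippet runs without a cluster. With Spark, each lambda would wrap one of the methods, for example `lambda: dfff.agg({"A": "max"}).collect()[0][0]`.

```python
import time


def time_call(fn):
    """Run fn() once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def benchmark(methods):
    """Time each zero-argument callable in `methods` (a name -> callable dict)."""
    results, timings = {}, {}
    for name, fn in methods.items():
        results[name], timings[name] = time_call(fn)
    return results, timings


# Stand-in workload (an assumption for illustration, not Spark code).
data = [1.0, 2.0, 3.0]
results, timings = benchmark({
    "builtin_max": lambda: max(data),
    "sorted_last": lambda: sorted(data)[-1],
})
print(results)
print(timings)
```

A single run per method, as in the answer, also includes one-time warm-up costs, so repeating each call several times and comparing the later runs would likely give a fairer ranking.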

    I believe the best solution will be using head()

    Considering your example:

    +---+---+
    |  A|  B|
    +---+---+
    |1.0|4.0|
    |2.0|5.0|
    |3.0|6.0|
    +---+---+

    Using agg with PySpark's max function, we can get the value as follows:

    from pyspark.sql.functions import max
    df.agg(max(df.A)).head()[0]

    This will return: 3.0

    Make sure you have the correct import:
    from pyspark.sql.functions import max
    The max function we use here is the PySpark SQL library function, not Python's built-in max function.

    In case someone wonders how to do it using Scala (using Spark 2.0+), here you go:

    scala> df.createOrReplaceTempView("TEMP_DF")
    scala> val myMax = spark.sql("SELECT MAX(x) as maxval FROM TEMP_DF").
    scala> print(myMax)
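
    +-----+--------------------+--------+----------+-----------+
    |floor|           timestamp|     uid|         x|          y|
    +-----+--------------------+--------+----------+-----------+
    |    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
    |    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
    |    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
    |    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|
    +-----+--------------------+--------+----------+-----------+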
    >row1 = df1.agg({"x": "max"}).collect()[0]
    >print(row1)
    >print(row1["max(x)"])

    This answer is almost the same as Method 3, but it seems the "asDict()" call in Method 3 can be removed, because a Row can be indexed by column name directly.

    I used another solution (by @satprem rath) already present in this thread.

    To find the min value of age in the dataframe:

    |      29|

    edit: to add more context.

    While the above method printed the result, I faced issues when assigning the result to a variable to reuse later.

    Hence, to get only the int value assigned to a variable:

    from pyspark.sql.functions import max, min  
    maxValueA = df.agg(max("A")).collect()[0][0]
    maxValueB = df.agg(max("B")).collect()[0][0]
