Best way to get the max value in a Spark dataframe column

一整个雨季 · 2020-12-07 10:27

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()

+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

13 Answers
  • 2020-12-07 10:36

    Here is a lazy way of doing this, by just running COMPUTE STATISTICS on the table:

    # Save the DataFrame as a table, then compute column statistics for all columns.
    df.write.mode("overwrite").saveAsTable("sampleStats")
    query = "ANALYZE TABLE sampleStats COMPUTE STATISTICS FOR COLUMNS " + ','.join(df.columns)
    spark.sql(query)

    df.describe('ColName').show()


    or

    spark.sql("SELECT * FROM sampleStats").describe('ColName').show()
    

    or you can open a Hive shell and run:

    describe formatted sampleStats;
    

    You will see the statistics in the properties - min, max, distinct, nulls, etc.
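
    To read those statistics back from PySpark instead of a Hive shell, a minimal sketch (assuming Spark 2.3+, where column-level DESCRIBE is supported; the column name A is taken from the question's example):

    # Column statistics become queryable once ANALYZE ... FOR COLUMNS has run.
    spark.sql("ANALYZE TABLE sampleStats COMPUTE STATISTICS FOR COLUMNS A")
    # Prints min, max, num_nulls, distinct_count, etc. (exact layout varies by Spark version).
    spark.sql("DESCRIBE EXTENDED sampleStats A").show(truncate=False)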

  • 2020-12-07 10:40

    Remark: Spark is intended to work on Big Data (distributed computing). The example DataFrame here is very small, so the relative ordering of the methods on real-life data can differ from what this toy example suggests.

    Slowest: Method_1, because .describe("A") calculates min, max, mean, stddev, and count (5 calculations over the whole column)

    Medium: Method_4, because the .rdd conversion (DataFrame to RDD) slows down the process.

    Fastest: Method_3 ~ Method_2 ~ Method_5, because the logic is very similar, so Spark's Catalyst optimizer produces very similar plans with a minimal number of operations (get the max of a particular column, collect a single-value DataFrame); .asDict() adds a little extra time to Methods 2 and 3 compared to Method 5. See the plan-comparison sketch after the timings below.

    import pandas as pd
    import time

    time_dict = {}

    dfff = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
    # -- For a bigger/more realistic dataframe, just uncomment the following 4 lines
    # import numpy as np
    # lst = list(np.random.normal(0.0, 100.0, 100000))
    # pdf = pd.DataFrame({'A': lst, 'B': lst, 'C': lst, 'D': lst})
    # dfff = spark.createDataFrame(pdf)

    tic1 = int(round(time.time() * 1000))
    # Method 1: Use describe()
    max_val = float(dfff.describe("A").filter("summary = 'max'").select("A").collect()[0].asDict()['A'])
    tac1 = int(round(time.time() * 1000))
    time_dict['m1'] = tac1 - tic1
    print(max_val)

    tic2 = int(round(time.time() * 1000))
    # Method 2: Use SQL
    dfff.registerTempTable("df_table")  # createOrReplaceTempView in newer Spark versions
    max_val = spark.sql("SELECT MAX(A) as maxval FROM df_table").collect()[0].asDict()['maxval']
    tac2 = int(round(time.time() * 1000))
    time_dict['m2'] = tac2 - tic2
    print(max_val)

    tic3 = int(round(time.time() * 1000))
    # Method 3: Use groupby()
    max_val = dfff.groupby().max('A').collect()[0].asDict()['max(A)']
    tac3 = int(round(time.time() * 1000))
    time_dict['m3'] = tac3 - tic3
    print(max_val)

    tic4 = int(round(time.time() * 1000))
    # Method 4: Convert to RDD
    max_val = dfff.select("A").rdd.max()[0]
    tac4 = int(round(time.time() * 1000))
    time_dict['m4'] = tac4 - tic4
    print(max_val)

    tic5 = int(round(time.time() * 1000))
    # Method 5: Use agg()
    max_val = dfff.agg({"A": "max"}).collect()[0][0]
    tac5 = int(round(time.time() * 1000))
    time_dict['m5'] = tac5 - tic5
    print(max_val)

    print(time_dict)
    

    Results on an edge node of a cluster, in milliseconds (ms):

    small DF (ms) : {'m1': 7096, 'm2': 205, 'm3': 165, 'm4': 211, 'm5': 180}

    bigger DF (ms): {'m1': 10260, 'm2': 452, 'm3': 465, 'm4': 916, 'm5': 373}
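
    To sanity-check the claim that Methods 2, 3, and 5 collapse to essentially the same plan under Catalyst, you can compare their explain() output; a minimal sketch reusing dfff and the df_table view registered above:

    # Each call prints the physical plan; for these three methods the plan should be
    # essentially the same single aggregate over column A.
    dfff.groupby().max("A").explain()                              # Method 3
    dfff.agg({"A": "max"}).explain()                               # Method 5
    spark.sql("SELECT MAX(A) AS maxval FROM df_table").explain()   # Method 2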

  • 2020-12-07 10:40

    I believe the best solution is to use head().

    Considering your example:

    +---+---+
    |  A|  B|
    +---+---+
    |1.0|4.0|
    |2.0|5.0|
    |3.0|6.0|
    +---+---+
    

    Using agg and the max function from pyspark.sql.functions, we can get the value as follows:

    from pyspark.sql.functions import max

    df.agg(max(df.A)).head()[0]

    This will return: 3.0

    Make sure you have the correct import: from pyspark.sql.functions import max. The max function used here is the PySpark SQL library function, not Python's built-in max.
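
    To avoid shadowing Python's built-in max with that import, a common alternative (a small sketch, not part of the original answer) is to import the functions module under an alias:

    from pyspark.sql import functions as F

    # F.max is the Spark SQL aggregate; Python's built-in max stays untouched.
    max_value = df.agg(F.max(df.A)).head()[0]   # 3.0 for the example DataFrame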

  • 2020-12-07 10:45

    In case someone wonders how to do it in Scala (using Spark 2.0+), here you go:

    scala> df.createOrReplaceTempView("TEMP_DF")
    scala> val myMax = spark.sql("SELECT MAX(x) as maxval FROM TEMP_DF").
        collect()(0).getInt(0)
    scala> print(myMax)
    117
    
  • 2020-12-07 10:49
    >df1.show()
    +-----+--------------------+--------+----------+-----------+
    |floor|           timestamp|     uid|         x|          y|
    +-----+--------------------+--------+----------+-----------+
    |    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
    |    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
    |    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
    |    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|
    
    >row1 = df1.agg({"x": "max"}).collect()[0]
    >print(row1)
    Row(max(x)=110.33613)
    >print(row1["max(x)"])
    110.33613
    

    The answer is almost the same as Method 3, but it seems the asDict() in Method 3 can be removed.

  • 2020-12-07 10:49

    I used another solution (by @satprem rath) already present in this thread.

    To find the min value of age in the dataframe:

    from pyspark.sql.functions import min

    df.agg(min("age")).show()
    
    +--------+
    |min(age)|
    +--------+
    |      29|
    +--------+
    

    edit: to add more context.

    While the above method printed the result, I faced issues when assigning the result to a variable to reuse later.

    Hence, to get only the int value assigned to a variable:

    from pyspark.sql.functions import max, min  
    
    maxValueA = df.agg(max("A")).collect()[0][0]
    maxValueB = df.agg(max("B")).collect()[0][0]
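
    If you need the max of several columns, a single agg call computes them in one Spark job instead of one job per column; a small sketch building on the snippet above:

    from pyspark.sql.functions import max as sql_max

    # One collect() returns both maxima from a single job.
    row = df.agg(sql_max("A").alias("maxA"), sql_max("B").alias("maxB")).collect()[0]
    maxValueA, maxValueB = row["maxA"], row["maxB"]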
    