pyspark approxQuantile function

粉色の甜心 2021-02-06 02:55

I have a dataframe with the columns id, price, and timestamp.

I would like to find the median value grouped by id.


3 Answers
  •  温柔的废话
    2021-02-06 03:38

    Well, indeed it is not possible to use approxQuantile to fill values in a new dataframe column, but this is not why you are getting this error. Unfortunately, the whole story underneath is rather frustrating, as is the case with many Spark (especially PySpark) features and their lack of adequate documentation.

    To start with, there are not one, but two approxQuantile methods; the first one is part of the standard DataFrame class, i.e. you don't need to import DataFrameStatFunctions:

    spark.version
    # u'2.1.1'
    
    sampleData = [("bob", "Developer", 125000), ("mark", "Developer", 108000),
                  ("carl", "Tester", 70000), ("peter", "Developer", 185000),
                  ("jon", "Tester", 65000), ("roman", "Tester", 82000),
                  ("simon", "Developer", 98000), ("eric", "Developer", 144000),
                  ("carlos", "Tester", 75000), ("henry", "Developer", 110000)]
    
    df = spark.createDataFrame(sampleData, schema=["Name","Role","Salary"])
    df.show()
    # +------+---------+------+ 
    # |  Name|     Role|Salary|
    # +------+---------+------+
    # |   bob|Developer|125000| 
    # |  mark|Developer|108000|
    # |  carl|   Tester| 70000|
    # | peter|Developer|185000|
    # |   jon|   Tester| 65000|
    # | roman|   Tester| 82000|
    # | simon|Developer| 98000|
    # |  eric|Developer|144000|
    # |carlos|   Tester| 75000|
    # | henry|Developer|110000|
    # +------+---------+------+
    
    med = df.approxQuantile("Salary", [0.5], 0.25) # no need to import DataFrameStatFunctions
    med
    # [98000.0]
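As a sanity check (no Spark needed), the exact median of the same toy salaries can be computed with the standard library; with a relative error of 0.25, the approximate result above (98000.0) is allowed to deviate from it:

```python
import statistics

# Same toy salaries as in the PySpark example above.
salaries = [125000, 108000, 70000, 185000, 65000,
            82000, 98000, 144000, 75000, 110000]

# Exact median: the mean of the two middle values for an even-sized list.
exact_median = statistics.median(salaries)
print(exact_median)  # 103000.0
```

So approxQuantile returned 98000.0 where the exact median is 103000.0 -- close, but not equal, which is exactly what the relativeError parameter permits.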
    

    The second one is part of DataFrameStatFunctions, but if you use it as you do, you get the error you report:

    from pyspark.sql import DataFrameStatFunctions as statFunc
    med2 = statFunc.approxQuantile( "Salary", [0.5], 0.25)
    # TypeError: unbound method approxQuantile() must be called with DataFrameStatFunctions instance as first argument (got str instance instead)
    

    because the correct usage is

    med2 = statFunc(df).approxQuantile( "Salary", [0.5], 0.25)
    med2
    # [82000.0]
    

    although you won't be able to find a simple example in the PySpark documentation about this (it took me some time to figure it out myself)... The best part? The two values are not equal:

    med == med2
    # False
    

    I suspect this is due to the non-deterministic algorithm used (after all, it is supposed to be an approximate median), and even if you re-run the commands with the same toy data you may get different values (and different from the ones I report here) - I suggest experimenting a little to get a feeling for it...
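The "approximate" contract can be made concrete: roughly speaking, approxQuantile(col, [p], eps) may return any value whose rank is within eps * n of p * n. A pure-Python sketch (my own simplified illustration, not Spark's actual Greenwald-Khanna implementation) of which salaries are acceptable answers for p = 0.5, eps = 0.25:

```python
# Which values are legal answers for approxQuantile("Salary", [0.5], 0.25)?
# Simplified contract (an approximation of Spark's documented guarantee):
# the returned value's 1-based rank r satisfies |r - p*n| <= eps * n.
salaries = sorted([125000, 108000, 70000, 185000, 65000,
                   82000, 98000, 144000, 75000, 110000])
n, p, eps = len(salaries), 0.5, 0.25

acceptable = [v for rank, v in enumerate(salaries, start=1)
              if abs(rank - p * n) <= eps * n]
print(acceptable)  # [75000, 82000, 98000, 108000, 110000]
```

Both 98000 (returned by df.approxQuantile above) and 82000 (returned via DataFrameStatFunctions) fall inside this acceptable set, so neither result is wrong -- they are just different valid approximations.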

    But, as I already said, this is not the reason why you cannot use approxQuantile to fill values in a new dataframe column - even if you use the correct syntax, you will get a different error:

    df2 = df.withColumn('median_salary', statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
    # AssertionError: col should be Column
    

    Here, col refers to the second argument of the withColumn operation, i.e. the approxQuantile one, and the error message says that it is not a Column type - indeed, it is a list:

    type(statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
    # list
    

    So, when filling column values, Spark expects arguments of type Column, and you cannot use lists; here is an example of creating a new column with mean values per Role instead of median ones:

    import pyspark.sql.functions as func
    from pyspark.sql import Window
    
    windowSpec = Window.partitionBy(df['Role'])
    df2 = df.withColumn('mean_salary', func.mean(df['Salary']).over(windowSpec))
    df2.show()
    # +------+---------+------+------------------+
    # |  Name|     Role|Salary|       mean_salary| 
    # +------+---------+------+------------------+
    # |  carl|   Tester| 70000|           73000.0| 
    # |   jon|   Tester| 65000|           73000.0|
    # | roman|   Tester| 82000|           73000.0|
    # |carlos|   Tester| 75000|           73000.0|
    # |   bob|Developer|125000|128333.33333333333|
    # |  mark|Developer|108000|128333.33333333333| 
    # | peter|Developer|185000|128333.33333333333| 
    # | simon|Developer| 98000|128333.33333333333| 
    # |  eric|Developer|144000|128333.33333333333|
    # | henry|Developer|110000|128333.33333333333| 
    # +------+---------+------+------------------+
    

    which works because, contrary to approxQuantile, mean returns a Column:

    type(func.mean(df['Salary']).over(windowSpec))
    # pyspark.sql.column.Column
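The per-Role means shown in the window example can be verified without Spark; a quick pure-Python cross-check of the same numbers:

```python
from collections import defaultdict

# Same toy data as the PySpark example.
sampleData = [("bob", "Developer", 125000), ("mark", "Developer", 108000),
              ("carl", "Tester", 70000), ("peter", "Developer", 185000),
              ("jon", "Tester", 65000), ("roman", "Tester", 82000),
              ("simon", "Developer", 98000), ("eric", "Developer", 144000),
              ("carlos", "Tester", 75000), ("henry", "Developer", 110000)]

by_role = defaultdict(list)
for name, role, salary in sampleData:
    by_role[role].append(salary)

means = {role: sum(s) / len(s) for role, s in by_role.items()}
print(means)  # {'Developer': 128333.33333333333, 'Tester': 73000.0}
```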
    

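Coming back to the original question (a median per group): in Spark SQL this is typically done with the percentile_approx aggregate, e.g. something like df.groupBy("id").agg(F.expr("percentile_approx(price, 0.5)")) -- check the function's availability in your Spark version. For the toy data above, the exact per-group medians that such an aggregation should approximate can be computed in plain Python:

```python
import statistics
from collections import defaultdict

# Exact per-group medians for the toy data -- the target that a grouped
# percentile_approx(Salary, 0.5) aggregation would approximate.
sampleData = [("bob", "Developer", 125000), ("mark", "Developer", 108000),
              ("carl", "Tester", 70000), ("peter", "Developer", 185000),
              ("jon", "Tester", 65000), ("roman", "Tester", 82000),
              ("simon", "Developer", 98000), ("eric", "Developer", 144000),
              ("carlos", "Tester", 75000), ("henry", "Developer", 110000)]

by_role = defaultdict(list)
for name, role, salary in sampleData:
    by_role[role].append(salary)

medians = {role: statistics.median(s) for role, s in by_role.items()}
print(medians)  # {'Developer': 117500.0, 'Tester': 72500.0}
```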