Spark request max count

£可爱£侵袭症+ submitted on 2019-12-12 14:00:56

Question


I'm a beginner with Spark and I'm trying to write a query that retrieves the most visited web pages.

My query is the following:

mostPopularWebPageDF = logDF.groupBy("webPage").agg(functions.count("webPage").alias("cntWebPage")).agg(functions.max("cntWebPage")).show()

With this query I get a DataFrame containing only the max count, but I want a DataFrame with both this count and the web page that holds it.

Something like this:

webPage            max(cntWebPage)
google.com         2

How can I fix my problem?

Thanks a lot.


Answer 1:


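In PySpark + SQL:

# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is the newer equivalent.
logDF.registerTempTable("logDF")

# Count rows per page, attach the overall maximum count via an empty window,
# then keep the pages whose count equals that maximum.
mostPopularWebPageDF = sqlContext.sql("""
    select webPage, cntWebPage
    from (
        select webPage, count(*) as cntWebPage, max(count(*)) over () as maxcnt
        from logDF
        group by webPage
    ) as tmp
    where tmp.cntWebPage = tmp.maxcnt""")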

Maybe I can make it cleaner, but it works. I will try to optimize it.

My result:

webPage      cntWebPage
google.com   2

for the dataset:

webPage    usersid
google.com 1
google.com 3
bing.com   10

Explanation: the per-page counts come from grouping plus the count(*) function. The maximum of all these counts is computed with a window function, so for the dataset above the intermediate DataFrame (before the maxcnt column is dropped) is:

webPage    cntWebPage  maxcnt
google.com 2           2
bing.com   1           2

Then we select the rows where cntWebPage equals maxcnt.
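
The same steps can be mirrored in plain Python, without Spark, to check the logic on the sample dataset. This is only a sketch: the `rows` list and the `most_visited` helper are hypothetical names introduced for illustration and are not part of the original answer.

```python
from collections import Counter

# Sample dataset from the answer: (webPage, usersid) pairs.
rows = [("google.com", 1), ("google.com", 3), ("bing.com", 10)]


def most_visited(rows):
    """Return (webPage, count) pairs whose count equals the overall maximum."""
    # Step 1: group by webPage and count rows (GROUP BY + count(*)).
    counts = Counter(page for page, _ in rows)
    if not counts:
        return []
    # Step 2: the maximum over all groups (max(count(*)) over ()).
    max_count = max(counts.values())
    # Step 3: keep only the groups whose count equals that maximum.
    return [(page, cnt) for page, cnt in counts.items() if cnt == max_count]


print(most_visited(rows))  # [('google.com', 2)]
```

As with the SQL query, pages tied for the maximum are all returned, not just one of them.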

EDIT: I have deleted the DSL version because it does not support a window over (), and adding an ordering changes the result. Sorry for this bug. The SQL version is correct.
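
One plausible reading of the ordering problem mentioned in the edit (an assumption, not something the answer states): when a window has an ORDER BY but no explicit frame, the default frame in Spark SQL runs from the start of the partition to the current row, so max becomes a running maximum instead of a global one. The plain-Python sketch below only imitates that difference; the `counts` list is hypothetical sample data, assumed to be ordered by webPage.

```python
from itertools import accumulate

# Hypothetical per-page counts, in the order the window would visit them.
counts = [("bing.com", 1), ("google.com", 2), ("yahoo.com", 1)]
values = [cnt for _, cnt in counts]

# Empty window, like "over ()": every row sees the same global maximum.
global_max = [max(values)] * len(values)

# Ordered window with the default frame: each row sees only the maximum so far.
running_max = list(accumulate(values, max))

# Pages whose count equals the maximum under each definition.
with_global = [page for (page, cnt), m in zip(counts, global_max) if cnt == m]
with_running = [page for (page, cnt), m in zip(counts, running_max) if cnt == m]

print(with_global)   # ['google.com']
print(with_running)  # ['bing.com', 'google.com']
```

With the running maximum, bing.com is wrongly kept because it is the largest count seen up to its own row.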



Source: https://stackoverflow.com/questions/40817728/spark-request-max-count
