Alternatives for Spark DataFrame's count() API

情到浓时终转凉″ submitted on 2020-12-12 11:39:28

Question


I'm using Spark with the Java connector to process my data.

One of the essential operations I need to do with the data is to count the number of records (rows) in a data frame.

I tried df.count(), but the execution time is extremely slow (30-40 seconds for 2-3M records).

Also, due to the system's requirements, I don't want to use the df.rdd().countApprox() API, because we need the exact count.

Could somebody suggest an alternative that returns exactly the same result as df.count(), but with faster execution time?

Highly appreciate your replies.


Answer 1:


df.cache
df.count

The first count will still be slow, since the data is cached during that first execution, but subsequent counts will give you good performance.

Whether leveraging df.cache is worthwhile depends on the use case.
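Spelled out in Scala, the pattern above looks like this. This is only a sketch: it assumes `df` is an existing DataFrame inside a running Spark application.

```scala
// Sketch: mark the DataFrame for caching, then count.
// cache() is lazy -- nothing is stored until an action runs.
df.cache()

val first  = df.count() // slow: executes the full lineage and fills the cache
val second = df.count() // fast: scans only the cached data

df.unpersist() // release the cached data once it is no longer needed
```

Note that cache() uses the MEMORY_AND_DISK storage level by default, so even if the cached data spills to disk, later counts are still much cheaper than recomputing the whole lineage.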




Answer 2:


A simple way to check whether a dataframe has rows is to do a Try(df.head). If it is a Success, there is at least one row in the dataframe; if it is a Failure, the dataframe is empty.
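A minimal Scala sketch of that check (assuming `df` is an existing DataFrame):

```scala
import scala.util.Try

// df.head throws NoSuchElementException on an empty DataFrame,
// so wrapping it in Try turns "is there at least one row?" into
// a Success/Failure without scanning the whole dataset.
val hasRows: Boolean = Try(df.head).isSuccess
```

Only the first row found needs to be fetched, so this is far cheaper than a full count when an emptiness check is all you need. (On newer Spark versions, df.isEmpty does the same job directly.)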

The reason df.count() is slow is that it is an action: it must execute the DataFrame's entire lineage of transformations in order to materialize and count every row.




Answer 3:


Count itself is very fast. You need to look at your other operations: the data loading and the transformations you run to generate the DataFrame you are counting. That is the part slowing you down, not the count itself.

If you can reduce the amount of data you load, or cut out any transformations that don't affect the count, you may be able to speed things up. If that's not an option, you may be able to write your transformations more efficiently. Without knowing your transformations, though, it's not possible to say what the bottleneck might be.
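As an illustration of that idea, here is a hedged Scala sketch; the input path, column names, and UDF are made up for the example:

```scala
import spark.implicits._

// Sketch: transformations that cannot change the number of rows
// (column derivations, renames, sorts) can be skipped when all you
// need is the count. Filters and joins DO change it and must stay.
val raw = spark.read.parquet("/data/events") // hypothetical input path

// For counting, keep only the row-changing steps:
val n = raw.filter($"status" === "active").count()

// The expensive per-row enrichment is only needed for the real
// output, not for the count:
// val enriched = raw.filter($"status" === "active")
//                   .withColumn("score", expensiveUdf($"payload"))
```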




Answer 4:


I just found out that loading the data into a Spark data frame for further queries and counting is unnecessary.

Instead, we can use the Aerospike client to do the job, and it's much faster than the approach above.

Here's a reference on how to use the Aerospike client: http://www.aerospike.com/launchpad/query_multiple_filters.html

Thanks, everyone.



Source: https://stackoverflow.com/questions/45953386/alternatives-for-spark-dataframes-count-api
