Question
I'm using Spark with the Java connector to process my data.
One of the essential operations I need to perform is counting the number of records (rows) in a DataFrame.
I tried df.count(), but execution is extremely slow (30-40 seconds for 2-3M records).
Also, because of a system requirement, I don't want to use the df.rdd().countApprox() API, since we need the exact count.
Could somebody suggest any alternatives that return exactly the same result as df.count() but run faster?
I'd highly appreciate your replies.
Answer 1:
df.cache
df.count
The first count will be slow, because the data is cached while that count runs, but subsequent counts will perform well.
Whether df.cache is worth using depends on the use case.
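Since the question uses the Java API, the same pattern can be sketched in Java. This is only an illustration under stated assumptions: the class name CachedCount, the helper countWithCache, and the spark.range data are hypothetical stand-ins for the real DataFrame, and the sketch assumes the spark-sql library is on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CachedCount {

    // Marks df for caching and runs the count that actually fills the cache.
    // cache() is lazy: nothing is stored until an action such as count() runs.
    static long countWithCache(Dataset<Row> df) {
        df.cache();
        return df.count();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cached-count")
                .master("local[*]") // assumption: local run, for illustration only
                .getOrCreate();

        // Hypothetical data standing in for the real DataFrame.
        Dataset<Row> df = spark.range(0, 3_000_000L).toDF("id");

        long first = countWithCache(df); // pays for computing and caching the data
        long second = df.count();        // reads the cached data
        System.out.println(first + " " + second);

        df.unpersist(); // release the cached blocks when they are no longer needed
        spark.stop();
    }
}
```

Both counts are exact. Caching only pays off if the DataFrame is counted or queried more than once, and it occupies executor memory until unpersist() is called.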
Answer 2:
A simple way to check whether a DataFrame has any rows is to call Try(df.head). If the result is a Success, the DataFrame has at least one row. If it is a Failure, the DataFrame is empty. Here's a Scala implementation of this.
Here is the reason why df.count() is a slow operation.
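For Java users, a rough equivalent of the emptiness check might look like the sketch below. It is not the Scala implementation referred to above: it swaps Try(df.head) for takeAsList(1), which returns an empty list instead of throwing. The class EmptyCheck and the helper hasRows are hypothetical names, and spark-sql is assumed to be on the classpath. Note that this only tells you whether at least one row exists; it does not return the exact count the question asks for.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EmptyCheck {

    // Fetches at most one row; an empty list means the DataFrame has no rows.
    static boolean hasRows(Dataset<Row> df) {
        return !df.takeAsList(1).isEmpty();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("empty-check")
                .master("local[*]") // assumption: local run, for illustration only
                .getOrCreate();

        // Hypothetical data: one empty and one non-empty DataFrame.
        Dataset<Row> empty = spark.range(0, 0).toDF("id");
        Dataset<Row> nonEmpty = spark.range(0, 5).toDF("id");

        System.out.println(hasRows(empty));
        System.out.println(hasRows(nonEmpty));

        spark.stop();
    }
}
```

Because only one row is requested, Spark can stop as soon as it finds a row instead of scanning every partition, which is why this check is usually cheaper than comparing df.count() with zero.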
Answer 3:
Count itself is very fast. You need to look at your other operations: the data loading and the transformations you apply to produce the DataFrame you are counting. That is the part slowing you down, not the count itself.
If you can reduce the amount of data you load, or cut out any transformations that don't affect the count, you may be able to speed things up. If that's not an option, you may be able to write your transformations more efficiently. Without knowing your transformations, though, it's not possible to say what the bottleneck might be.
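One rough way to test this claim on your own job is to time two consecutive counts on a cached DataFrame. The first count pays for loading, transformations, and caching. The second reads cached data and so approximates the cost of counting alone. The sketch below assumes spark-sql is on the classpath. CountTiming, timeCounts, and the filter used as a stand-in transformation are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountTiming {

    // Returns {millis for the first count, millis for the second count}.
    // The first count includes upstream work plus caching overhead;
    // the second runs against the cached data.
    static long[] timeCounts(Dataset<Row> df) {
        df.cache();
        long t0 = System.nanoTime();
        long firstCount = df.count();
        long t1 = System.nanoTime();
        long secondCount = df.count();
        long t2 = System.nanoTime();
        df.unpersist();

        if (firstCount != secondCount) {
            throw new IllegalStateException("counts differ between runs");
        }
        return new long[] {(t1 - t0) / 1_000_000L, (t2 - t1) / 1_000_000L};
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("count-timing")
                .master("local[*]") // assumption: local run, for illustration only
                .getOrCreate();

        // Hypothetical load plus transformation standing in for the real pipeline.
        Dataset<Row> df = spark.range(0, 3_000_000L).toDF("id").filter("id % 2 = 0");

        long[] millis = timeCounts(df);
        System.out.println("first count ms: " + millis[0]);
        System.out.println("second count ms: " + millis[1]);

        spark.stop();
    }
}
```

If the second figure is far smaller than the first, most of the time is going into producing the DataFrame rather than counting it, and the loading and transformation steps are the place to optimize.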
Answer 4:
I just found out that loading the data into a Spark DataFrame for further queries and counting is unnecessary.
Instead, we can use the Aerospike client to do the job, and it's much faster than the approach above.
Here's a reference on how to use the Aerospike client: http://www.aerospike.com/launchpad/query_multiple_filters.html
Thanks everyone
Source: https://stackoverflow.com/questions/45953386/alternatives-for-spark-dataframes-count-api