I raised a Jira issue: https://issues.apache.org/jira/browse/SPARK-19304
The issue is that we read more data per iteration than is required and then discard it. This can be avoided by adding a limit to the getRecords AWS call.
Fix: https://github.com/apache/spark/pull/16842
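The real change lives in the pull request above; the sketch below only illustrates the underlying idea of capping a single GetRecords request so a shard read fetches roughly what the micro-batch needs. The shard iterator and the limit value are hypothetical placeholders:

    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
    import com.amazonaws.services.kinesis.model.GetRecordsRequest

    // Hypothetical sketch: cap one read from a Kinesis shard instead of pulling
    // far more records than the current batch will use and discarding the rest.
    val kinesisClient = AmazonKinesisClientBuilder.defaultClient()

    val shardIterator: String = ???   // placeholder; obtained earlier via getShardIterator
    val maxRecordsPerCall = 1000      // assumed cap; tune to the batch size you expect

    val request = new GetRecordsRequest()
      .withShardIterator(shardIterator)
      .withLimit(maxRecordsPerCall)   // the limit that prevents over-reading

    val records = kinesisClient.getRecords(request).getRecords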
I had similar issues before; my application kept getting slower and slower.
Try to release memory after you are done with an RDD by calling rdd.unpersist():
https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#unpersist(boolean)
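A minimal sketch of that pattern, assuming a plain batch job (the input path and transformation are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("UnpersistExample"))

    // Cache the RDD while it is being reused, then explicitly free the memory.
    val rdd = sc.textFile("hdfs:///path/to/input")   // hypothetical input path
      .map(_.toUpperCase)
      .persist()

    rdd.count()          // first action materializes and caches the blocks
    rdd.count()          // later actions reuse the cached blocks
    rdd.unpersist(true)  // blocking = true waits until the blocks are actually removed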
Or set spark.streaming.backpressure.enabled to true:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval
http://spark.apache.org/docs/latest/streaming-programming-guide.html#requirements
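For example, something like this when building the conf (the initialRate value is only an illustrative assumption):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("BackpressureExample")
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional: cap the rate for the very first batches, before the
      // backpressure feedback loop has any processing times to work with.
      .set("spark.streaming.backpressure.initialRate", "1000")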
Also, check your locality setting; maybe too much data is being moved around. See the sketch below.
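One knob worth experimenting with here is spark.locality.wait; the value below is just an illustrative assumption, not a recommendation:

    import org.apache.spark.SparkConf

    // Raising the wait makes tasks hold out longer for a data-local executor
    // (less data shipped over the network); lower it if tasks sit idle waiting.
    val conf = new SparkConf().set("spark.locality.wait", "10s")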
When a failed driver is restarted, it first runs the recovery steps described in the reference below: rebuilding the StreamingContext from the checkpoint, recovering block metadata, and regenerating the incomplete jobs. Since all these steps are performed at the driver, your batch of 0 events takes so much time. This should happen with the first batch only; after that things will be back to normal.
Reference here.
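For context, a minimal sketch of the checkpoint-based setup that triggers this recovery work on restart (the checkpoint path and batch interval are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"   // hypothetical path

    // Called only when no checkpoint exists yet; on restart the driver instead
    // rebuilds the context (and the pending batches) from the checkpoint data.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("RecoverableStreamingApp")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define DStream sources and transformations here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()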