I raised a Jira issue: https://issues.apache.org/jira/browse/SPARK-19304
The problem is that each iteration reads more data than is required and then discards the excess. This can be avoided by passing a limit to the AWS getRecords call.
Fix: https://github.com/apache/spark/pull/16842
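
To illustrate why the limit matters, here is a small simulation of the over-read. It does not call the AWS SDK or Spark; `read_range`, `page_size` and `use_limit` are hypothetical names made up for this sketch, and the in-memory list only stands in for a shard.

```python
def read_range(shard, start, count, page_size, use_limit):
    """Read `count` records starting at index `start`.

    Returns (records, fetched), where `fetched` is how many records were
    pulled from the (simulated) shard, including any that were discarded.
    """
    records = []
    fetched = 0
    pos = start
    while len(records) < count and pos < len(shard):
        remaining = count - len(records)
        # With a limit, never ask for more than is still needed.
        limit = min(page_size, remaining) if use_limit else page_size
        page = shard[pos:pos + limit]
        fetched += len(page)
        # Anything beyond `remaining` is thrown away.
        records.extend(page[:remaining])
        pos += len(page)
    return records, fetched


shard = list(range(100_000))  # stand-in for the records in one shard

records_without, fetched_without = read_range(
    shard, start=0, count=10, page_size=10_000, use_limit=False)
records_with, fetched_with = read_range(
    shard, start=0, count=10, page_size=10_000, use_limit=True)

print(fetched_without, fetched_with)
```

Both calls return the same 10 records, but the unlimited read pulls a full page to get them. The real cost in recovery is network transfer rather than list slicing, so the numbers here only show the shape of the problem.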