I know there are plenty of questions on SO about out of memory errors on Spark but I haven't found a solution to mine.
I have a simple workflow:
When you call collect on the dataframe, there are 2 things happening:

1. All of the data has to be sent from the executors to the driver.
2. The driver has to hold the entire dataset in its own memory, so the job fails with an OutOfMemoryError whenever the result is larger than the driver heap.
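A minimal PySpark sketch of that failure mode (the session setup, input path, and app name below are placeholders, since the question's actual workflow isn't shown):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-oom-demo").getOrCreate()

# Hypothetical input; substitute the dataframe from your own workflow.
df = spark.read.parquet("s3a://some-bucket/some-table/")

# collect() ships every partition from the executors to the driver and
# builds a single local list there, so the whole result must fit inside
# spark.driver.memory or the driver dies with java.lang.OutOfMemoryError.
rows = df.collect()
print(len(rows))
```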
Answer:
If you are looking to just load the data into the memory of the executors, count() is also an action, and together with cache() or persist() it materializes the data in executor memory, where it can be reused by subsequent jobs.
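As a sketch of that pattern (reusing the hypothetical df from above), caching the dataframe and triggering it with count() materializes the rows in executor memory without routing anything through the driver:

```python
from pyspark import StorageLevel

# cache()/persist() only marks the dataframe for storage; count() is the
# action that actually materializes it. The rows stay distributed across
# the executors, so the driver heap is never a bottleneck here.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

# Later jobs read from the executor-side cache instead of re-scanning the
# source ("value" is a hypothetical column name).
filtered_count = df.filter(df["value"] > 0).count()
```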
If you want to extract the data to the driver, then try raising the result-size limit along with the other memory properties when pulling the data: `--conf spark.driver.maxResultSize=10g`.
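For instance, a minimal sketch of setting that limit at session creation (10g is just the figure from the flag above; size it to your expected result):

```python
from pyspark.sql import SparkSession

# spark.driver.maxResultSize caps the total serialized size that a single
# action such as collect() may send back to the driver; exceeding the cap
# aborts the job instead of silently OOMing. With spark-submit you would
# pass it as shown above:
#   spark-submit --conf spark.driver.maxResultSize=10g your_app.py
spark = (
    SparkSession.builder
    .appName("big-collect")  # hypothetical app name
    .config("spark.driver.maxResultSize", "10g")
    .getOrCreate()
)

# The collected rows still have to fit on the driver heap, so the driver's
# JVM memory (spark-submit --driver-memory) usually needs raising as well.
```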