I have a folder with 150 GB of txt files (around 700 files, on average 200 MB each).
I'm using Scala to process the files and calculate some aggregate statistics in the end.
To add another perspective based on code (as opposed to configuration): sometimes it's best to figure out at which stage your Spark application is running out of memory, and then see whether you can change the code to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The cause was that I was collecting all the results back in the driver rather than letting the tasks save the output.
E.g. the collect-and-print loop below was replaced with a direct save:

# Before: pulls every result back to the driver and triggers the OOM
for item in processed_data.collect():
    print(item)

# After: each task writes its own partition of the output
processed_data.saveAsTextFile(output_dir)
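
Since the question uses Scala, here is a minimal sketch of the same idea in Scala. The input path, output path, and the word-count aggregation are placeholders standing in for whatever statistics you are actually computing:

import org.apache.spark.sql.SparkSession

object AggregateStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregate-stats").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder paths: point these at your txt files and an output directory
    val inputDir  = "/data/txt-files"
    val outputDir = "/data/stats-output"

    // Placeholder aggregation: word counts stand in for your real statistics
    val stats = sc.textFile(inputDir)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // Avoid stats.collect().foreach(println): it pulls the entire result into
    // driver memory and will hit the same OOM once the result no longer fits.
    stats.saveAsTextFile(outputDir)

    spark.stop()
  }
}

With saveAsTextFile each task writes its own part file straight from the executors, so the full result set never has to fit in the driver's memory.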