PySpark reduceByKey causes out of memory
I'm trying to run a job in Yarn mode that processes a large amount of data (2TB) read from Google Cloud Storage. My pipeline works just fine with 10GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy

Here is the rest of the pipeline:

    input.groupByKey()\
        [...] processing on sorted groups for each key shard
        .mapPartitions(sendPartition)\
        .map(mergeShardsbyKey)\
        .reduceByKey(lambda list1, list2: list1 + list2).take(10)
    [...] output

The map function that is applied over the partitions is the following:

    def sendPartition