PySpark: simple repartition and toPandas() fail to finish on just 600,000+ rows
Question

I have JSON data that I am reading into a DataFrame with several fields, repartitioning on two columns, and converting to Pandas. This job keeps failing on EMR on just 600,000+ rows of data with some obscure errors. I have also increased the memory settings of the Spark driver, and still don't see any resolution. Here is my PySpark code:

```python
enhDataDf = (
    sqlContext
    .read.json(sys.argv[1])
)

enhDataDf = (
    enhDataDf
    .repartition('column1', 'column2')
    .toPandas()
)
enhDataDf = sqlContext
```
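For reference, `toPandas()` collects the entire DataFrame onto the driver, so the driver-side settings are the ones that matter here. Below is a minimal sketch of the kind of configuration that was raised; the values and the use of `SparkSession.builder` are illustrative assumptions, not the exact job setup:

```python
from pyspark.sql import SparkSession

# Illustrative driver-side settings (values are assumptions, not the actual
# job configuration). toPandas() collects every row onto the driver, so the
# driver heap and the collect-result cap are the relevant knobs.
spark = (
    SparkSession.builder
    # Driver JVM heap. Note: when the job is launched with spark-submit, the
    # driver JVM already exists before this code runs, so this is normally
    # passed on the command line (--conf spark.driver.memory=8g) instead.
    .config('spark.driver.memory', '8g')
    # Cap on the total size of results collected to the driver (default 1g);
    # toPandas() fails once the collected data exceeds it. '0' disables the cap.
    .config('spark.driver.maxResultSize', '4g')
    .getOrCreate()
)
```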