Question
I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If so, the entire column is set to missing (null) for that group.
from pyspark.sql import functions as F

# spark is an existing SparkSession
df = spark.read.parquet('path/to/parquet/')
check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8

for col, _ in check_columns.items():
    total = (df
             .groupBy('groupID').count()
             .toDF('groupID', 'n_total')
             )
    missing = (df
               .where(F.col(col).isNull())
               .groupBy('groupID').count()
               .toDF('groupID', 'n_missing')
               )
    # missing = missing.persist()  # PERSIST TRY 1
    # print('col {} found {} missing'.format(col, missing.count()))  # missing.count() is b/w 1k-5k
    poor_df = (total
               .join(missing, 'groupID')
               .withColumn('freq', F.col('n_missing') / F.col('n_total'))
               .where(F.col('freq') > 0.5)
               .select('groupID')
               .toDF('poor_groupID')
               )
    df = (df
          .join(poor_df, df['groupID'] == poor_df['poor_groupID'], 'left_outer')
          .withColumn(col, (F.when(F.col('poor_groupID').isNotNull(), None)
                            .otherwise(df[col])
                            )
                      )
          .select(df.columns)
          )
    stats = (missing
             .withColumnRenamed('n_missing', 'cnt')
             .collect()  # FAIL 1
             )
    # df = df.persist()  # PERSIST TRY 2
    print(df.count())  # FAIL 2
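For reference, the rule that each loop iteration implements boils down to the following per-column aggregation (only a sketch of the intended logic to make the goal explicit, not the code I actually run; 'col1' is a placeholder for any checked column):

from pyspark.sql import functions as F

c = 'col1'  # placeholder: any single column from check_columns
null_frac = (df
             .groupBy('groupID')
             .agg(F.avg(F.col(c).isNull().cast('int')).alias('null_frac'))
             )
# groups in which more than 50% of the values of column c are null
poor_groups = null_frac.where(F.col('null_frac') > 0.5).select('groupID')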
I initially assigned 1G of spark.driver.memory and 4G of spark.executor.memory, eventually increasing spark.driver.memory up to 10G.
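For completeness, this is roughly how the memory settings are applied (a sketch; note that spark.driver.memory generally only takes effect if it is set before the driver JVM is launched, e.g. via spark-submit --driver-memory or spark-defaults.conf, not on an already-running session):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set('spark.driver.memory', '10g')    # raised step by step from 1g
        .set('spark.executor.memory', '4g'))
spark = SparkSession.builder.config(conf=conf).getOrCreate()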
Problem(s): The loop runs like a charm during the first iterations, but towards the end, around the 6th or 7th iteration, I see my CPU utilization dropping (using 1 instead of 6 cores). Along with that, the execution time for one iteration increases significantly. At some point, I get an OutOfMemory Error:
- spark.driver.memory < 4G: at collect() (FAIL 1)
- 4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)
Stack trace for the FAIL 1 case (relevant part):
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling o1061.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
[...]
The executor UI does not reflect excessive memory usage (it shows <50k of used memory for the driver and <1G for the executor). Neither does the Spark metrics system (app-XXX.driver.BlockManager.memory.memUsed_MB): it shows 600M to 1200M of used memory, but always >300M of remaining memory. (This would suggest that 2G of driver memory should do it, but it doesn't.)
It also does not matter which column is processed first (as it is a loop over a dict(), it can be in arbitrary order).
My questions thus:
- What causes the OutOfMemory Error, and why are not all available CPU cores used towards the end?
- And why do I need 10G of spark.driver.memory when I am transferring only a few kB from the executors to the driver?
A few (general) questions to make sure I understand things properly:
- If I get an OOM error, the right place to look is almost always the driver (b/c the executor spills to disk)?
- Why would count() cause an OOM error? I thought this action would only consume resources on the executor(s), delivering just a few bytes to the driver.
- Are the memory metrics (metrics system, UI) mentioned above the correct places to look at?
BTW: I run Spark 2.1.0 in standalone mode.
UPDATE 2017-04-28
To drill down further, I enabled a heap dump for the driver:
from pyspark import SparkConf

cfg = SparkConf()
cfg.set('spark.driver.extraJavaOptions', '-XX:+HeapDumpOnOutOfMemoryError')
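The dump location can also be pinned with the standard HotSpot flag -XX:HeapDumpPath (the path below is just an example):

cfg.set('spark.driver.extraJavaOptions',
        '-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver-oom.hprof')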
I ran it with 8G of spark.driver.memory and analyzed the heap dump with Eclipse MAT. It turns out there are two classes of considerable size (~4G each):
java.lang.Thread
  - char (2G)
  - scala.collection.IndexedSeqLike
    - scala.collection.mutable.WrappedArray (1G)
  - java.lang.String (1G)

org.apache.spark.sql.execution.ui.SQLListener
  - org.apache.spark.sql.execution.ui.SQLExecutionUIData (various of up to 1G in size)
    - java.lang.String
    - ...
I tried to turn off the UI, using
cfg.set('spark.ui.enabled', 'false')
which made the UI unavailable but didn't help with the OOM error. Also, I tried to have the UI keep less history, using
cfg.set('spark.ui.retainedJobs', '1')
cfg.set('spark.ui.retainedStages', '1')
cfg.set('spark.ui.retainedTasks', '1')
cfg.set('spark.sql.ui.retainedExecutions', '1')
cfg.set('spark.ui.retainedDeadExecutors', '1')
This also did not help.
UPDATE 2017-05-18
I found out about Spark's pyspark.sql.DataFrame.checkpoint method. This is like persist, but it also gets rid of the DataFrame's lineage. Thus it helps to circumvent the above-mentioned issues.
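A minimal sketch of how this fits into the loop above (assuming sc is the active SparkContext; the checkpoint directory is just an example path and should point to reliable storage such as HDFS on a cluster):

sc.setCheckpointDir('/tmp/spark-checkpoints')  # required before calling checkpoint()

for col, _ in check_columns.items():
    # ... same per-column logic as above ...
    df = df.checkpoint()  # materializes df and truncates its lineage,
                          # so the query plan no longer grows with each iteration
    print(df.count())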
Source: https://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace