Spark dataframe checkpoint cleanup

天涯浪人 2021-01-14 08:06

I have a dataframe in Spark into which an entire partition from Hive has been loaded, and I need to break the lineage so that I can overwrite the same partition after making some modifications to the data.

1 Answer
  • 2021-01-14 08:32

    Spark has a built-in mechanism for cleaning up checkpoint files.

    Add this property to spark-defaults.conf:

    spark.cleaner.referenceTracking.cleanCheckpoints  true   # default is false
    

    You can find more details on the official Spark configuration page.

    If you want to remove the checkpoint directory yourself at the end of your script, note that Python's shutil.rmtree only works on the local filesystem; for a directory on HDFS, use hdfs dfs -rm -r (or the Hadoop FileSystem API) instead.
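
    A small helper sketching both cases (the function name is my own; the HDFS branch assumes the hdfs CLI is on PATH):

```python
import shutil
import subprocess

def remove_checkpoint_dir(path: str, hdfs: bool = False) -> None:
    """Delete a Spark checkpoint directory at the end of a job.

    shutil.rmtree only handles local paths; for an HDFS path we shell
    out to the hdfs CLI instead.
    """
    if hdfs:
        # -skipTrash deletes immediately instead of moving to .Trash.
        subprocess.run(
            ["hdfs", "dfs", "-rm", "-r", "-skipTrash", path],
            check=True,
        )
    else:
        shutil.rmtree(path, ignore_errors=True)
```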

    With spark.cleaner.referenceTracking.cleanCheckpoints set to true, the context cleaner removes old checkpoint files from the checkpoint directory once the corresponding RDDs go out of scope.
