How to find PySpark DataFrame memory usage?

情深已故 2021-02-03 12:29

For a pandas DataFrame, the info() function reports memory usage. Is there any equivalent in PySpark? Thanks

4 Answers
  •  一生所求
    2021-02-03 12:57

    I have a rough estimation approach in mind. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does, so what you can do is:

    1. Select a 1% sample: sample = df.sample(fraction=0.01)
    2. Convert the sample to pandas: pdf = sample.toPandas()
    3. Get the pandas DataFrame memory usage from pdf.info() (or pdf.memory_usage(deep=True).sum())
    4. Multiply that value by 100; this should give a rough estimate of the whole Spark DataFrame's memory usage (see the sketch after this list).
    5. Correct me if I am wrong :|
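
    A minimal sketch of that recipe, assuming an existing SparkSession and a placeholder input path; it uses pdf.memory_usage(deep=True).sum() so the byte count can be scaled numerically rather than read off pdf.info():

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("memory-estimate").getOrCreate()
        df = spark.read.parquet("/path/to/data")   # placeholder dataset

        fraction = 0.01                            # 1% sample, as in step 1
        sample = df.sample(fraction=fraction)

        pdf = sample.toPandas()                    # bring the sample to the driver as pandas
        sample_bytes = pdf.memory_usage(deep=True).sum()   # bytes used by the sample in pandas

        estimated_bytes = sample_bytes / fraction  # scale back up to the full DataFrame
        print(f"Estimated size: {estimated_bytes / 1024**2:.1f} MiB")

    Keep in mind this measures the sample's footprint in pandas on the driver, which is not identical to Spark's internal in-memory representation, so treat the result as an order-of-magnitude estimate.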
