How to find PySpark DataFrame memory usage?

情深已故 2021-02-03 12:29

For a pandas DataFrame, the info() function reports memory usage. Is there an equivalent in PySpark? Thanks

4 Answers
  • 2021-02-03 12:57

    I have something in mind; it's just a rough estimation. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is (see the sketch after this list):

    1. Select a 1% sample: sample = df.sample(fraction=0.01)
    2. Convert it to pandas: pdf = sample.toPandas()
    3. Get the pandas DataFrame's memory usage from pdf.info()
    4. Multiply that value by 100; this should give a rough estimate of the whole Spark DataFrame's memory usage.
    5. Correct me if I am wrong :|
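    A minimal sketch of that recipe, assuming df is an existing PySpark DataFrame; memory_usage(deep=True) is used instead of info() so the size comes back as a number:

    # Sample 1% of the data and pull it into pandas
    fraction = 0.01
    sample_pdf = df.sample(fraction=fraction).toPandas()

    # deep=True also counts the memory held by object (string) columns
    sample_bytes = sample_pdf.memory_usage(deep=True).sum()

    # Scale the sample back up to estimate the full DataFrame
    estimated_bytes = sample_bytes / fraction
    print(f"Estimated size: {estimated_bytes / 1024 ** 2:.1f} MiB")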
  • 2021-02-03 13:00

    Try the _to_java_object_rdd() helper together with Spark's SizeEstimator:

    from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

    # df is the DataFrame you want to estimate; sc is the active SparkContext
    # (spark.sparkContext if you only hold a SparkSession).

    def _to_java_object_rdd(rdd):
        """Return a JavaRDD of Object by unpickling.

        Each Python object is converted into a Java object by Pyrolite,
        whether or not the RDD is serialized in batches.
        """
        rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
        return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

    # First convert the DataFrame to a JavaRDD of plain Java objects
    java_obj = _to_java_object_rdd(df.rdd)

    # Now we can run the estimator
    size_in_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
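
    SizeEstimator.estimate returns a size in bytes; it walks the deserialized Java objects on the JVM heap, so the result roughly reflects the in-memory (deserialized) size rather than the serialized size.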
    
  • 2021-02-03 13:04

    How about the below? Cache a 1% sample and materialize it with count(); the Spark UI then reports the cached size (in KB/MB), and multiplying by 100 gives an estimate of the real size.

    df.sample(fraction=0.01).cache().count()
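
    A slightly fuller sketch of the same idea (the variable name and the unpersist step are my additions):

    # Cache a 1% sample and materialize it with an action so its size appears
    # under the Storage tab of the Spark web UI.
    sampled = df.sample(fraction=0.01).cache()
    sampled.count()

    # Multiply the "Size in Memory" shown there by 100 for a rough estimate,
    # then release the sample.
    sampled.unpersist()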
    
  • 2021-02-03 13:21

    You can persist the DataFrame in memory and take an action such as df.count(). You will then be able to check its size under the Storage tab of the Spark web UI. Let me know if that works for you.
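
    A minimal sketch of that approach (the MEMORY_ONLY level and the unpersist call are just illustrative choices):

    from pyspark import StorageLevel

    # Persist the DataFrame in memory and force materialization with an action
    df.persist(StorageLevel.MEMORY_ONLY)
    df.count()

    # Open the Storage tab of the Spark web UI (http://<driver-host>:4040 by default)
    # and read the "Size in Memory" reported for this DataFrame.

    # Release the memory once you have the number
    df.unpersist()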
