How to find PySpark DataFrame memory usage?

情深已故 2021-02-03 12:29

For a pandas DataFrame, the info() function reports memory usage. Is there an equivalent in PySpark? Thanks.

4 answers
  • 2021-02-03 12:57

    I have something in mind, though it's just a rough estimate. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but pandas does. So what you can do is:

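    1. Take a 1% sample of the data: sample = df.sample(fraction=0.01)
    2. Convert the sample (not the full DataFrame) to pandas: pdf = sample.toPandas()
    3. Get the pandas DataFrame's memory usage with pdf.info() (pass memory_usage="deep" so that string columns are counted in full)
    4. Multiply that value by 100; this should give a rough estimate of the whole Spark DataFrame's memory usage.
    5. Correct me if I am wrong :|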
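A minimal sketch of the sampling steps above, assuming the sample is small enough to fit in driver memory. The function name `estimate_memory_bytes` and its parameters are made up for illustration. It uses `memory_usage(deep=True)` rather than `info()` so that the result comes back as a number. Because `sample()` only returns approximately the requested fraction of rows, treat the result as a rough figure.

```python
def estimate_memory_bytes(df, fraction=0.01, seed=None):
    """Roughly estimate the in-memory size of a Spark DataFrame, in bytes.

    Hypothetical helper: `df` is expected to be a pyspark.sql.DataFrame.
    The number returned is the pandas footprint of a sample, scaled up;
    it is not the size Spark itself would use to store the data.
    """
    fraction = float(fraction)  # PySpark expects a float here
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")

    # 1. Take a random sample (the fraction is approximate, not exact).
    sample = df.sample(fraction=fraction, seed=seed)

    # 2. Collect the sample to the driver as a pandas DataFrame.
    pdf = sample.toPandas()

    # 3. Sum per-column usage; deep=True also counts string contents.
    sample_bytes = int(pdf.memory_usage(deep=True).sum())

    # 4. Scale back up to the full DataFrame.
    return int(round(sample_bytes / fraction))
```

For example, `estimate_memory_bytes(df, fraction=0.01) / 1024**2` would give the estimate in MiB.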
  • 2021-02-03 13:00

    Try the _to_java_object_rdd() helper below, which converts the DataFrame's RDD into Java objects so that Spark's SizeEstimator can be run on it:

    from pyspark import SparkContext
    from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    
    # df is the DataFrame whose size you want to estimate
    
    # Helper function to convert python object to Java objects
    def _to_java_object_rdd(rdd):  
        """Return a JavaRDD of Object by unpickling.

        Each Python object is converted into a Java object by Pyrolite, whether or
        not the RDD is serialized in batches.
        """
        rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
        return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
    
    # First convert the DataFrame's RDD to an RDD of Java objects
    java_obj = _to_java_object_rdd(df.rdd)

    # Now run the estimator; the result is a size in bytes
    sc = SparkContext.getOrCreate()
    sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
    
  • 2021-02-03 13:04

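    How about the approach below? It caches a 1% sample. The cached size is reported in KB (for example on the Storage tab of the Spark web UI); multiply it by 100 to estimate the real size. Note that count() itself returns the number of rows, not a size.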

    df.sample(fraction = 0.01).cache().count()
    
  • 2021-02-03 13:21

    You can persist the DataFrame in memory and trigger an action such as df.count(). You will then be able to check its size under the Storage tab of the Spark web UI. Let me know if it works for you.
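If you would rather read the number from code than from the web UI, here is a sketch under some assumptions. It goes through `_jsc`, a private PySpark attribute, to reach the JVM-side `SparkContext.getRDDStorageInfo()` developer API, so it may break between versions. `cached_size_bytes` is a name made up for illustration, and the function sums every cached dataset in the application, so unpersist anything else first.

```python
def cached_size_bytes(df, spark_context):
    """Persist `df`, materialize it, and return the bytes Spark reports as cached.

    Hypothetical helper. `df` is expected to be a pyspark.sql.DataFrame and
    `spark_context` the active pyspark.SparkContext.
    """
    # Mark the DataFrame for caching, then run an action to fill the cache.
    df.persist()
    df.count()

    # Private access: _jsc wraps the Java SparkContext behind PySpark.
    infos = spark_context._jsc.sc().getRDDStorageInfo()

    # Each entry reports how much of a cached dataset sits in memory and on disk.
    # Note: this covers ALL cached datasets, not only `df`.
    return sum(info.memSize() + info.diskSize() for info in infos)
```

Something like `cached_size_bytes(df, spark.sparkContext)` would be the call, where `spark` is your SparkSession. Remember to call `df.unpersist()` afterwards if the cache is no longer needed.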
