How to find pyspark dataframe memory usage?

情深已故 2021-02-03 12:29

For a Python (pandas) DataFrame, the info() function reports memory usage. Is there any equivalent in PySpark? Thanks
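
For reference, this is roughly what I mean on the pandas side (a minimal sketch with throwaway example data):

    import pandas as pd

    df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})
    df.info(memory_usage="deep")       # overall memory usage of the frame
    print(df.memory_usage(deep=True))  # per-column memory usage in bytes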

4 Answers
  •  面向向阳花
    2021-02-03 13:00

    Try using the _to_java_object_rdd() helper together with Spark's SizeEstimator:

    from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

    # Helper function to convert a Python RDD into a JavaRDD of Java objects
    def _to_java_object_rdd(rdd):
        """Return a JavaRDD of Object by unpickling.

        It will convert each Python object into a Java object via Pyrolite,
        whether or not the RDD is serialized in batches.
        """
        rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
        return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

    # df is the DataFrame whose size you want to estimate;
    # first convert it to an RDD of Java objects
    JavaObj = _to_java_object_rdd(df.rdd)

    # Now run the JVM-side estimator (sc is the active SparkContext)
    size_in_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
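
    A minimal end-to-end sketch of how this can be wired together (the local SparkSession, the sample data, and the variable names are just illustrative, not part of the original answer):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("size-estimate").getOrCreate()
    sc = spark.sparkContext

    # Small sample DataFrame, purely for demonstration
    df = spark.createDataFrame([(i, str(i)) for i in range(1000)], ["id", "value"])

    JavaObj = _to_java_object_rdd(df.rdd)
    size_in_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
    print(f"Estimated size: {size_in_bytes} bytes")

    Keep in mind the result is an estimate of the in-memory size of the deserialized Java objects, so it may differ from the size the Spark UI reports for a DataFrame cached in serialized form.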
    
