How to estimate dataframe real size in pyspark?

删除回忆录丶 提交于 2019-11-27 01:10:28

问题


How to determine a dataframe size?

Right now I estimate the real size of a dataframe as follows:

headers_size = key for key in df.first().asDict()
rows_size = df.map(lambda row: len(value for key, value in row.asDict()).sum()
total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.


回答1:


nice post from Tamas Szuromi http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
def _to_java_object_rdd(rdd):  
    """ Return a JavaRDD of Object by unpickling
    It will convert each Python object into Java object by Pyrolite, whenever the
    RDD is serialized in batch or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

JavaObj = _to_java_object_rdd(df.rdd)

nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)



回答2:


Currently I am using the below approach, not sure if this is the best way

df.persist(StorageLevel.Memory) df.count()

On the spark-web ui under the Storage tab you can check the size which is displayed in MB's and then I do unpersist to clear the memory.

df.unpersist()



来源:https://stackoverflow.com/questions/37077432/how-to-estimate-dataframe-real-size-in-pyspark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!