How to determine a dataframe size?
Right now I estimate the real size of a dataframe as follows:
headers_size = sum(len(key) for key in df.first().asDict())
ro
There is a nice post by Tamas Szuromi on this: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

def _to_java_object_rdd(rdd):
    """ Return a JavaRDD of Object by unpickling
    It will convert each Python object into Java object by Pyrolite, whenever the
    RDD is serialized in batch or not.
    """
    # Re-serialize with pickle so the JVM side can unpickle the rows
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# df is the DataFrame to measure; sc is the active SparkContext
JavaObj = _to_java_object_rdd(df.rdd)
# Ask the JVM-side SizeEstimator for an estimate in bytes
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
Currently I am using the approach below, but I am not sure it is the best way:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)
df.count()  # an action is needed to actually materialize the cache
In the Spark web UI, the Storage tab shows the size of the cached DataFrame in MB. Afterwards I unpersist it to free the memory:
df.unpersist()
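If caching the whole DataFrame just to read its size is too expensive, a rougher alternative is to measure a small sample of rows and scale by the row count. The sketch below is not taken from the posts above. It works on plain Python dicts (the shape `row.asDict()` returns), and the helper name `estimate_size_bytes` and the sample data are hypothetical. It measures pickled size, which will differ from what Spark reports for cached or on-disk data, so treat the result as an order-of-magnitude figure only.

```python
import pickle

def estimate_size_bytes(sample_rows, total_rows):
    """Extrapolate a total size from the pickled size of a sample of rows.

    Hypothetical helper: sample_rows is an iterable of dicts, total_rows is
    the row count of the full dataset.
    """
    sample_rows = list(sample_rows)
    if not sample_rows or total_rows <= 0:
        return 0
    # Average pickled size of one sampled row, in bytes
    sample_bytes = sum(len(pickle.dumps(row)) for row in sample_rows)
    avg_row_bytes = sample_bytes / len(sample_rows)
    # Scale the per-row average up to the full row count
    return int(avg_row_bytes * total_rows)

# Stand-in data; with Spark the sample would come from the DataFrame instead
sample = [{"id": i, "name": "user_%d" % i} for i in range(100)]
estimate = estimate_size_bytes(sample, total_rows=1000000)
print(estimate)
```

With a real DataFrame, the sample could be built with something like `[r.asDict() for r in df.limit(1000).collect()]` and the total taken from `df.count()`. Note that `limit` is not a random sample, so skewed data may need `df.sample(...)` instead.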