For a pandas DataFrame, the info() function reports memory usage. Is there an equivalent in PySpark? Thanks
I have something in mind, but it's just a rough estimate. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is:
# Sample 1% of the DataFrame and bring only that sample to the driver
sample = df.sample(fraction=0.01)
pdf = sample.toPandas()
pdf.info()
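If you want a number back rather than info()'s printed report, a rough sketch (assuming the 1% sample is representative of the full data) is to sum pandas' memory_usage(deep=True) over the sample and divide by the fraction:

fraction = 0.01
pdf = df.sample(fraction=fraction).toPandas()

# deep=True counts the contents of object/string columns, not just the pointers
sample_bytes = pdf.memory_usage(deep=True).sum()
print(f"Estimated full size: {sample_bytes / fraction / 1024**2:.1f} MiB")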
Try using the _to_java_object_rdd() helper below:
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

# Helper function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Converts each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# First convert the DataFrame you want to estimate to a JavaRDD
java_obj = _to_java_object_rdd(df.rdd)

# Now run the estimator; the result is in bytes (sc is your SparkContext)
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
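For convenience, here is a minimal sketch that wraps both steps in one function (estimate_df_size_bytes is a name I made up, not a Spark API):

def estimate_df_size_bytes(df):
    """Rough in-memory size of a DataFrame in bytes (hypothetical helper)."""
    java_obj = _to_java_object_rdd(df.rdd)
    return df.rdd.ctx._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)

print(f"Estimated size: {estimate_df_size_bytes(df) / 1024**2:.1f} MiB")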
How about the approach below? Cache a 1% sample, then multiply the cached size shown on the Spark UI's Storage tab by 100 to get the estimated real size.
# Materialize the 1% sample in the cache
df.sample(fraction=0.01).cache().count()
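If you'd rather read that size programmatically than from the UI, here is a sketch via the JVM SparkContext's getRDDStorageInfo (an internal interface reached through py4j, so treat the exact calls as an assumption for your Spark version):

fraction = 0.01
df.sample(fraction=fraction).cache().count()

# Each entry describes one cached RDD; memSize() is its in-memory size in bytes
for rdd_info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
    estimated = rdd_info.memSize() / fraction  # scale the sample back up
    print(rdd_info.name(), f"~{estimated / 1024**2:.1f} MiB estimated full size")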
You can persist the DataFrame in memory and run an action such as df.count(). You can then check its size under the Storage tab of the Spark web UI. Let me know if it works for you.
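A minimal sketch of that suggestion (the web UI runs on the driver, port 4040 by default):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
df.count()  # action that materializes the cache

# Now open http://<driver-host>:4040 and check the Storage tab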