Question
On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached.
How can I retrieve this percentage programmatically?
I can use getStorageLevel() to get some information about RDD caching, but not Fraction Cached.
Do I have to calculate it myself?
Answer 1:
SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo objects, each of which provides information about:
- Memory size.
- Total number of partitions.
- Number of cached partitions.
It is not directly exposed in PySpark, so you'll have to be a bit creative:
from operator import truediv  # true (float) division on both Python 2 and 3

# getRDDStorageInfo is not exposed in PySpark, so reach through the Java gateway
storage_info = sc._jsc.sc().getRDDStorageInfo()

# One summary dict per persisted RDD, including the fraction of cached partitions
[{
    "memSize": s.memSize(),
    "numPartitions": s.numPartitions(),
    "numCachedPartitions": s.numCachedPartitions(),
    "fractionCached": truediv(s.numCachedPartitions(), s.numPartitions())
} for s in storage_info]
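This can also be wrapped in a small helper. A minimal sketch (the fraction_cached helper and the "demo" RDD are my own illustration, not part of any Spark API; it reuses the truediv import from above) that formats the result the way the UI's Fraction Cached column does:

def fraction_cached(sc, rdd_name):
    # Scan the JVM-side storage info for a persisted RDD with a matching name
    for s in sc._jsc.sc().getRDDStorageInfo():
        if s.name() == rdd_name:
            return truediv(s.numCachedPartitions(), s.numPartitions())
    return None  # not persisted, or not materialized yet

rdd = sc.parallelize(range(1000), 10).setName("demo").cache()
rdd.count()  # force materialization so partitions actually get cached
print("{0:.0%}".format(fraction_cached(sc, "demo")))  # e.g. 100%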
If you have access to the REST API, you can of course use it directly:
import requests

# host and port are placeholders for wherever the application UI is reachable
url = "http://{0}:{1}/api/v1/applications/{2}/storage/rdd/".format(
    host, port, sc.applicationId
)

# List all persisted RDDs, fetch the detail record for each id, and keep
# only the responses that succeeded
[r.json() for r in [
    requests.get("{0}{1}".format(url, rdd.get("id")))
    for rdd in requests.get(url).json()
] if r.status_code == 200]
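The JSON records returned by this endpoint carry numPartitions and numCachedPartitions fields, so the ratio can also be computed straight from the REST payload. A minimal sketch under the same assumptions about host and port, reusing url and requests from the snippet above:

for rdd in requests.get(url).json():
    # numCachedPartitions / numPartitions mirror the values used above
    fraction = rdd["numCachedPartitions"] / float(rdd["numPartitions"])
    print("{0}: {1:.0%} cached".format(rdd["name"], fraction))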
Source: https://stackoverflow.com/questions/42003533/is-there-an-api-function-to-display-fraction-cached-for-an-rdd