I would like to understand on which node (driver or worker/executor) the data is stored when the code below runs:
df.cache() // df is a large DataFrame (~200 GB)
Just adding my 2 cents. A Spark DataFrame's cache() loads the data into executor memory, not driver memory, which is what's desired. Here's a snapshot showing roughly 50% of the data loaded after a df.cache().count() I just ran.
cache() persists to memory and disk (for DataFrames the default storage level is MEMORY_AND_DISK), as described by koiralo, and it is lazily evaluated: nothing is materialized until an action runs.
spark.catalog.cacheTable() behaves similarly and can spill to disk; cached data is resilient to node failures because any lost partitions are recomputed from the lineage.
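For what it's worth, here is a short sketch of the laziness and of choosing a storage level explicitly with persist(). It assumes a running SparkSession and the df from the question; the exact levels you pick are illustrative:

```scala
import org.apache.spark.storage.StorageLevel

// cache() is lazy: it only marks the plan for caching.
// The DataFrame default is MEMORY_AND_DISK, so partitions that
// don't fit in executor memory spill to executor-local disk.
df.cache()

// An action materializes the cache on the executors;
// the driver only ever holds the query plan, not the 200 GB.
df.count()

// persist() lets you choose the level explicitly, e.g. serialized
// storage that trades CPU for a smaller memory footprint.
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()

// Free the cached blocks when you're done.
df.unpersist()
```

You can confirm where the blocks live under the Storage tab of the Spark UI: the cached partitions are reported per executor.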
Credit: https://forums.databricks.com/answers/63/view.html