I would like to understand on which node (driver or worker/executor) the data cached by the code below is stored:
df.cache() //df is a large dataframe (200GB)
And
The cache (or persist) method marks the DataFrame for caching in memory (or on disk, if necessary, as the other answer says), but the caching happens only once an action is performed on the DataFrame, and only lazily: if you ultimately read only 100 rows, roughly only the partitions that were read to produce those rows are cached. Creating a temporary table and using cacheTable is eager in the sense that it will cache the entire table immediately. Which is more performant depends on your situation. One thing that I've done with an ordinary DataFrame cache is to call .count() immediately afterwards, which forces the whole DataFrame to be cached and obviates the need to register a temp table and such.
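
The cache-then-count trick described above might look roughly like the following. This is only a minimal sketch, not the answerer's actual code: the object name CacheExample, the helper cacheEagerly, the local master setting and the small spark.range DataFrame standing in for the 200GB one are all assumptions made for illustration, and it assumes Spark 2.1 or later (for Dataset.storageLevel).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheExample {
  // Hypothetical helper: mark df for caching, then force materialization.
  def cacheEagerly(df: DataFrame): Long = {
    df.cache()   // lazy: only marks the DataFrame for caching
    df.count()   // action: scans every partition, so the whole DataFrame gets cached
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-example")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    // Hypothetical stand-in for the large DataFrame from the question
    val df = spark.range(0, 1000).toDF("id")

    val rows = cacheEagerly(df)
    println(s"Cached $rows rows with storage level: ${df.storageLevel}")

    // Later actions on df reuse the cached partitions instead of recomputing them
    df.filter("id % 2 = 0").count()

    df.unpersist() // release the cached data when it is no longer needed
    spark.stop()
  }
}
```

Because count() has to touch every partition, it is a cheap way to populate the cache in full before the queries that actually matter run.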