Where is df.cache() stored?

清歌不尽 2021-02-03 15:35

I would like to understand on which node (driver or worker/executor) the data cached by the code below is stored:

df.cache() //df is a large dataframe (200GB)

And

4 Answers
  •  别跟我提以往
    2021-02-03 16:30

    The cache (or persist) method only marks the DataFrame for caching in memory (or on disk, if necessary, as the other answer says). Nothing is stored until an action is performed on the DataFrame, and even then caching is lazy: if you ultimately read only 100 rows, only the partitions computed to produce those rows are cached. Creating a temporary table and using cacheTable is eager in the sense that it caches the entire table immediately. Which is more performant depends on your situation. One thing I've done with an ordinary DataFrame cache is to call .count() immediately afterwards, which forces the whole DataFrame to be cached and avoids the need to register a temp table.
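
    To make the two approaches above concrete, here is a minimal sketch in Scala. It assumes Spark 2.1 or later running in local mode; the object name CacheDemo, the helper cacheEagerly, the view name demo_table and the small spark.range DataFrame are all made up for illustration and stand in for the real 200GB DataFrame. In either approach the cached partitions are held by the executors that computed them, not by the driver, and the default storage level for a DataFrame is MEMORY_AND_DISK, so partitions that do not fit in executor memory are written to executor-local disk.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheDemo {
  // Hypothetical helper: mark df for caching, then force materialization.
  def cacheEagerly(df: DataFrame): DataFrame = {
    val cached = df.cache() // lazy: only marks the plan for caching
    cached.count()          // action: computes every partition, so all of them get cached
    cached
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-demo")
      .master("local[*]") // assumption: local mode, for demonstration only
      .getOrCreate()

    // Approach 1: cache() followed by an action.
    val df = spark.range(0, 1000000).toDF("id")
    val cached = cacheEagerly(df)
    println(cached.storageLevel) // shows the storage level in effect
    cached.unpersist()           // release the cached partitions on the executors

    // Approach 2: register a temp view and cache it through SQL.
    // The SQL statement CACHE TABLE is eager unless written as CACHE LAZY TABLE.
    df.createOrReplaceTempView("demo_table")
    spark.sql("CACHE TABLE demo_table")
    println(spark.catalog.isCached("demo_table"))
    spark.catalog.uncacheTable("demo_table")

    spark.stop()
  }
}
```

    While the application is running, the Storage tab of the Spark UI lists each cached DataFrame along with how much of it sits in memory and how much on disk, which is a quick way to check where a large cache actually ended up.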
