Difference between caching mechanisms in Spark SQL

Submitted by 我是研究僧i on 2019-12-22 11:20:27

Question


I am trying to wrap my head around various caching mechanisms in Spark SQL. Is there any difference between the following code snippets:

Method 1:

cache table test_cache AS
select x.a, b, c
from x
inner join y
on x.a = y.a;

Method 2:

create temporary view test_cache AS
select x.a, b, c
from x
inner join y
on x.a = y.a;

cache table test_cache;

Since computations in Spark are lazy, will Spark cache the results as soon as the temporary view is created in Method 2, or will it wait until an action such as collect is applied to it?


Answer 1:


In Spark SQL, caching behaves differently depending on whether you use SQL directly or the DataFrame DSL. With the DSL, caching is lazy, so after calling

my_df.cache()

the data is not cached in memory immediately; only the caching information is added to the query plan, and the data is cached once an action is called on the DataFrame.
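The lazy DSL behaviour can be sketched in PySpark roughly as follows. This is only an illustration under assumptions: the helper name cache_then_materialize, the demo function, and the spark.range stand-in for my_df are hypothetical and not part of the original question.

```python
def cache_then_materialize(df):
    """Mark a DataFrame for caching, then run an action to populate the cache."""
    # Lazy step: this only marks the query plan for caching; no job runs here.
    df.cache()
    # Action: this triggers a job, and the cache is filled as a side effect.
    return df.count()


def demo():
    # PySpark is imported here so the helper above has no import-time dependency.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("lazy-cache-demo")  # hypothetical application name
        .getOrCreate()
    )
    my_df = spark.range(1000)  # hypothetical stand-in for the asker's DataFrame
    print(cache_then_materialize(my_df))
    spark.stop()
```

Calling demo() in an environment with PySpark installed runs the two steps in order; any action (count, collect, show, and so on) would serve as the trigger in place of count.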

By contrast, when you use SQL directly, as in your example, caching is eager by default. In Method 1, a job runs immediately and the data is loaded into memory. In Method 2, the job runs when you execute the cache statement:

cache table test_cache;

In SQL, caching can also be made lazy by using the lazy keyword explicitly:

cache lazy table test_cache;

In this case no job runs immediately; the data is put into memory once an action is run against the table test_cache.
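To compare the eager and lazy SQL forms from PySpark, something like the sketch below could be used. The helpers cache_table_statement and cache_table are hypothetical names introduced for illustration; the sketch assumes an existing SparkSession passed in as spark and a table or view that is already registered.

```python
def cache_table_statement(table_name, lazy=False):
    """Build the Spark SQL statement that caches a table eagerly or lazily."""
    keyword = "CACHE LAZY TABLE" if lazy else "CACHE TABLE"
    return f"{keyword} {table_name}"


def cache_table(spark, table_name, lazy=False):
    """Run the cache statement and report whether the table is registered as cached."""
    # Eager form: Spark runs a job right away to fill the cache.
    # Lazy form: the cache is filled by the first action that reads the table.
    spark.sql(cache_table_statement(table_name, lazy))
    return spark.catalog.isCached(table_name)
```

For example, after cache_table(spark, "test_cache", lazy=True), a later query such as spark.sql("select count(*) from test_cache").collect() would be the action that actually populates the cache.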

To conclude, both of your methods are equivalent in terms of caching: in each case the data is cached eagerly once the whole block of code has run.



Source: https://stackoverflow.com/questions/57419003/difference-between-caching-mechanism-in-spark-sql
