Difference between caching mechanisms in Spark SQL

Question

I am trying to wrap my head around various caching mechanisms in Spark SQL. Is there any difference between the following code snippets:

Method 1:

cache table test_cache AS
select x.a, b, c
from x
inner join y
on x.a = y.a;

Method 2:

create temporary view test_cache AS
select x.a, b, c
from x
inner join y
on x.a = y.a;

cache table test_cache;

Since computations in Spark are lazy, will Spark cache the results as soon as the temporary view is created in Method 2? Or will it wait until an action such as collect is applied to it?

Answer 1:

In Spark SQL, caching behaves differently depending on whether you use SQL directly or the DataFrame DSL. With the DSL, caching is lazy, so after calling

my_df.cache()

the data is not cached in memory immediately. Only the caching information is added to the query plan, and the data is cached once an action is called on the DataFrame.

With SQL, on the other hand, as in your example, caching is eager by default. In Method 1, a job runs immediately and the data is put into memory. In Method 2, a job runs when you execute the cache statement:

cache table test_cache;

In SQL, caching can also be made lazy by using the lazy keyword explicitly:

cache lazy table test_cache;

In this case no job runs immediately; the data is put into memory when the first action is called against the table test_cache.

To conclude, both of your methods are equivalent in terms of caching: the data will be cached eagerly once the code block has run.

Source: https://stackoverflow.com/questions/57419003/difference-between-caching-mechanism-in-spark-sql

Tags

apache-spark

pyspark

apache-spark-sql

pyspark-sql