How to force DataFrame evaluation in Spark

感情败类 2020-11-28 15:27

Sometimes (e.g. for testing and benchmarking) I want to force the execution of the transformations defined on a DataFrame. AFAIK calling an action like count does not guarantee that all transformations (e.g. a UDF added with withColumn) are actually executed.

4 Answers
  • 2020-11-28 15:31

    I prefer to use df.write.parquet(...). This does add disk I/O time that you can estimate and subtract out later, but you can be sure that Spark performed every step you expected and did not trick you with lazy evaluation.
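
    A minimal sketch of this approach (the output path and the df/myUDF names are assumed for illustration):

    df.withColumn("test", myUDF($"id"))
      .write
      .mode("overwrite")            // overwrite so repeated benchmark runs don't fail
      .parquet("/tmp/force-eval")   // writing the result out forces every transformation to run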

  • 2020-11-28 15:46

    I guess simply getting the underlying RDD from the DataFrame and triggering an action on it should achieve what you're looking for.

    df.withColumn("test",myUDF($"id")).rdd.count // this gives proper exceptions
    
  • 2020-11-28 15:46

    It's a bit late, but here's the fundamental reason: count does not behave the same on an RDD and on a DataFrame.

    For DataFrames there's an optimization: in some cases you do not need to load the data at all to know how many elements it has (especially in your case, where no data shuffling is involved). Hence, when count is called, the DataFrame will not load any data and will never reach your exception-throwing UDF. You can easily verify this by defining your own DefaultSource and Relation and observing that calling count on the resulting DataFrame always ends up in buildScan with an empty requiredColumns, no matter how many columns you selected (cf. org.apache.spark.sql.sources.interfaces to understand more); a minimal sketch of such a relation follows at the end of this answer. It's actually a very efficient optimization ;-)

    For RDDs, though, there is no such optimization (which is why one should prefer DataFrames whenever possible). Hence count on an RDD executes the whole lineage and returns the sum of the sizes of the iterators over all partitions.

    Calling dataframe.count falls into the first case, while calling dataframe.rdd.count falls into the second, because you built an RDD out of your DataFrame. Note that calling dataframe.cache().count forces the DataFrame to be materialized, since you asked Spark to cache the results (so it needs to load all the data and apply the transformations). But it does have the side effect of caching your data...
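
    As a rough sketch of the experiment described above (the package and class names are made up, and the code targets the Data Source V1 interfaces):

    // example/DefaultSource.scala -- a toy data source that reports column pruning
    package example

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, PrunedScan, RelationProvider}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new LoggingRelation(sqlContext)
    }

    class LoggingRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedScan {
      override def schema: StructType = StructType(Seq(StructField("id", IntegerType)))

      // For spark.read.format("example").load().count(), requiredColumns is typically empty:
      // Spark only needs the number of rows, not any column values.
      override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
        println(s"buildScan(requiredColumns = ${requiredColumns.mkString("[", ", ", "]")})")
        sqlContext.sparkContext.parallelize(1 to 1000).map(Row(_))
      }
    }

    Loading it with spark.read.format("example").load() and calling count() should print an empty requiredColumns, while selecting and collecting a column asks for that column explicitly.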

  • 2020-11-28 15:53

    It appears that df.cache.count is the way to go:

    scala> val myUDF = udf((i:Int) => {if(i==1000) throw new RuntimeException;i})
    myUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))
    
    scala> val df = sc.parallelize(1 to 1000).toDF("id")
    df: org.apache.spark.sql.DataFrame = [id: int]
    
    scala> df.withColumn("test",myUDF($"id")).show(10)
    [rdd_51_0]
    +---+----+
    | id|test|
    +---+----+
    |  1|   1|
    |  2|   2|
    |  3|   3|
    |  4|   4|
    |  5|   5|
    |  6|   6|
    |  7|   7|
    |  8|   8|
    |  9|   9|
    | 10|  10|
    +---+----+
    only showing top 10 rows
    
    scala> df.withColumn("test",myUDF($"id")).count
    res13: Long = 1000
    
    scala> df.withColumn("test",myUDF($"id")).cache.count
    org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (int) => int)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    .
    .
    .
    Caused by: java.lang.RuntimeException
    

    Source
