Transforming Spark SQL AST with extraOptimizations


As you guessed, this fails to work because we assume that the optimizer will not change the results of the query.

Specifically, we cache the schema that comes out of the analyzer (and assume the optimizer does not change it). When translating rows to the external format, we use this cached schema, and thus we end up truncating the columns in the result. If you did more than truncate (e.g. changed datatypes) this might even crash.

As you can see in this notebook, it is in fact producing the result you would expect under the covers. We are planning to open up more hooks at some point in the near future that would let you modify the plan at other phases of query execution. See SPARK-18127 for more details.
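For reference, the failing approach registers an optimizer rule along these lines. This is a minimal, illustrative sketch: the TruncateProjection rule (which drops all but the first column of each projection) and the table t are hypothetical, not taken from the original question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: rewriting projections during optimisation changes the
// plan's output schema after the analyzer schema has already been cached.
object TruncateProjection extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Project(projectList, child) if projectList.length > 1 =>
            Project(projectList.take(1), child)
    }
}

val spark = SparkSession.builder().getOrCreate()
spark.experimental.extraOptimizations = Seq(TruncateProjection)

val df = spark.sql("SELECT a, b FROM t")
df.schema                              // still lists both a and b (cached analyzer schema)
df.queryExecution.optimizedPlan.schema // lists only a (the rewritten plan)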

Michael Armbrust's answer confirmed that this kind of transformation shouldn't be done via optimiser rules.

I've instead used Spark's internal APIs to achieve the transformation I wanted for now. This relies on methods that are package-private in Spark, but we can access them without reflection by placing the relevant logic in the appropriate package. In outline:

// Must be in the spark.sql package to access package-private members
// such as sparkSession.sessionState and Dataset.ofRows.
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.plans.logical.Project

object SQLTransformer {
    def apply(sparkSession: SparkSession, sql: String): DataFrame = {

        // Get the AST: parse the SQL text into an unresolved logical plan.
        val ast = sparkSession.sessionState.sqlParser.parsePlan(sql)

        // Transform the AST.
        val transformedAST = ast match {
            case node: Project =>
                // Modify any top-level projection here (details elided).
                node
            case other => other
        }

        // Create a dataset directly from the transformed AST.
        Dataset.ofRows(sparkSession, transformedAST)
    }
}
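A hypothetical usage from application code (assuming a table t is registered):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = org.apache.spark.sql.SQLTransformer(spark, "SELECT a, b FROM t")
df.show()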

Note that this of course may break with future versions of Spark.
