Correlated subquery column in Spark SQL is not allowed as part of a non-equality predicate

没有蜡笔的小新 2021-01-27 10:11

I am trying to write a subquery in the WHERE clause like the one below, but I am getting "Correlated column is not allowed in a non-equality predicate:"
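
For context, this error usually appears when a column from the outer query is referenced inside the subquery with anything other than an equality comparison. A minimal sketch of the failing pattern, with made-up table and column names since the original query was not posted, would look like:

    // Hypothetical illustration only: the table/column names are invented.
    // The correlated column o.order_date is compared with >= inside the subquery,
    // which Spark rejects with
    // "Correlated column is not allowed in a non-equality predicate".
    spark.sql("""
      SELECT o.id, o.amount
        FROM orders o
       WHERE o.amount > (SELECT avg(p.amount)
                           FROM payments p
                          WHERE p.pay_date >= o.order_date)
    """).show()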

1 Answer
  • 2021-01-27 10:32

    I did this with Scala, so you will need to convert, but in a far easier way I think. I added a key and worked at key level; you can adapt and aggregate that out. The principle is far simpler: no correlated subqueries required, just relational calculus. I used numbers for dates, etc.

    // SCALA
    // Slightly ambiguous on holidays vs. weekends; as you stated, treated as 1.

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // e = key joining the two datasets, d = day number (numbers used for dates),
    // w = weekend flag, h = holiday flag.
    val dfE = Seq(
                  ("NIC", 1, false, false),
                  ("NIC", 2, false, false),
                  ("NIC", 3, true,  false),
                  ("NIC", 4, true,  true),
                  ("NIC", 5, false, false),
                  ("NIC", 6, false, false),
                  ("XYZ", 1, false, true)
                 ).toDF("e", "d", "w", "h")
    //dfE.show(false)

    // Collapse the weekend and holiday flags into a single 0/1 column.
    val dfE2 = dfE.withColumn("wh", when($"w" or $"h", 1).otherwise(0)).drop("w").drop("h")
    //dfE2.show()

    // Assuming more dfD's can exist; pd and dd bound the period of days (inclusive).
    val dfD = Seq(
                  ("NIC", 1, 4,  "k1"),
                  ("NIC", 2, 3,  "k2"),
                  ("NIC", 1, 1,  "k3"),
                  ("NIC", 7, 10, "k4")
                 ).toDF("e", "pd", "dd", "k")
    //dfD.show(false)

    dfE2.createOrReplaceTempView("E2")
    dfD.createOrReplaceTempView("D1")

    // This is done per record; if it should be over identical keys, strip k and aggregate.
    // I added k for checking each entry. The point is that this is far easier:
    // the key acts as a synthetic GROUP BY.

    val q = spark.sql(""" SELECT d1.k, d1.e, d1.pd, d1.dd, sum(e2.wh)
                            FROM D1, E2
                           WHERE D1.e = E2.e
                             AND E2.d >= D1.pd
                             AND E2.d <= D1.dd
                        GROUP BY d1.k, d1.e, d1.pd, d1.dd
                        ORDER BY d1.k, d1.e, d1.pd, d1.dd
                      """)
    q.show
    

    returns:

     +---+---+---+---+-------+
     |  k|  e| pd| dd|sum(wh)|
     +---+---+---+---+-------+
     | k1|NIC|  1|  4|      2|
     | k2|NIC|  2|  3|      1|
     | k3|NIC|  1|  1|      0|
     +---+---+---+---+-------+
    

    I think a simple performance improvement can be made here as well; in fact, no correlated subqueries are required at all.
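
    If you would rather stay in the DataFrame API than go through SQL, the same range join can be written directly. This is only a sketch of the equivalent of the query above, reusing dfE2 and dfD; the q2 and sum_wh names are mine:

    // Equivalent of the SQL above: a non-equi (range) join followed by aggregation.
    val q2 = dfD.as("d1")
      .join(dfE2.as("e2"),
            $"d1.e" === $"e2.e" && $"e2.d" >= $"d1.pd" && $"e2.d" <= $"d1.dd")
      .groupBy($"d1.k", $"d1.e", $"d1.pd", $"d1.dd")
      .agg(sum($"e2.wh").as("sum_wh"))
      .orderBy($"d1.k", $"d1.e", $"d1.pd", $"d1.dd")
    q2.show()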

    You can use AND E2.d BETWEEN D1.pd AND D1.dd instead if you want.
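
    For completeness, the same query with BETWEEN (qBetween is just my name for it) would be:

    // Same query, with the two range predicates folded into a BETWEEN.
    val qBetween = spark.sql(""" SELECT d1.k, d1.e, d1.pd, d1.dd, sum(e2.wh)
                                   FROM D1, E2
                                  WHERE D1.e = E2.e
                                    AND E2.d BETWEEN D1.pd AND D1.dd
                               GROUP BY d1.k, d1.e, d1.pd, d1.dd
                               ORDER BY d1.k, d1.e, d1.pd, d1.dd
                             """)
    qBetween.show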
