Correlated subquery column in Spark SQL is not allowed as part of a non-equality predicate

Submitted by Anonymous (unverified) on 2019-12-03 01:42:02

Question:

I am trying to write a subquery in the WHERE clause, like below, but I am getting "Correlated column is not allowed in a non-equality predicate:"

SELECT *,
       holidays
FROM   (SELECT *,
               s.holidays,
               s.entity
        FROM   transit_t tt
        WHERE  (SELECT Count(thedate) AS holidays
                FROM   fact_ent_rt
                WHERE  entity = tt.awborigin
                AND    (Substring(thedate, 1, 10)) BETWEEN (Substring(awbpickupdate, 1, 10)) AND (Substring(deliverydate, 1, 10))
                AND    (nholidayflag = true
                        OR weekendflag = true))) s

Are there any issues with this query? I thought Spark > 2.0 supported subqueries in the WHERE clause. Any suggestions would be appreciated. Thanks.

The input is the pickup date and delivery date from the transit table. We need to find whether a weekend or holiday falls between these dates (this data is available in fact_ent_rt) and take the count of holidays.

The output I am getting is: pyspark.sql.utils.AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate
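
The error appears to come from awbpickupdate and deliverydate (columns of the outer transit_t row) being used inside the subquery in a BETWEEN, i.e. a non-equality predicate, which Spark does not allow for correlated columns. A rough, untested sketch of the same logic rewritten as a plain join over the same tables (using the column names above; a unique key may be needed in the GROUP BY if transit rows can repeat, and entities with no matching rows come out as 0 rather than NA):

SELECT tt.awborigin,
       tt.awbpickupdate,
       tt.deliverydate,
       Count(f.thedate) AS holidays
FROM   transit_t tt
       LEFT JOIN fact_ent_rt f
              ON f.entity = tt.awborigin
             AND Substring(f.thedate, 1, 10)
                 BETWEEN Substring(tt.awbpickupdate, 1, 10)
                     AND Substring(tt.deliverydate, 1, 10)
             AND (f.nholidayflag = true OR f.weekendflag = true)
GROUP  BY tt.awborigin, tt.awbpickupdate, tt.deliverydate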

Sample inputs:

Input 1 (transit_t):

+---------+-------------------+-------------------+
|AWBOrigin|      AWBPickupDate|       DeliveryDate|
+---------+-------------------+-------------------+
|      LON|2018-09-01 08:52:00|2018-09-12 13:57:00|
|      DHA|2018-09-04 11:47:00|2018-09-08 07:30:00|
|      NIC|2009-01-01 01:47:00|2009-01-09 11:37:00|
+---------+-------------------+-------------------+

Input 2 (fact_ent)

------+-------------------+-----------+------------
Entity|            TheDate|WeekendFlag|NHolidayFlag
------+-------------------+-----------+------------
   NIC|2009-01-01 00:00:00|      False|       False
   NIC|2009-01-02 00:00:00|      False|       False
   NIC|2009-01-03 00:00:00|       True|       False
   NIC|2009-01-04 00:00:00|       True|       False
   NIC|2009-01-05 00:00:00|      False|       False
   NIC|2009-01-06 00:00:00|      False|       False
   NIC|2009-01-07 00:00:00|      False|       False
   NIC|2009-01-08 00:00:00|      False|       False
   NIC|2009-01-09 00:00:00|      False|       False
   NIC|2009-01-10 00:00:00|       True|       False
   NIC|2009-01-11 00:00:00|       True|       False
   NIC|2009-01-12 00:00:00|      False|       False
   NIC|2009-01-13 00:00:00|      False|       False
   NIC|2009-01-14 00:00:00|      False|       False
   NIC|2009-01-15 00:00:00|      False|       False
   NIC|2009-01-16 00:00:00|      False|       False
   NIC|2009-01-17 00:00:00|       True|       False
   NIC|2009-01-18 00:00:00|       True|       False
   NIC|2009-01-19 00:00:00|      False|       False
   NIC|2009-01-20 00:00:00|      False|       False
------+-------------------+-----------+------------

Expected output:

+---------+-------------------+-------------------+--------+
|AWBOrigin|      AWBPickupDate|       DeliveryDate|Holidays|
+---------+-------------------+-------------------+--------+
|      LON|2018-09-01 08:52:00|2018-09-12 13:57:00|      NA|
|      DHA|2018-09-04 11:47:00|2018-09-08 07:30:00|      NA|
|      NIC|2009-01-01 01:47:00|2009-01-09 11:37:00|       2|
+---------+-------------------+-------------------+--------+

Answer 1:

I did this with Scala, so you will need to convert it, but I think it is a far easier way. I added a key and worked at key level; you can adapt that and aggregate it out. The principle is far simpler: no correlated subqueries required, just relational calculus. I used numbers for dates, etc.

// SCALA
// Slightly ambiguous on hols vs. weekend; as you stated, treated as 1.
import spark.implicits._
import org.apache.spark.sql.functions._

// Calendar data: entity, day number, weekend flag, holiday flag
val dfE = Seq(
  ("NIC", 1, false, false),
  ("NIC", 2, false, false),
  ("NIC", 3, true,  false),
  ("NIC", 4, true,  true),
  ("NIC", 5, false, false),
  ("NIC", 6, false, false),
  ("XYZ", 1, false, true)
).toDF("e", "d", "w", "h")
//dfE.show(false)

// Collapse weekend/holiday into a single 0/1 column "wh"
val dfE2 = dfE.withColumn("wh", when($"w" or $"h", 1) otherwise (0)).drop("w").drop("h")
//dfE2.show()

// Transit data: entity, pickup day, delivery day, synthetic key (assuming more dfD's can exist)
val dfD = Seq(
  ("NIC", 1, 4,  "k1"),
  ("NIC", 2, 3,  "k2"),
  ("NIC", 1, 1,  "k3"),
  ("NIC", 7, 10, "k4")
).toDF("e", "pd", "dd", "k")
//dfD.show(false)

dfE2.createOrReplaceTempView("E2")
dfD.createOrReplaceTempView("D1")

// This is done per record; if over identical keys, strip k and aggregate instead.
// I added k for checking each entry. The point is it is far easier. The key means a synthetic group-by.
val q = sqlContext.sql("""
  SELECT d1.k, d1.e, d1.pd, d1.dd, sum(e2.wh)
  FROM   D1, E2
  WHERE  D1.e = E2.e
  AND    E2.d >= D1.pd
  AND    E2.d <= D1.dd
  GROUP BY d1.k, d1.e, d1.pd, d1.dd
  ORDER BY d1.k, d1.e, d1.pd, d1.dd
""")
q.show

returns:

+---+---+---+---+-------+
|  k|  e| pd| dd|sum(wh)|
+---+---+---+---+-------+
| k1|NIC|  1|  4|      2|
| k2|NIC|  2|  3|      1|
| k3|NIC|  1|  1|      0|
+---+---+---+---+-------+
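
Note that the inner join drops k4 (NIC, 7, 10), because E2 has no days in that range. If rows with no matching calendar days should still appear (like the NA rows in your expected output), one option is a LEFT JOIN variant; a sketch against the same temp views:

SELECT d1.k, d1.e, d1.pd, d1.dd, sum(e2.wh) AS wh
FROM   D1
       LEFT JOIN E2
              ON D1.e = E2.e
             AND E2.d >= D1.pd
             AND E2.d <= D1.dd
GROUP BY d1.k, d1.e, d1.pd, d1.dd
ORDER BY d1.k, d1.e, d1.pd, d1.dd

Here k4 shows up with a NULL sum, which you can coalesce to 0 or map to NA as needed.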

I think a simple performance improvement can be made. No correlated subquery is required at all.

You can use AND E2.d BETWEEN D1.pd AND D1.dd if you want, as sketched below.
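
That is, the join condition written with BETWEEN; a sketch against the same temp views:

SELECT d1.k, d1.e, d1.pd, d1.dd, sum(e2.wh)
FROM   D1
       JOIN E2
         ON D1.e = E2.e
        AND E2.d BETWEEN D1.pd AND D1.dd
GROUP BY d1.k, d1.e, d1.pd, d1.dd
ORDER BY d1.k, d1.e, d1.pd, d1.dd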


