How to use NOT IN clause in filter condition in spark

一整个雨季 2021-02-04 06:09

I want to filter a column of an RDD source:

val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from


        
3 Answers
  •  抹茶落季
    2021-02-04 06:59

    Since your code isn't reproducible, here is a small spark-sql example of how to express select * from t where id not in (...):

    // create a DataFrame for a range 'id' from 1 to 9.
    scala> val df = spark.range(1,10).toDF
    df: org.apache.spark.sql.DataFrame = [id: bigint]
    
    // values to exclude
    scala> val f = Seq(5,6,7)
    f: Seq[Int] = List(5, 6, 7)
    
    // select * from df where id is not in the values to exclude
    scala> df.filter(!col("id").isin(f: _*)).show
    +---+                                                                           
    | id|
    +---+
    |  1|
    |  2|
    |  3|
    |  4|
    |  8|
    |  9|
    +---+
    
    // select * from df where id is in the values to exclude
    scala> df.filter(col("id").isin(f: _*)).show
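
Since the question asks about a NOT IN clause specifically, the same filter can also be written as plain SQL against a temporary view. This is only a minimal sketch: the view name `t`, the app name and the local master are assumptions made for illustration, and in spark-shell `getOrCreate()` simply returns the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell this reuses the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("not-in-sql").getOrCreate()

// Same data as above: ids 1 to 9, registered under the hypothetical view name "t".
spark.range(1, 10).toDF("id").createOrReplaceTempView("t")

// Values to exclude, rendered into the NOT IN list.
val exclude = Seq(5, 6, 7)
val kept = spark.sql(s"SELECT id FROM t WHERE id NOT IN (${exclude.mkString(", ")})")

// Collect the surviving ids to the driver for inspection.
val keptIds = kept.collect().map(_.getLong(0)).sorted.toSeq
println(keptIds)
```

Note that under SQL semantics a NULL inside the NOT IN list makes the predicate never evaluate to true, so no rows would be returned; keep the exclusion list free of NULLs.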
    

    Here is the RDD version of NOT IN:

    scala> val rdd = sc.parallelize(1 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
    
    scala> val f = Seq(5,6,7)
    f: Seq[Int] = List(5, 6, 7)
    
    scala> val rdd2 = rdd.filter(x => !f.contains(x))
    rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at filter at <console>:28
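
If the exclusion list grows beyond a handful of values, one possible variation of the RDD filter above is to hold the values in a `Set` and broadcast it. This is a sketch under assumptions, not part of the original answer: the names `excluded` and `remaining`, the app name and the local master are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("rdd-not-in").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 10)

// A Set gives constant-time membership checks; broadcasting ships it to each executor once.
val excluded = sc.broadcast(Set(5, 6, 7))

// Keep only the elements that are NOT in the excluded set.
val remaining = rdd.filter(x => !excluded.value.contains(x)).collect().sorted.toSeq
println(remaining)
```

For a three-element list like the one in the answer, the plain `Seq` closure is perfectly adequate; the broadcast only starts to matter for larger lookup collections.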
    

    Nevertheless, I still believe this is overkill, since you are already using spark-sql.

    It seems that in your case you are actually dealing with two DataFrames, so the solutions mentioned above don't apply directly.

    You can use the left anti join approach:

    scala> val source = spark.read.format("csv").load("source.file")
    source: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
    
    scala> val destination = spark.read.format("csv").load("destination.file")
    destination: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
    
    scala> source.show
    +---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
    |_c0|               _c1|     _c2|       _c3|            _c4|_c5|_c6|       _c7|  _c8|      _c9|        _c10|
    +---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
    |  1|        Ravi kumar|   Ravi |     kumar|           MSO |  1|  M|17-01-1994| 74.5| 24000.78|    Alabama |
    |  2|Shekhar shudhanshu| Shekhar|shudhanshu|      Manulife |  2|  M|18-01-1994|76.34|   250000|     Alaska |
    |  3|Preethi Narasingam| Preethi|Narasingam|        Retail |  3|  F|19-01-1994|77.45|270000.01|    Arizona |
    |  4|     Abhishek Nair|Abhishek|      Nair|       Banking |  4|  M|20-01-1994|78.65|   345000|   Arkansas |
    |  5|        Ram Sharma|     Ram|    Sharma|Infrastructure |  5|  M|21-01-1994|79.12|    45000| California |
    |  6|   Chandani Kumari|Chandani|    Kumari|          BNFS |  6|  F|22-01-1994|80.13| 43000.02|   Colorado |
    |  7|      Balaji Kumar|  Balaji|     Kumar|           MSO |  1|  M|23-01-1994|81.33|  1234678|Connecticut |
    |  8|  Naveen Shekrappa|  Naveen| Shekrappa|      Manulife |  2|  M|24-01-1994|  100|   789414|   Delaware |
    |  9|     Milind Chavan|  Milind|    Chavan|        Retail |  3|  M|25-01-1994|83.66|   245555|    Florida |
    | 10|      Raghu Rajeev|   Raghu|    Rajeev|       Banking |  4|  M|26-01-1994|87.65|   235468|     Georgia|
    +---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
    
    
    scala> destination.show
    +---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
    |_c0|                _c1|     _c2|       _c3|            _c4|_c5|_c6|       _c7|  _c8|      _c9|        _c10|
    +---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
    |  1|         Ravi kumar|   Revi |     kumar|           MSO |  1|  M|17-01-1994| 74.5| 24000.78|    Alabama |
    |  1|        Ravi1 kumar|   Revi |     kumar|           MSO |  1|  M|17-01-1994| 74.5| 24000.78|    Alabama |
    |  1|        Ravi2 kumar|   Revi |     kumar|           MSO |  1|  M|17-01-1994| 74.5| 24000.78|    Alabama |
    |  2| Shekhar shudhanshu| Shekhar|shudhanshu|      Manulife |  2|  M|18-01-1994|76.34|   250000|     Alaska |
    |  3|Preethi Narasingam1| Preethi|Narasingam|        Retail |  3|  F|19-01-1994|77.45|270000.01|    Arizona |
    |  4|     Abhishek Nair1|Abhishek|      Nair|       Banking |  4|  M|20-01-1994|78.65|   345000|   Arkansas |
    |  5|         Ram Sharma|     Ram|    Sharma|Infrastructure |  5|  M|21-01-1994|79.12|    45000| California |
    |  6|    Chandani Kumari|Chandani|    Kumari|          BNFS |  6|  F|22-01-1994|80.13| 43000.02|   Colorado |
    |  7|       Balaji Kumar|  Balaji|     Kumar|           MSO |  1|  M|23-01-1994|81.33|  1234678|Connecticut |
    |  8|   Naveen Shekrappa|  Naveen| Shekrappa|      Manulife |  2|  M|24-01-1994|  100|   789414|   Delaware |
    |  9|      Milind Chavan|  Milind|    Chavan|        Retail |  3|  M|25-01-1994|83.66|   245555|    Florida |
    | 10|       Raghu Rajeev|   Raghu|    Rajeev|       Banking |  4|  M|26-01-1994|87.65|   235468|     Georgia|
    +---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
    

    You'll just need to do the following:

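    // rows of source whose _c0 has no match in destination
    scala> val res1 = source.join(destination, Seq("_c0"), "leftanti")
    
    // rows of destination whose _c0 has no match in source
    scala> val res2 = destination.join(source, Seq("_c0"), "leftanti")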
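
To make the behaviour of the two anti joins concrete, here is a small self-contained sketch on made-up data. The miniature tables, the names `onlyInSource` and `onlyInDestination`, the app name and the local master are all assumptions for illustration and are not the CSV files shown above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("left-anti").getOrCreate()
import spark.implicits._

// Hypothetical miniature tables; only the key column `_c0` takes part in the join.
val source = Seq((1, "Ravi"), (2, "Shekhar"), (3, "Preethi")).toDF("_c0", "_c1")
val destination = Seq((2, "Shekhar"), (3, "Preethi1"), (4, "Abhishek")).toDF("_c0", "_c1")

// Rows of source whose _c0 does not appear in destination.
val onlyInSource = source.join(destination, Seq("_c0"), "leftanti")
// Rows of destination whose _c0 does not appear in source.
val onlyInDestination = destination.join(source, Seq("_c0"), "leftanti")

// Collect just the keys to the driver for inspection.
val sourceOnlyKeys = onlyInSource.select("_c0").as[Int].collect().sorted.toSeq
val destinationOnlyKeys = onlyInDestination.select("_c0").as[Int].collect().sorted.toSeq
println(sourceOnlyKeys)
println(destinationOnlyKeys)
```

Because the join compares `_c0` only, the row with key 3 is not reported even though its second column differs between the two tables. If rows should count as different whenever any column differs, one option is to list every column in the join keys, or to use `source.except(destination)` instead.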
    

    It's the same logic I mentioned in my answer here.
