How to filter based on array value in PySpark?

甜味超标 · 2020-12-16 15:58

My Schema:

    |-- Canonical_URL: string (nullable = true)
    |-- Certifications: array (nullable = true)
    |    |-- element: …

2 Answers
  • 2020-12-16 16:09

    For equality-based queries you can use array_contains:

    # Example DataFrame with an array column
    df = sc.parallelize([(1, [1, 2, 3]), (2, [4, 5, 6])]).toDF(["k", "v"])
    df.createOrReplaceTempView("df")

    # With SQL (spark is the SparkSession that createOrReplaceTempView registers into)
    spark.sql("SELECT * FROM df WHERE array_contains(v, 1)")

    # With DSL
    from pyspark.sql.functions import array_contains
    df.where(array_contains("v", 1))
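
    Both queries return only the rows whose array contains the value. As a quick sanity check, the output for the example data above looks something like this:

    df.where(array_contains("v", 1)).show()
    # +---+---------+
    # |  k|        v|
    # +---+---------+
    # |  1|[1, 2, 3]|
    # +---+---------+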
    

    If you want to use more complex predicates, you'll have to either explode the array or use a UDF, for example something like this:

    from pyspark.sql.types import BooleanType
    from pyspark.sql.functions import udf

    # Build a UDF that returns True if any array element satisfies the predicate f
    def exists(f):
        return udf(lambda xs: any(f(x) for x in xs), BooleanType())

    df.where(exists(lambda x: x > 3)("v"))
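
    Note that this UDF fails when the array column is NULL, because the Python lambda then receives None. A minimal null-safe variant (exists_safe is a hypothetical name, not part of the original answer; it reuses the udf and BooleanType imports above) could look like:

    # Hypothetical null-safe variant: returns False for NULL arrays
    # instead of failing inside the Python lambda
    def exists_safe(f):
        return udf(
            lambda xs: any(f(x) for x in xs) if xs is not None else False,
            BooleanType(),
        )

    df.where(exists_safe(lambda x: x > 3)("v"))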
    

    In Spark 2.4 or later it is also possible to use higher-order functions:

    from pyspark.sql.functions import expr
    
    # Map each element to a boolean, then OR-reduce the resulting array
    df.where(expr("""aggregate(
        transform(v, x -> x > 3),
        false,
        (x, y) -> x or y
    )"""))
    

    or

    df.where(expr("""
        exists(v, x -> x > 3)
    """))
    

    Python wrappers should be available in 3.1 (SPARK-30681).
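
    For illustration, with those wrappers the same predicate can be written directly in the DSL (a sketch, assuming pyspark.sql.functions.exists as added by SPARK-30681):

    # Spark 3.1+: exists(col, f), where f maps each element (a Column) to a boolean Column;
    # aliased on import to avoid clashing with the exists helper defined above
    from pyspark.sql.functions import exists as array_exists

    df.where(array_exists("v", lambda x: x > 3))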

  • 2020-12-16 16:18

    In Spark 2.4 or later you can filter array values using the filter higher-order function in the SQL API.

    https://spark.apache.org/docs/2.4.0/api/sql/index.html#filter

    Here's an example in PySpark that removes all empty strings from an array column:

    from pyspark.sql.functions import expr

    # Keep only the non-empty strings in ArrayColumn
    df = df.withColumn("ArrayColumn", expr("filter(ArrayColumn, x -> x != '')"))
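
    Starting with Spark 3.1, the same operation is also available as a Python wrapper, so expr is not needed (a sketch, assuming pyspark.sql.functions.filter from SPARK-30681):

    # Aliased import to avoid shadowing Python's built-in filter
    from pyspark.sql.functions import filter as array_filter

    df = df.withColumn("ArrayColumn", array_filter("ArrayColumn", lambda x: x != ""))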
    