Filtering DataFrame using the length of a column

耶瑟儿~ 2020-12-02 22:28

I want to filter a DataFrame using a condition related to the length of a column. This question might be very easy, but I didn't find any related question.

3 Answers
  • 2020-12-02 22:50

    @AlbertoBonsanto: the code below filters based on the array size:

    import org.apache.spark.sql.functions.{size, split}  // split builds the array column; size is used below to filter

    val input = Seq(("a1,a2,a3,a4,a5"), ("a1,a2,a3,a4"), ("a1,a2,a3"), ("a1,a2"), ("a1"))
    val df = sc.parallelize(input).toDF("tokens")
    val tokensArrayDf = df.withColumn("tokens", split($"tokens", ","))
    tokensArrayDf.show
    +--------------------+
    |              tokens|
    +--------------------+
    |[a1, a2, a3, a4, a5]|
    |    [a1, a2, a3, a4]|
    |        [a1, a2, a3]|
    |            [a1, a2]|
    |                [a1]|
    +--------------------+
    
    tokensArrayDf.filter(size($"tokens") > 3).show  // keep rows whose array has more than 3 elements
    +--------------------+
    |              tokens|
    +--------------------+
    |[a1, a2, a3, a4, a5]|
    |    [a1, a2, a3, a4]|
    +--------------------+
    
  • 2020-12-02 22:52

    Here is an example for a String column in Scala:

    import org.apache.spark.sql.functions.length  // length gives the string length of a column

    val stringData = Seq(("Maheswara"), ("Mokshith"))
    val df = sc.parallelize(stringData).toDF      // the single String column is named "value" by default
    df.where(length($"value") <= 8).show
    +--------+
    |   value|
    +--------+
    |Mokshith|
    +--------+
    df.withColumn("length", length($"value")).show
    +---------+------+
    |    value|length|
    +---------+------+
    |Maheswara|     9|
    | Mokshith|     8|
    +---------+------+
    
  • 2020-12-02 23:05

    In Spark >= 1.5 you can use the size function:

    from pyspark.sql.functions import col, size
    
    df = sqlContext.createDataFrame([
        (["L", "S", "Y", "S"],  ),
        (["L", "V", "I", "S"],  ),
        (["I", "A", "N", "A"],  ),
        (["I", "L", "S", "A"],  ),
        (["E", "N", "N", "Y"],  ),
        (["E", "I", "M", "A"],  ),
        (["O", "A", "N", "A"],  ),
        (["S", "U", "S"],  )], 
        ("tokens", ))
    
    df.where(size(col("tokens")) <= 3).show()
    
    ## +---------+
    ## |   tokens|
    ## +---------+
    ## |[S, U, S]|
    ## +---------+
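
    As a small aside (a sketch reusing the df above), size also works outside of where, for example to materialize the element count as a regular column:

    # "n_tokens" is just an arbitrary alias here; size() returns the number of elements in the array column
    df.select(col("tokens"), size(col("tokens")).alias("n_tokens")).show()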
    

    In Spark < 1.5, a UDF should do the trick:

    from pyspark.sql.types import IntegerType
    from pyspark.sql.functions import udf
    
    size_ = udf(lambda xs: len(xs), IntegerType())
    
    df.where(size_(col("tokens")) <= 3).show()
    
    ## +---------+
    ## |   tokens|
    ## +---------+
    ## |[S, U, S]|
    ## +---------+
    

    If you use HiveContext, then the size UDF with raw SQL should work with any version:

    df.registerTempTable("df")
    sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()
    
    ## +--------------------+
    ## |              tokens|
    ## +--------------------+
    ## |ArrayBuffer(S, U, S)|
    ## +--------------------+
    

    For string columns you can use either the udf defined above or the length function (a short sketch of the udf variant follows the example below):

    from pyspark.sql.functions import length
    
    df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
    df.where(length(col("k")) <= 3).show()
    
    ## +---+
    ## |  k|
    ## +---+
    ## |bar|
    ## +---+
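
    The udf route also works for strings, since Python's len applies to strings as well as lists; a minimal sketch reusing the size_ udf and the DataFrame with column k defined above:

    # size_ was defined earlier as udf(lambda xs: len(xs), IntegerType()); len() works on strings too
    df.where(size_(col("k")) <= 3).show()

    ## +---+
    ## |  k|
    ## +---+
    ## |bar|
    ## +---+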
    