How to perform a Switch statement with Apache Spark Dataframes (Python)

执笔经年 2021-01-21 21:27

I'm trying to perform an operation on my data where a certain value will be mapped to a list of pre-determined values if it matches one of the criteria, or to a fall-through value otherwise.

1 Answer
  • 2021-01-21 22:01

    If you want, you can even use your SQL expression directly:

    expr = """
        CASE
            WHEN user_agent LIKE '%Android%' THEN 'mobile'
            WHEN user_agent LIKE '%Linux%' THEN 'desktop'
            ELSE 'other_unknown'
        END AS user_agent_type"""
    
    # assumes an active SparkContext (sc) and SparkSession, e.g. in the pyspark shell
    df = sc.parallelize([
        (1, "Android"), (2, "Linux"), (3, "Foo")
    ]).toDF(["id", "user_agent"])
    
    df.selectExpr("*", expr).show()
    ## +---+----------+---------------+
    ## | id|user_agent|user_agent_type|
    ## +---+----------+---------------+
    ## |  1|   Android|         mobile|
    ## |  2|     Linux|        desktop|
    ## |  3|       Foo|  other_unknown|
    ## +---+----------+---------------+
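
    As a side note, the same CASE string can also be evaluated with pyspark.sql.functions.expr inside a plain select instead of selectExpr; a minimal sketch:

    import pyspark.sql.functions as F
    
    # same CASE expression as above, wrapped in F.expr() inside a regular select
    df.select("*", F.expr("""
        CASE
            WHEN user_agent LIKE '%Android%' THEN 'mobile'
            WHEN user_agent LIKE '%Linux%' THEN 'desktop'
            ELSE 'other_unknown'
        END AS user_agent_type""")).show()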
    

    Otherwise, you can replace it with a combination of when, like, and otherwise:

    from pyspark.sql.functions import col, when
    from functools import reduce
    
    c = col("user_agent")
    # include SQL LIKE wildcards so the matching mirrors the CASE expression above
    vs = [("%Android%", "mobile"), ("%Linux%", "desktop")]
    # fold the (pattern, label) pairs into nested when/otherwise calls,
    # starting from the fall-through value
    expr = reduce(
        lambda acc, kv: when(c.like(kv[0]), kv[1]).otherwise(acc), 
        vs, 
        "other_unknown"
    ).alias("user_agent_type")
    
    df.select("*", expr).show()
    
    ## +---+----------+---------------+
    ## | id|user_agent|user_agent_type|
    ## +---+----------+---------------+
    ## |  1|   Android|         mobile|
    ## |  2|     Linux|        desktop|
    ## |  3|       Foo|  other_unknown|
    ## +---+----------+---------------+
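
    If the mapping is small and fixed, you can also skip the reduce and chain when calls directly; a minimal sketch of the same logic (ua_type is just an illustrative name):

    from pyspark.sql.functions import col, when
    
    # equivalent to the reduce version above, spelled out by hand
    ua_type = (when(col("user_agent").like("%Android%"), "mobile")
               .when(col("user_agent").like("%Linux%"), "desktop")
               .otherwise("other_unknown")
               .alias("user_agent_type"))
    
    df.select("*", ua_type).show()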
    

    You can also add multiple columns in a single select:

    from pyspark.sql.functions import lit, current_date
    
    # build a list of column expressions, each with its target alias
    exprs = [c.alias(a) for (a, c) in [
      ('etl_requests_usage', lit('DEV')), 
      ('etl_datetime_local', current_date())]]
    
    df.select("*", *exprs)
    