How to perform a Switch statement with Apache Spark Dataframes (Python)

执笔经年 2021-01-21 21:27

I'm trying to perform an operation on my data where a certain value will be mapped to a list of pre-determined values if it matches one of the criteria, or to a fall-through value otherwise.

1 Answer
  • 2021-01-21 22:01

    If you want, you can even use your SQL expression directly:

    expr = """
        CASE
            WHEN user_agent LIKE '%Android%' THEN 'mobile'
            WHEN user_agent LIKE '%Linux%' THEN 'desktop'
            ELSE 'other_unknown'
        END AS user_agent_type"""
    
    # assumes an active SparkContext (sc) and SparkSession, e.g. in the pyspark shell
    df = sc.parallelize([
        (1, "Android"), (2, "Linux"), (3, "Foo")
    ]).toDF(["id", "user_agent"])
    
    df.selectExpr("*", expr).show()
    ## +---+----------+---------------+
    ## | id|user_agent|user_agent_type|
    ## +---+----------+---------------+
    ## |  1|   Android|         mobile|
    ## |  2|     Linux|        desktop|
    ## |  3|       Foo|  other_unknown|
    ## +---+----------+---------------+
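
    As a side note, the same CASE string can also be evaluated with pyspark.sql.functions.expr inside a plain select instead of selectExpr; a minimal sketch:

    import pyspark.sql.functions as F
    
    # same CASE expression as above, wrapped in F.expr() inside a regular select
    df.select("*", F.expr("""
        CASE
            WHEN user_agent LIKE '%Android%' THEN 'mobile'
            WHEN user_agent LIKE '%Linux%' THEN 'desktop'
            ELSE 'other_unknown'
        END AS user_agent_type""")).show()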
    

    Otherwise, you can replace it with a combination of when, like, and otherwise:

    from pyspark.sql.functions import col, when
    from functools import reduce
    
    c = col("user_agent")
    # include SQL LIKE wildcards so the matching mirrors the CASE expression above
    vs = [("%Android%", "mobile"), ("%Linux%", "desktop")]
    # fold the (pattern, label) pairs into nested when/otherwise calls,
    # starting from the fall-through value
    expr = reduce(
        lambda acc, kv: when(c.like(kv[0]), kv[1]).otherwise(acc), 
        vs, 
        "other_unknown"
    ).alias("user_agent_type")
    
    df.select("*", expr).show()
    
    ## +---+----------+---------------+
    ## | id|user_agent|user_agent_type|
    ## +---+----------+---------------+
    ## |  1|   Android|         mobile|
    ## |  2|     Linux|        desktop|
    ## |  3|       Foo|  other_unknown|
    ## +---+----------+---------------+
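
    If the mapping is small and fixed, you can also skip the reduce and chain when calls directly; a minimal sketch of the same logic (ua_type is just an illustrative name):

    from pyspark.sql.functions import col, when
    
    # equivalent to the reduce version above, spelled out by hand
    ua_type = (when(col("user_agent").like("%Android%"), "mobile")
               .when(col("user_agent").like("%Linux%"), "desktop")
               .otherwise("other_unknown")
               .alias("user_agent_type"))
    
    df.select("*", ua_type).show()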
    

    You can also add multiple columns in a single select:

    from pyspark.sql.functions import lit, current_date
    
    # build a list of column expressions, each with its target alias
    exprs = [c.alias(a) for (a, c) in [
      ('etl_requests_usage', lit('DEV')), 
      ('etl_datetime_local', current_date())]]
    
    df.select("*", *exprs)
    