I'm trying to perform an operation on my data where a certain value will be mapped to a list of pre-determined values if it matches one of the criteria, or to a fall-through value otherwise.
If you want, you can even use a SQL expression directly:
expr = """
CASE
WHEN user_agent LIKE \'%Android%\' THEN \'mobile\'
WHEN user_agent LIKE \'%Linux%\' THEN \'desktop\'
ELSE \'other_unknown\'
END AS user_agent_type"""
df = sc.parallelize([
    (1, "Android"), (2, "Linux"), (3, "Foo")
]).toDF(["id", "user_agent"])
df.selectExpr("*", expr).show()
## +---+----------+---------------+
## | id|user_agent|user_agent_type|
## +---+----------+---------------+
## | 1| Android| mobile|
## | 2| Linux| desktop|
## | 3| Foo| other_unknown|
## +---+----------+---------------+
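The same SQL string can also be passed through expr from pyspark.sql.functions if you prefer building columns instead of calling selectExpr; a minimal sketch, assuming Spark 1.5+ (the import is aliased so it doesn't shadow the expr string above):

from pyspark.sql.functions import expr as sql_expr

# Reuses the SQL string defined above, including its AS alias
df.select("*", sql_expr(expr)).show()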
Otherwise you can replace it with a combination of when, like, and otherwise:
from pyspark.sql.functions import col, when
from functools import reduce

c = col("user_agent")
# Use the same %...% patterns as the SQL version so LIKE matches substrings
vs = [("%Android%", "mobile"), ("%Linux%", "desktop")]
expr = reduce(
    lambda acc, kv: when(c.like(kv[0]), kv[1]).otherwise(acc),
    vs,
    "other_unknown"
).alias("user_agent_type")
df.select("*", expr).show()
## +---+----------+---------------+
## | id|user_agent|user_agent_type|
## +---+----------+---------------+
## | 1| Android| mobile|
## | 2| Linux| desktop|
## | 3| Foo| other_unknown|
## +---+----------+---------------+
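Since reduce folds from the left and each step wraps the previous accumulator in otherwise, pairs later in vs are tested first. With the two non-overlapping patterns above, the folded expression unrolls to the equivalent chained form (a sketch, shown only for illustration):

from pyspark.sql.functions import col, when

c = col("user_agent")
unrolled = (when(c.like("%Linux%"), "desktop")
            .when(c.like("%Android%"), "mobile")
            .otherwise("other_unknown")
            .alias("user_agent_type"))
df.select("*", unrolled).show()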
You can also add multiple columns in a single select:
from pyspark.sql.functions import lit, current_date

exprs = [c.alias(a) for (a, c) in [
    ('etl_requests_usage', lit('DEV')),
    ('etl_datetime_local', current_date())]]
df.select("*", *exprs)