Scala — Conditional replace column value of a data frame

Question

DataFrame 1 is what I have now, and I want to write a Scala function to make DataFrame 1 look like DataFrame 2.

Transfer is the parent category; e-Transfer and IMT are its subcategories.

The logic is: if both Transfer and e-Transfer are tagged to the same ID (31898), the category should be e-Transfer only; if Transfer, IMT, and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or only IMT is tagged to an ID (34193), it should stay e-Transfer or IMT.

I'm new to Scala and don't know how to write a good function to do this. Please help!

DataFrame 1                        DataFrame 2
+---------+-------------+          +---------+------------------+
|   ID    | Category    |          |   ID    | Category         |
+---------+-------------+          +---------+------------------+  
|  31898  |   Transfer  |          |  31898  |  e-Transfer      |  
|  31898  |  e-Transfer |          |  32614  |  e-Transfer + IMT|
|  32614  |   Transfer  |  =====>  |  33987  |   Other          |
|  32614  |  e-Transfer |  =====>  |  34193  |  e-Transfer      |
|  32614  |     IMT     |          +---------+------------------+
|  33987  |   Transfer  |  
|  34193  |  e-Transfer |  
+---------+-------------+

Answer 1:

You can group the DataFrame by ID to aggregate Category using collect_set to assemble arrays of categories, and create a new column based on content in the category arrays using array_contains:

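import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and $"..."; assumes a SparkSession named `spark` (pre-imported in spark-shell)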

val df = Seq(
  (31898, "Transfer"),
  (31898, "e-Transfer"),
  (32614, "Transfer"),
  (32614, "e-Transfer"),
  (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer")
).toDF("ID", "Category")

df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
  withColumn( "Category",
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
      "e-Transfer + IMT").otherwise(
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
      "e-Transfer").otherwise(
    when($"CategorySet" === Array("e-Transfer") || $"CategorySet" === Array("IMT"),
      $"CategorySet"(0)).otherwise(
    when($"CategorySet" === Array("Transfer"), "Other")
    )))
  ).
  show(false)
// +-----+---------------------------+----------------+
// |ID   |CategorySet                |Category        |
// +-----+---------------------------+----------------+
// |33987|[Transfer]                 |Other           |
// |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
// |34193|[e-Transfer]               |e-Transfer      |
// |31898|[Transfer, e-Transfer]     |e-Transfer      |
// +-----+---------------------------+----------------+

Your sample data might not cover all cases (e.g. [Transfer, IMT]). The sample code above produces a null Category for any combination that none of the when conditions match. Modify or extend the conditional checks if additional cases turn up.
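
One way to cover the remaining combinations is to state the rule once instead of enumerating each case. The following is a minimal sketch in plain Scala, not part of the original answer. `CategoryRule` and `resolve` are hypothetical names, and the sketch assumes that any ID carrying at least one subcategory should be labelled with its subcategories (so [Transfer, IMT] becomes IMT) and that everything else becomes Other.

```scala
// Hypothetical helper (not from the original answer): derives the final
// label from all categories tagged to a single ID.
object CategoryRule {
  // Subcategories, listed in the order they should appear in a combined label.
  val subcategories: Seq[String] = Seq("e-Transfer", "IMT")

  def resolve(categories: Seq[String]): String = {
    // Keep only the subcategories that are present, in the fixed order above.
    val present = subcategories.filter(s => categories.contains(s))
    // Assumption: subcategories replace the parent "Transfer";
    // an ID with no subcategory at all is labelled "Other".
    if (present.nonEmpty) present.mkString(" + ") else "Other"
  }
}
```

A function like this could be wrapped with `udf` from `org.apache.spark.sql.functions` and applied to the CategorySet column, although the built-in column functions used above are usually the better choice where they are sufficient.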

Source: https://stackoverflow.com/questions/52008179/scala-conditional-replace-column-value-of-a-data-frame

Tags

scala

apache-spark

dataframe

user-defined-functions