Question
I have a DataFrame with 3 columns: Number (Integer), Name (String), Color (String). Below is the result of df.show after loading the CSV with a repartition option.
val df = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").option("encoding", "UTF-8").load(fileName).repartition(5)
+------+------+------+
|Number| Name| Color|
+------+------+------+
| 4|Orange|Orange|
| 3| Apple| Green|
| 1| Apple| Red|
| 2|Banana|Yellow|
| 5| Apple| Red|
+------+------+------+
My objective is to create a list of strings, one per row, by taking a common dynamic string (passed as a parameter to the method) and replacing its tokens with the corresponding column values. For example: commonDynamicString = "Column.Name with Column.Color color"
In this string, the tokens are Column.Name and Column.Color. I need to replace them, for every row, with the values of the respective columns. Note: this string can change dynamically, so hardcoding the replacements won't work.
I don't want to use RDDs unless there is no other option with DataFrames.
Below are the approaches I tried, but neither achieved my objective.
Option 1:
df.foreach(t => {
  // Build the final string for this row by substituting each token
  val finalValue = commonString.replace("Column.Number", t.getAs[Any]("Number").toString)
    .replace("Column.Name", t.getAs[String]("Name"))
    .replace("Column.Color", t.getAs[String]("Color"))
  println("finalValue: " + finalValue)
})
With this approach, finalValue prints as expected. However, I cannot build a ListBuffer here or pass the final strings as a list to another function, because foreach returns Unit and Spark throws an error.
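As a side note, here is a minimal sketch of one way around this, assuming the result fits in driver memory and reusing the commonString variable from the snippet above (the finalList name is illustrative): foreach closures run on the executors and cannot populate a driver-side ListBuffer, so collect the rows to the driver first and do the replacement locally.
// Sketch: collect the rows to the driver (assumes the data fits in driver memory),
// then build the list of final strings locally instead of inside foreach.
val finalList: List[String] = df.collect().toList.map { t =>
  commonString
    .replace("Column.Number", t.getAs[Any]("Number").toString)
    .replace("Column.Name", t.getAs[String]("Name"))
    .replace("Column.Color", t.getAs[String]("Color"))
}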
Option 2: I am thinking about this option but would need some guidance to understand whether foldLeft, window, or any other Spark function can be used to create a 4th column called "Final" via withColumn, with a UDF that extracts all the tokens by regex pattern matching - "Column.\w+" - and performs the replacement for each token? (A sketch of this UDF idea follows the expected output below.)
+------+------+------+--------------------------+
|Number| Name| Color| Final |
+------+------+------+--------------------------+
|     4|Orange|Orange|Orange with Orange color  |
| 3| Apple| Green|Apple with Green color |
| 1| Apple| Red|Apple with Red color |
| 2|Banana|Yellow|Banana with Yellow color |
| 5| Apple| Red|Apple with Red color |
+------+------+------+--------------------------+
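For reference, a hedged sketch of how that UDF idea could look; the helper names (replaceTokens, valueMap) are illustrative, not from any library, and it assumes every token names an existing column:
import scala.util.matching.Regex
import org.apache.spark.sql.functions._

val tokenRegex: Regex = """Column\.(\w+)""".r
val commonDynamicString = "Column.Name with Column.Color color"

// UDF: replace every "Column.<name>" token in the statement with the
// row's value for that column, supplied as a name -> value map.
val replaceTokens = udf { (stmt: String, values: Map[String, String]) =>
  tokenRegex.replaceAllIn(stmt, m => Regex.quoteReplacement(values(m.group(1))))
}

// Pack the referenced columns into a map column and apply the UDF per row.
val tokens = tokenRegex.findAllMatchIn(commonDynamicString).map(_.group(1)).toSeq.distinct
val valueMap = map(tokens.flatMap(t => Seq(lit(t), col(t).cast("string"))): _*)

df.withColumn("Final", replaceTokens(lit(commonDynamicString), valueMap))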
Can someone help me with this problem, and also let me know whether I am thinking in the right direction for using Spark to handle large datasets?
Thanks!
Answer 1:
If I understand your requirement correctly, you can create a method, say, parseStatement, which takes a String-type statement and returns a Column, with the following steps:
- Parse the input statement to count the number of tokens
- Generate a Regex pattern in the form of ^(.*?)(token1)(.*?)(token2) ... (.*?)$
- Apply pattern matching to assemble a colList consisting of lit(g1), col(g2), lit(g3), col(g4), ..., where the g?s are the extracted Regex groups
- Concatenate the Column-type items
Here's the sample code:
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def parseStatement(stmt: String): Column = {
  val token = "Column."
  val tokenPattern = """Column\.(\w+)"""
  val literalPattern = "(.*?)"

  // Count the token occurrences in the statement
  val colCount = stmt.sliding(token.length).count(_ == token)

  // Build the alternating literal/token pattern: (.*?)Column\.(\w+)(.*?)...(.*?)
  val pattern = (0 to colCount * 2).map {
    case i if i % 2 == 0 => literalPattern
    case _               => tokenPattern
  }.mkString

  // Extract the Regex groups and map them to lit() (literals) or col() (column names)
  val colList = ("^" + pattern + "$").r.findAllIn(stmt).
    matchData.toList.flatMap(_.subgroups).
    zipWithIndex.map {
      case (g, i) if i % 2 == 0 => lit(g)
      case (g, _)               => col(g)
    }

  concat(colList: _*)
}
val df = Seq(
  (4, "Orange", "Orange"),
  (3, "Apple", "Green"),
  (1, "Apple", "Red"),
  (2, "Banana", "Yellow"),
  (5, "Apple", "Red")
).toDF("Number", "Name", "Color")

val statement = "Column.Name with Column.Color color"

df.withColumn("Final", parseStatement(statement)).
  show(false)
// +------+------+------+------------------------+
// |Number|Name |Color |Final |
// +------+------+------+------------------------+
// |4 |Orange|Orange|Orange with Orange color|
// |3 |Apple |Green |Apple with Green color |
// |1 |Apple |Red |Apple with Red color |
// |2 |Banana|Yellow|Banana with Yellow color|
// |5 |Apple |Red |Apple with Red color |
// +------+------+------+------------------------+
Note that concat takes Column-type parameters, hence the need for col() for column values and lit() for literals.
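For instance, a one-line illustration of that point:
// Plain strings must be wrapped in lit(); column values are referenced with col()
df.select(concat(col("Name"), lit(" with "), col("Color"), lit(" color"))).show(false)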
Source: https://stackoverflow.com/questions/52239473/spark-dataframe-replace-tokens-of-a-common-string-with-column-values-for-each