Question
I have a DataFrame with 3 columns: Number (Integer), Name (String), Color (String). Below is the result of df.show after loading the CSV with a repartition option.
val df = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").option("encoding", "UTF-8").load(fileName).repartition(5)
+------+------+------+
|Number| Name| Color|
+------+------+------+
| 4|Orange|Orange|
| 3| Apple| Green|
| 1| Apple| Red|
| 2|Banana|Yellow|
| 5| Apple| Red|
+------+------+------+
My objective is to create a list of strings, one per row, by taking a common dynamic string (passed as a parameter to the method) and replacing its tokens with the corresponding column values. For example: commonDynamicString = "Column.Name with Column.Color color"
In this string, the tokens are Column.Name and Column.Color. I need to replace them, for every row, with the values of the respective columns. Note: this string can change dynamically, so hardcoding the replacements won't work.
I don't want to use RDDs unless there is no other option with DataFrames.
Below are the approaches I tried, but neither achieved my objective.
Option 1:
df.foreach(t => {
  // Build the final string for this row by substituting each token
  val finalValue = commonString.replace("Column.Number", t.getAs[Any]("Number").toString)
    .replace("Column.Name", t.getAs[String]("Name"))
    .replace("Column.Color", t.getAs[String]("Color"))
  println("finalValue: " + finalValue)
})
With this approach, finalValue prints as expected. However, I cannot build a ListBuffer here or pass the final strings as a list to another function, because foreach returns Unit and Spark throws an error.
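As a side note, here is a minimal sketch of one way around this, assuming the result fits in driver memory and reusing the commonString variable from the snippet above (the finalList name is illustrative): foreach closures run on the executors and cannot populate a driver-side ListBuffer, so collect the rows to the driver first and do the replacement locally.
// Sketch: collect the rows to the driver (assumes the data fits in driver memory),
// then build the list of final strings locally instead of inside foreach.
val finalList: List[String] = df.collect().toList.map { t =>
  commonString
    .replace("Column.Number", t.getAs[Any]("Number").toString)
    .replace("Column.Name", t.getAs[String]("Name"))
    .replace("Column.Color", t.getAs[String]("Color"))
}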
Option 2: I am thinking about this option but would need some guidance to understand whether foldLeft, window, or any other Spark function can be used to create a 4th column called "Final" via withColumn, with a UDF that extracts all the tokens by regex pattern matching - "Column.\w+" - and performs the replacement for each token? (A sketch of this UDF idea follows the expected output below.)
+------+------+------+--------------------------+
|Number| Name| Color| Final |
+------+------+------+--------------------------+
|     4|Orange|Orange|Orange with Orange color  |
| 3| Apple| Green|Apple with Green color |
| 1| Apple| Red|Apple with Red color |
| 2|Banana|Yellow|Banana with Yellow color |
| 5| Apple| Red|Apple with Red color |
+------+------+------+--------------------------+
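For reference, a hedged sketch of how that UDF idea could look; the helper names (replaceTokens, valueMap) are illustrative, not from any library, and it assumes every token names an existing column:
import scala.util.matching.Regex
import org.apache.spark.sql.functions._

val tokenRegex: Regex = """Column\.(\w+)""".r
val commonDynamicString = "Column.Name with Column.Color color"

// UDF: replace every "Column.<name>" token in the statement with the
// row's value for that column, supplied as a name -> value map.
val replaceTokens = udf { (stmt: String, values: Map[String, String]) =>
  tokenRegex.replaceAllIn(stmt, m => Regex.quoteReplacement(values(m.group(1))))
}

// Pack the referenced columns into a map column and apply the UDF per row.
val tokens = tokenRegex.findAllMatchIn(commonDynamicString).map(_.group(1)).toSeq.distinct
val valueMap = map(tokens.flatMap(t => Seq(lit(t), col(t).cast("string"))): _*)

df.withColumn("Final", replaceTokens(lit(commonDynamicString), valueMap))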
Can someone help me with this problem, and also let me know whether I am thinking in the right direction for using Spark to handle large datasets?
Thanks!
Answer 1:
If I understand your requirement correctly, you can create a method, say, parseStatement, which takes a String-type statement and returns a Column, with the following steps:
- Parse the input statement to count the number of tokens
- Generate a Regex pattern in the form of ^(.*?)(token1)(.*?)(token2) ... (.*?)$
- Apply pattern matching to assemble a colList consisting of lit(g1), col(g2), lit(g3), col(g4), ..., where the g?s are the extracted Regex groups
- Concatenate the Column-type items
Here's the sample code:
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def parseStatement(stmt: String): Column = {
  val token = "Column."
  val tokenPattern = """Column\.(\w+)"""
  val literalPattern = "(.*?)"

  // Count the token occurrences in the statement
  val colCount = stmt.sliding(token.length).count(_ == token)

  // Build the alternating literal/token pattern: (.*?)Column\.(\w+)(.*?)...(.*?)
  val pattern = (0 to colCount * 2).map {
    case i if i % 2 == 0 => literalPattern
    case _               => tokenPattern
  }.mkString

  // Extract the Regex groups and map them to lit() (literals) or col() (column names)
  val colList = ("^" + pattern + "$").r.findAllIn(stmt).
    matchData.toList.flatMap(_.subgroups).
    zipWithIndex.map {
      case (g, i) if i % 2 == 0 => lit(g)
      case (g, _)               => col(g)
    }

  concat(colList: _*)
}
val df = Seq(
  (4, "Orange", "Orange"),
  (3, "Apple", "Green"),
  (1, "Apple", "Red"),
  (2, "Banana", "Yellow"),
  (5, "Apple", "Red")
).toDF("Number", "Name", "Color")

val statement = "Column.Name with Column.Color color"

df.withColumn("Final", parseStatement(statement)).
  show(false)
// +------+------+------+------------------------+
// |Number|Name |Color |Final |
// +------+------+------+------------------------+
// |4 |Orange|Orange|Orange with Orange color|
// |3 |Apple |Green |Apple with Green color |
// |1 |Apple |Red |Apple with Red color |
// |2 |Banana|Yellow|Banana with Yellow color|
// |5 |Apple |Red |Apple with Red color |
// +------+------+------+------------------------+
Note that concat takes Column-type parameters, hence the need for col() for column values and lit() for literals.
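For instance, a one-line illustration of that point:
// Plain strings must be wrapped in lit(); column values are referenced with col()
df.select(concat(col("Name"), lit(" with "), col("Color"), lit(" color"))).show(false)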
Source: https://stackoverflow.com/questions/52239473/spark-dataframe-replace-tokens-of-a-common-string-with-column-values-for-each