问题
Let's assume that I have the following Dataset
:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-13| 300|
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-100| 870|
+-----------+----------+
Where productCode
is of String
type and the amount
is an Int
.
If one will try to order this by productCode
the result will be (and this is expected because of nature of String
comparison):
def orderProducts(product: Dataset[Product]): Dataset[Product] = {
product.orderBy("productCode")
}
// Output:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-10| 35|
| XX-100| 870|
| XX-13| 300|
| XX-2| 410|
| XX-9| 50|
+-----------+----------+
How can I get an output ordered by Integer
part of the productCode
like below considering Dataset
API?
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-13| 300|
| XX-100| 870|
+-----------+----------+
回答1:
Use the expression in the orderBy. Check this out:
scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]
scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
| XX-1|250|
| XX-2|410|
| XX-9| 50|
| XX-10| 35|
| XX-13|300|
| XX-100|870|
+-----------+---+
scala>
With window functions, you could do like
scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1 |250|1 |
|XX-2 |410|2 |
|XX-9 |50 |3 |
|XX-10 |35 |4 |
|XX-13 |300|5 |
|XX-100 |870|6 |
+-----------+---+----+
scala>
Note that spark complains of moving all data to single partition.
来源:https://stackoverflow.com/questions/53707368/sorting-numeric-string-in-spark-dataset