How to pivot a Spark DataFrame?

闹比i 2020-11-21 06:43

I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of one column with multiple rows. There is built-in functionality …

10 Answers
  •  礼貌的吻别
    2020-11-21 07:14

    I have solved a similar problem using dataframes with the following steps:

    Create one column per country, holding the row's value when its tag matches that country and 0 otherwise:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    val countries = List("US", "UK", "Can")

    // Emit the row's value when its tag matches the target country, otherwise 0.
    val countryValue = udf { (countryToCheck: String, countryInRow: String, value: Long) =>
      if (countryToCheck == countryInRow) value else 0L
    }

    // One DataFrame => DataFrame step per country, each adding that country's column.
    val countryFuncs = countries.map { country =>
      (df: DataFrame) => df.withColumn(country, countryValue(lit(country), col("tag"), col("value")))
    }

    val dfWithCountries = Function.chain(countryFuncs)(df).drop("tag").drop("value")
    

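    `Function.chain` folds a sequence of functions into one, applying them left to right, which is how the per-country steps above get composed. A minimal pure-Scala illustration (with made-up arithmetic functions, no Spark needed):

```scala
// Function.chain(Seq(f, g)) builds the function x => g(f(x)),
// i.e. the steps are applied in sequence order.
val addOne: Int => Int = _ + 1
val double: Int => Int = _ * 2

val combined = Function.chain(Seq(addOne, double))
println(combined(5)) // (5 + 1) * 2 = 12
```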
    Your dataframe 'dfWithCountries' will look like this:

    +--+--+---+---+
    |id|US| UK|Can|
    +--+--+---+---+
    | 1|50|  0|  0|
    | 1| 0|100|  0|
    | 1| 0|  0|125|
    | 2|75|  0|  0|
    | 2| 0|150|  0|
    | 2| 0|  0|175|
    +--+--+---+---+
    

    Now you can sum together all the values for your desired result:

    dfWithCountries.groupBy("id").sum(countries: _*).show
    

    Result:

    +--+-------+-------+--------+
    |id|SUM(US)|SUM(UK)|SUM(Can)|
    +--+-------+-------+--------+
    | 1|     50|    100|     125|
    | 2|     75|    150|     175|
    +--+-------+-------+--------+
    

    It's not a very elegant solution, though. I had to build a chain of functions to add all the columns, and with many countries the temporary data set grows into a very wide table that is mostly zeroes.
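    Since Spark 1.6 the same reshaping is available as a built-in via `RelationalGroupedDataset.pivot`. A sketch using the answer's assumed schema (`id`, `tag`, `value` are the example's column names; the local `SparkSession` is only there to make the snippet self-contained):

```scala
import org.apache.spark.sql.SparkSession

// Local session just to make the sketch runnable; in an application
// the existing session would be reused.
val spark = SparkSession.builder().master("local[1]").appName("pivot-demo").getOrCreate()
import spark.implicits._

// Same sample data as the answer's tables.
val df = Seq(
  (1, "US", 50L), (1, "UK", 100L), (1, "Can", 125L),
  (2, "US", 75L), (2, "UK", 150L), (2, "Can", 175L)
).toDF("id", "tag", "value")

// Passing the pivot values explicitly avoids an extra job to discover them.
val pivoted = df.groupBy("id").pivot("tag", Seq("US", "UK", "Can")).sum("value")
pivoted.show()
```

    This yields one plain `US`/`UK`/`Can` column per country directly, without building the wide intermediate table of zeroes.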
