How to use a Dataset with groupBy

Deadly submitted on 2019-12-05 10:30:28
Ramesh Maharjan

I would suggest you start by creating a case class:

case class Monkey(city: String, firstName: String)

This case class should be defined outside the main class, so that Spark can find an encoder for it. Then you can just use the toDS function, group with groupBy, and apply the aggregation function collect_list as below:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)
sc.parallelize(test)
  .map(row => Monkey(row._1, row._2)) // wrap each tuple in the case class
  .toDS()                             // convert the RDD to a Dataset[Monkey]
  .groupBy("city")
  .agg(collect_list("firstName") as "list")
  .show(false)

You will get output like this:

+-----------+------------------------+
|city       |list                    |
+-----------+------------------------+
|Los Angeles|[Tom]                   |
|Detroit    |[Michael, Peter, George]|
|Chicago    |[David, Andrew]         |
|Houston    |[John]                  |
|New York   |[Jack]                  |
+-----------+------------------------+

You can always convert back to an RDD by just calling the .rdd function.
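For example, a minimal sketch (the val name grouped is hypothetical; it simply holds the result of the groupBy/agg chain above instead of calling .show on it):

val grouped = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")

// .rdd gives back an RDD[Row] that you can process with the plain RDD API
grouped.rdd.collect().foreach(println)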

To create a Dataset, first define a case class outside your class:

case class Employee(city: String, name: String)

Then you can convert the sequence to a Dataset:

import org.apache.spark.sql.SparkSession

val spark =
  SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
).toDF("city", "name")

val data = test.as[Employee]

Or

import spark.implicits._

val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)

val data = test.map(r => Employee(r._1, r._2)).toDS()

Now you can group by city and perform any aggregation, for example:

data.groupBy("city").count().show

data.groupBy("city").agg(collect_list("name")).show

Hope this helps!

First I would turn your RDD into a Dataset:

val spark: org.apache.spark.sql.SparkSession = ??? // your existing SparkSession
import spark.implicits._

val testDs = test.toDS()

Here you get your column names :) Use them wisely!

testDs.schema.fields.foreach(x => println(x))

In the end you only need to use a groupBy:

testDs.groupBy("City?", "Name?")
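Note that groupBy on its own only returns a grouped handle (a RelationalGroupedDataset); you still need an aggregation to get a result back. A minimal sketch, assuming the columns turn out to be named city and name (check them against the schema printed above):

import org.apache.spark.sql.functions._

// Hypothetical column names, taken from the schema printout above
testDs.groupBy("city")
  .agg(collect_list("name") as "names")
  .show(false)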

RDDs are not really the Spark 2.0 way, I think. If you have any questions, please just ask.
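As a side note on that last point: if you want to stay in typed Dataset land instead of falling back to RDDs, groupByKey is the typed counterpart of groupBy. A minimal sketch, reusing the Employee Dataset (data) built in the previous answer:

import spark.implicits._

// Typed grouping: the key function runs on Employee objects directly,
// and mapGroups turns each (city, employees) group into a tuple
data.groupByKey(_.city)
  .mapGroups((city, employees) => (city, employees.map(_.name).toList))
  .show(false)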
