Question:
I have a requirement to group values by key. With an RDD I can do it like this:
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)
The result is:
(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))
How can I do this with a Dataset in Spark 2.0?
I have a way using a custom function, but it feels complicated. Is there a simpler method?
Answer 1:
I would suggest you start by creating a case class:
case class Monkey(city: String, firstName: String)
This case class should be defined outside the main class. Then you can just use the toDS function, groupBy, and the aggregation function collect_list, as below:
import spark.implicits._    // in Spark 2.0, the SparkSession's implicits (replaces sqlContext.implicits._)
import org.apache.spark.sql.functions._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test)
.map(row => Monkey(row._1, row._2))
.toDS()
.groupBy("city")
.agg(collect_list("firstName") as "list")
.show(false)
You will get output like this:
+-----------+------------------------+
|city |list |
+-----------+------------------------+
|Los Angeles|[Tom] |
|Detroit |[Michael, Peter, George]|
|Chicago |[David, Andrew] |
|Houston |[John] |
|New York |[Jack] |
+-----------+------------------------+
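As a side note (my own sketch, not part of the original answer): the typed Dataset API also has a groupByKey that mirrors the RDD groupByKey from the question. Assuming the same Monkey case class and the imports above:

sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupByKey(_.city)                                                    // typed grouping on the city field
  .mapGroups((city, monkeys) => (city, monkeys.map(_.firstName).toList)) // (city, list of names)
  .collect()
  .foreach(println)                                                      // e.g. (Detroit,List(Michael, Peter, George))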
You can always convert back to an RDD by just calling the .rdd function.
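For example (a minimal sketch reusing the code above; the row accessors reflect my assumption about the resulting schema of city plus collected list):

val rdd = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")
  .rdd                                                      // back to RDD[Row]
rdd.map(row => (row.getString(0), row.getSeq[String](1).toList))
  .foreach(println)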
Answer 2:
To create a Dataset, first define a case class outside your class:
case class Employee(city: String, name: String)
Then you can convert the list to a Dataset:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
).toDF("city", "name")
val data = test.as[Employee]
Or
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
val data = test.map(r => Employee(r._1, r._2)).toDS()
Now you can use groupBy and perform any aggregation:
data.groupBy("city").count().show
data.groupBy("city").agg(collect_list("name")).show
Hope this helps!
Answer 3:
First I would turn your RDD into a Dataset:
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val testDs = test.toDS()
Here you can get your column names :) Use them wisely!
testDs.schema.fields.foreach(x => println(x))
In the end you only need a groupBy:
testDs.groupBy("City?", "Name?")
RDDs are not really the Spark 2.0 way, I think. If you have any questions, please just ask.
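A possible completion of this answer (my own sketch, not from the original): groupBy alone returns a RelationalGroupedDataset, so an aggregation is still needed. Assuming testDs was built from the tuples above, the schema printed by the snippet would show the columns as _1 and _2:

import org.apache.spark.sql.functions.collect_list

testDs.groupBy("_1")                       // the city column of the tuple Dataset
  .agg(collect_list("_2").as("names"))     // collect all names per city
  .show(false)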
Source: https://stackoverflow.com/questions/44404817/how-to-use-dataset-to-groupby