Can Apache Spark merge several similar lines into one line?

Posted by 不想你离开。 on 2019-12-11 08:04:01

Question


I am totally new to Apache Spark, so I apologize if my question seems naive, but I did not find a clear answer on the internet.

Here is the context of my problem: I want to retrieve JSON input data from an Apache Kafka server. The format is as follows:

{"deviceName":"device1", "counter":125}
{"deviceName":"device1", "counter":125}
{"deviceName":"device2", "counter":88}
{"deviceName":"device1", "counter":125}
{"deviceName":"device2", "counter":88}
{"deviceName":"device1", "counter":125}
{"deviceName":"device3", "counter":999}
{"deviceName":"device3", "counter":999}

With Spark or Spark Streaming, I want to process this data and get output in the following format:

{"deviceName":"device1", "counter":125, "nbOfTimes":4}
{"deviceName":"device2", "counter":88, "nbOfTimes":2}
{"deviceName":"device3", "counter":999, "nbOfTimes":2}

So, I would like to know whether what I am trying to do is possible with Spark. If so, can you give me some guidance? I would be very thankful.

Joe


Answer 1:


It can be done with both Spark and Spark Streaming. Let's first consider the batch case, with a JSON file containing your data.

// Read the JSON-lines file into a DataFrame (one JSON object per line).
val df = sqlContext.read.format("json").load("text.json")
// df: org.apache.spark.sql.DataFrame = [counter: bigint, deviceName: string]

df.show
// +-------+----------+
// |counter|deviceName|
// +-------+----------+
// |    125|   device1|
// |    125|   device1|
// |     88|   device2|
// |    125|   device1|
// |     88|   device2|
// |    125|   device1|
// |    999|   device3|
// |    999|   device3|
// +-------+----------+

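// Group identical (deviceName, counter) pairs, count them, and rename the count column to nbOfTimes.
df.groupBy("deviceName","counter").count.toDF("deviceName","counter","nbOfTimes").show
// +----------+-------+---------+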
// |deviceName|counter|nbOfTimes|
// +----------+-------+---------+
// |   device1|    125|        4|
// |   device2|     88|        2|
// |   device3|    999|        2|
// +----------+-------+---------+
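To get the result back as JSON lines like those requested in the question, the aggregated DataFrame can be serialized with `toJSON`. The following is a minimal sketch, assuming Spark 2.x or later (where `SparkSession` takes the place of `sqlContext`), a local master, and the same `text.json` input file. The object name `MergeLines`, the helper `countDuplicates`, and the output path `merged-output` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeLines {
  // Count identical (deviceName, counter) pairs; the count column is renamed to nbOfTimes.
  def countDuplicates(df: DataFrame): DataFrame =
    df.groupBy("deviceName", "counter")
      .count()
      .withColumnRenamed("count", "nbOfTimes")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MergeLines")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()

    val df = spark.read.json("text.json")
    val result = countDuplicates(df)

    // Print one JSON object per aggregated row.
    result.toJSON.collect().foreach(println)

    // Or write the same JSON lines to a directory (hypothetical path).
    result.write.mode("overwrite").json("merged-output")

    spark.stop()
  }
}
```

The order of the output rows is not guaranteed after a `groupBy`, so add an explicit `orderBy("deviceName")` if a stable order matters.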

You can of course write the result out in any format you want afterwards, but I think you get the main idea.
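For the Kafka input mentioned in the question, the same aggregation can be expressed against a stream. The sketch below uses Structured Streaming (the DataFrame-based API in Spark 2.x and later, not the older DStream-based Spark Streaming) and assumes the `spark-sql-kafka-0-10` connector is on the classpath. The broker address `localhost:9092`, the topic name `devices`, and the names `DeviceCountStream` and `parseAndCount` are placeholders, not values from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object DeviceCountStream {
  // Schema of one input message, taken from the sample lines in the question.
  val schema: StructType = new StructType()
    .add("deviceName", StringType)
    .add("counter", LongType)

  // Parse the Kafka `value` column as JSON, then count identical pairs.
  def parseAndCount(raw: DataFrame): DataFrame =
    raw.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), schema).as("data"))
      .select("data.*")
      .groupBy("deviceName", "counter")
      .count()
      .withColumnRenamed("count", "nbOfTimes")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DeviceCountStream")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "devices")                      // placeholder topic
      .load()

    // "complete" mode re-emits the full table of running counts on every trigger.
    parseAndCount(raw).writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

Note that these counts are running totals since the query started. Counting per time interval instead would require grouping on a time window as well, which is left out of this sketch.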



Source: https://stackoverflow.com/questions/38723796/can-apache-spark-merge-several-similar-lines-into-one-line
