Firstly, I am completely new to Scala and Spark, although a bit familiar with PySpark. I am working with an external JSON file which is pretty huge and I am not allowed to convert…
Option 1: RDD API + json4s lib
One way is to use the json4s library, which is already used internally by Spark.
import org.json4s._
import org.json4s.jackson.JsonMethods._

// information.json contains one JSON object per line:
// {"name":"ABC1", "roll_no":"12", "Major":"CS1"}
// {"name":"ABC2", "roll_no":"13", "Major":"CS2"}
// {"name":"ABC3", "roll_no":"14", "Major":"CS3"}
val file_location = "information.json"

// Read the file as plain text, one JSON document per line
val rdd = sc.textFile(file_location)

rdd.map { row =>
  // Parse each line into a JValue and pull out the fields we need
  val json_row = parse(row)
  (compact(json_row \ "name"), compact(json_row \ "roll_no"))
}.collect().foreach(println)
// Output
// ("ABC1","12")
// ("ABC2","13")
// ("ABC3","14")
First we parse the row data into json_row, then we access the properties of the row with the \ operator, e.g. json_row \ "name". The final result is a sequence of (name, roll_no) tuples.
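If you need typed objects rather than raw JSON strings, json4s can also deserialize each line into a case class via extract. A minimal sketch, reusing the same rdd as above; the Student case class is just an illustrative name for the fields in information.json:

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical case class mirroring the JSON fields
case class Student(name: String, roll_no: String, Major: String)

val students = rdd.map { row =>
  // DefaultFormats is created inside the closure so nothing needs to be serialized
  implicit val formats: Formats = DefaultFormats
  parse(row).extract[Student]
}
students.collect().foreach(println)
// Student(ABC1,12,CS1)
// Student(ABC2,13,CS2)
// Student(ABC3,14,CS3)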
Option 2: DataFrame API + get_json_object()
A more straightforward approach is to use the DataFrame API in combination with the get_json_object() function.
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._ // enables the $"column" syntax outside the shell

// Read each line of the file into a single string column named "value"
val df = spark.read.text(file_location)

df.select(
    get_json_object($"value", "$.name").as("name"),
    get_json_object($"value", "$.roll_no").as("roll_no"))
  .collect()
  .foreach(println)
// [ABC1,12]
// [ABC2,13]
// [ABC3,14]
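Note that get_json_object() always yields string columns; if roll_no should be numeric, you can cast it after extraction. A minimal sketch, reusing the df from above:

df.select(
    get_json_object($"value", "$.name").as("name"),
    get_json_object($"value", "$.roll_no").cast("int").as("roll_no"))
  .printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- roll_no: integer (nullable = true)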