Firstly, I am completely new to Scala and Spark, although I'm a bit familiar with PySpark. I am working with an external JSON file which is pretty huge, and I am not allowed to convert
Option 1: RDD API + json4s lib
One way is to use the json4s library, which is already used internally by Spark.
import org.json4s._
import org.json4s.jackson.JsonMethods._
// {"name":"ABC1", "roll_no":"12", "Major":"CS1"}
// {"name":"ABC2", "roll_no":"13", "Major":"CS2"}
// {"name":"ABC3", "roll_no":"14", "Major":"CS3"}
val file_location = "information.json"
val rdd = sc.textFile(file_location)
rdd.map { row =>
  val json_row = parse(row)
  (compact(json_row \ "name"), compact(json_row \ "roll_no"))
}.collect().foreach(println)
// Output
// ("ABC1","12")
// ("ABC2","13")
// ("ABC3","14")
First we parse each row into json_row, then we access the row's properties with the \ operator, e.g. json_row \ "name". The final result is a sequence of (name, roll_no) tuples.
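If you prefer typed results over raw strings, json4s can also deserialize each row into a case class with extract. A minimal sketch, assuming the same three-field records as above; the Student case class is just an illustrative name of mine:

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical case class matching the sample records; the extra "Major" field is ignored by default
case class Student(name: String, roll_no: String)

val students = rdd.map { row =>
  // DefaultFormats is required by extract; declaring it inside the closure keeps it serialization-friendly
  implicit val formats: Formats = DefaultFormats
  parse(row).extract[Student]
}
students.collect().foreach(println)
// Student(ABC1,12)
// Student(ABC2,13)
// Student(ABC3,14)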
Option 2: dataframe API + get_json_object()
A more straightforward approach is via the DataFrame API in combination with the get_json_object() function.
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._ // for the $"..." column syntax (available automatically in spark-shell)

val df = spark.read.text(file_location)

df.select(
    get_json_object($"value", "$.name").as("name"),
    get_json_object($"value", "$.roll_no").as("roll_no"))
  .collect()
  .foreach(println)
// [ABC1,12]
// [ABC2,13]
// [ABC3,14]
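If you need more than one or two fields, from_json with an explicit schema saves repeating a JSON path per column. A minimal sketch under the same assumptions about the input records:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Schema matching the sample records shown above
val schema = new StructType()
  .add("name", StringType)
  .add("roll_no", StringType)
  .add("Major", StringType)

df.select(from_json($"value", schema).as("data"))
  .select("data.name", "data.roll_no")
  .collect()
  .foreach(println)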
I used to parse JSON in Scala with this kind of method:
/** ---------------------------------------
  * Example of a method to parse simple JSON:
  * {
  *   "fields": [
  *     {
  *       "field1": "value",
  *       "field2": "value",
  *       "field3": "value"
  *     }
  *   ]
  * }
  */
import scala.io.Source
import scala.util.parsing.json._ // note: deprecated since Scala 2.11, but still works

case class OutputData(field1: String, field2: String, field3: String)

def singleMapJsonParser(jsonDataFile: String): List[OutputData] = {
  // Read the whole file into one string
  val jsonData: String = Source.fromFile(jsonDataFile).getLines.mkString
  // JSON.parseFull returns Option[Any]; match it against the expected structure
  JSON.parseFull(jsonData).map {
    case json: Map[String, List[Map[String, String]]] =>
      json("fields").map(v => OutputData(v("field1"), v("field2"), v("field3")))
  }.get
}
Then you just have to call your SparkContext to transform the List[OutputData] output into an RDD.
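A minimal sketch of that last step, assuming a live SparkContext named sc (as in spark-shell) and a JSON file shaped like the comment above; the file name is just a placeholder:

// Parse the file locally on the driver, then distribute the resulting list as an RDD
val parsed: List[OutputData] = singleMapJsonParser("fields.json") // placeholder path
val fieldsRdd = sc.parallelize(parsed)
fieldsRdd.foreach(println)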