Read external JSON file into RDD and extract specific values in Scala

被撕碎了的回忆 2021-01-14 20:46

Firstly, I am completely new to Scala and Spark, although a bit familiar with PySpark. I am working with an external JSON file which is pretty huge, and I am not allowed to convert…

2 Answers
  • 2021-01-14 21:13

    Option 1: RDD API + json4s lib

    One way is to use the json4s library, which Spark already uses internally.

    import org.json4s._
    import org.json4s.jackson.JsonMethods._
    
    // {"name":"ABC1", "roll_no":"12", "Major":"CS1"}
    // {"name":"ABC2", "roll_no":"13", "Major":"CS2"}
    // {"name":"ABC3", "roll_no":"14", "Major":"CS3"}
    val file_location = "information.json"
    
    val rdd = sc.textFile(file_location)
    
    rdd.map { row =>
      // parse each line of the file into a json4s AST
      val json_row = parse(row)
    
      // pull out individual fields with the \ operator; compact() renders
      // the resulting JValues back as JSON strings
      (compact(json_row \ "name"), compact(json_row \ "roll_no"))
    }.collect().foreach(println)
    
    // Output
    // ("ABC1","12")
    // ("ABC2","13")
    // ("ABC3","14")
    
    

    First we parse each row into json_row, then we access the properties of the row with the \ operator, e.g. json_row \ "name". The final result is a sequence of (name, roll_no) tuples.
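
    Note that compact() keeps the JSON quoting around string values (hence "ABC1" in the output above). If you want the raw strings instead, json4s can also extract typed values; a minimal sketch reusing the rdd from above (the implicit Formats value is an addition not in the original answer):

    rdd.map { row =>
      // DefaultFormats is required by extract[]; declared inside the
      // closure so nothing non-serializable is captured
      implicit val formats: Formats = DefaultFormats
      val json_row = parse(row)
    
      ((json_row \ "name").extract[String], (json_row \ "roll_no").extract[String])
    }.collect().foreach(println)
    
    // Output
    // (ABC1,12)
    // (ABC2,13)
    // (ABC3,14)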

    Option 2: DataFrame API + get_json_object()

    A more straightforward approach is the DataFrame API combined with the get_json_object() function.

    import org.apache.spark.sql.functions.get_json_object
    import spark.implicits._ // for the $"..." column syntax
    
    val df = spark.read.text(file_location)
    
    df.select(
      get_json_object($"value", "$.name").as("name"),
      get_json_object($"value", "$.roll_no").as("roll_no"))
    .collect()
    .foreach(println)
    
    // [ABC1,12]
    // [ABC2,13]
    // [ABC3,14]
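
    Since the sample file is line-delimited JSON (one object per line), Spark's built-in JSON datasource can also infer the schema directly; a minimal sketch under that assumption (json_df is just an illustrative name):

    // let Spark parse the JSON and infer the schema itself
    val json_df = spark.read.json(file_location)
    
    json_df.select("name", "roll_no").show()
    
    // +----+-------+
    // |name|roll_no|
    // +----+-------+
    // |ABC1|     12|
    // |ABC2|     13|
    // |ABC3|     14|
    // +----+-------+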
    
  • 2021-01-14 21:26

    I used to parse JSON in Scala with this kind of method:

     /** ---------------------------------------
       * Example of method to parse simple json:
       * {
       *   "fields": [
       *     {
       *       "field1": "value",
       *       "field2": "value",
       *       "field3": "value"
       *     }
       *   ]
       * }
       */
    
    import scala.io.Source
    import scala.util.parsing.json._ // deprecated; removed from the stdlib in Scala 2.13
    
      case class OutputData(field1: String, field2: String, field3: String)
    
      def singleMapJsonParser(jsonDataFile: String): List[OutputData] = {
    
        // read the whole file into one string
        val jsonData: String = Source.fromFile(jsonDataFile).getLines.mkString
    
        // JSON.parseFull returns Option[Any], so match on the expected shape;
        // @unchecked silences the unavoidable type-erasure warning
        JSON.parseFull(jsonData) match {
          case Some(json: Map[String, List[Map[String, String]]] @unchecked) =>
            json("fields").map(v => OutputData(v("field1"), v("field2"), v("field3")))
          case _ => Nil
        }
      }
    
    

    Then you just have to call your SparkContext to transform the List[OutputData] output into an RDD.
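
    To make that concrete, here is a minimal sketch, assuming the spark-shell sc and a hypothetical file name data.json matching the schema above:

      // "data.json" is a hypothetical path; parallelize the parsed
      // List[OutputData] into an RDD[OutputData]
      val rdd = sc.parallelize(singleMapJsonParser("data.json"))
    
      rdd.map(_.field1).collect().foreach(println)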
