Spark: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

悲&欢浪女 2021-02-10 01:43

I am creating an Avro RDD with the following code.

 def convert2Avro(data : String ,schema : Schema)  : AvroKey[GenericRecord] = {
   var wrap         


        
2 answers
  • 2021-02-10 02:12

    Schema.RecordSchema does not implement Serializable, so Spark cannot ship it over the network as part of the task closure. Instead, convert the schema to a string, pass the string to the method, and reconstruct the Schema object inside the method.

    val schemaString = schema.toString
    val avroRDD = fieldsRDD.map(x => convert2Avro(x, schemaString))
    

    Inside the method, reconstruct the schema from the string:

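    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.AvroKey

    def convert2Avro(data: String, schemaString: String): AvroKey[GenericRecord] = {
      // Rebuild the Schema on the executor; only the JSON string was serialized.
      // A fresh Parser is used because one Parser cannot define the same named type twice.
      val schema = new Schema.Parser().parse(schemaString)
      val record = new GenericData.Record(schema)
      record.put("empname", "John") // placeholder value; `data` is not used yet
      val wrapper = new AvroKey[GenericRecord]()
      wrapper.datum(record)
      wrapper
    }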
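
    Parsing the schema for every record can be wasteful on large RDDs. A minimal sketch of one way around that, parsing once per partition instead. The names PartitionConversion and convertPartition are hypothetical, and filling "empname" from the input string is an assumption made only for illustration:

    ```scala
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.AvroKey

    object PartitionConversion {
      // Hypothetical helper: converts a whole partition, parsing the schema only once.
      def convertPartition(rows: Iterator[String],
                           schemaString: String): Iterator[AvroKey[GenericRecord]] = {
        val schema = new Schema.Parser().parse(schemaString) // once per partition
        rows.map { row =>
          val record = new GenericData.Record(schema)
          record.put("empname", row) // assumption: each input string is the employee name
          new AvroKey[GenericRecord](record)
        }
      }
    }
    ```

    With the fieldsRDD and schemaString from above, this would be called as fieldsRDD.mapPartitions(rows => PartitionConversion.convertPartition(rows, schemaString)).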
    
  • 2021-02-10 02:23

    Another alternative (from http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html) is to use static initialization.

    As the linked post explains:

    we are using a static initialization block. An instance of the recordInjection object will be created per JVM, i.e. we will have one instance per Spark worker

    And since it's created fresh for each worker, there is no serialization needed.
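
    The linked post is written in Java. In Scala, a rough equivalent of a static initializer is a top-level object, which is initialized once per JVM on first access. A sketch, assuming the schema JSON is known up front; the holder name EmployeeSchema and the record name "Employee" are hypothetical:

    ```scala
    import org.apache.avro.Schema

    // Hypothetical holder. Each executor JVM initializes its own copy on first use,
    // so no Schema instance has to be serialized with the task closure.
    object EmployeeSchema {
      // Assumption: the schema JSON is fixed and available on every executor.
      private val json =
        """{"type":"record","name":"Employee","fields":[{"name":"empname","type":"string"}]}"""

      lazy val schema: Schema = new Schema.Parser().parse(json)
    }
    ```

    Inside convert2Avro the record would then be built with new GenericData.Record(EmployeeSchema.schema). This relies on EmployeeSchema being a top-level object in compiled code; objects defined in an interactive shell may behave differently.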

    I prefer the static initializer, because I would worry that toString() might not contain all the information needed to reconstruct the object (it seems to work well in this case, but serialization is not toString()'s advertised purpose). The disadvantage is that this is arguably not a proper use of static (see, for example, "Java: when to use static methods").
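
    For what it's worth, Avro's Javadoc describes Schema.toString() as rendering the schema as JSON, which is the form Schema.Parser reads. The concern can still be checked for a particular schema with a round trip like the sketch below; the Employee schema is a hypothetical stand-in for the real one:

    ```scala
    import org.apache.avro.{Schema, SchemaBuilder}

    object SchemaRoundTrip {
      // Hypothetical example schema with the single field used in this thread.
      val original: Schema =
        SchemaBuilder.record("Employee").fields().requiredString("empname").endRecord()

      // Render to JSON with toString, then parse it back with a fresh parser.
      val restored: Schema = new Schema.Parser().parse(original.toString)

      def main(args: Array[String]): Unit =
        assert(restored == original, "schema changed in the toString/parse round trip")
    }
    ```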

    Since both approaches seem to work fine, the choice is probably a matter of preferred style.

    Update: Of course, depending on your program, the most elegant solution might be to avoid the problem altogether by keeping all your Avro code in the worker, i.e. do all the Avro processing you need (writing to a Kafka topic, for example) inside "convert2Avro". Then there is no need to return these objects back into an RDD. It really depends on what you want the RDD for.
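
    A sketch of what keeping everything in the worker could look like. WorkerSideAvro, processPartition and the send callback are hypothetical names; send stands in for whatever sink is used (a Kafka producer call, for instance), and filling "empname" from the input string is an assumption:

    ```scala
    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.EncoderFactory

    object WorkerSideAvro {
      // Builds, encodes and hands off every record inside the worker,
      // so no Avro object is ever returned into an RDD.
      def processPartition(rows: Iterator[String], schemaString: String)
                          (send: Array[Byte] => Unit): Unit = {
        val schema = new Schema.Parser().parse(schemaString)
        val writer = new GenericDatumWriter[GenericRecord](schema)
        rows.foreach { row =>
          val record = new GenericData.Record(schema)
          record.put("empname", row) // assumption: the input string is the employee name
          val out = new ByteArrayOutputStream()
          val encoder = EncoderFactory.get().binaryEncoder(out, null)
          writer.write(record, encoder)
          encoder.flush()
          send(out.toByteArray) // hypothetical sink
        }
      }
    }
    ```

    This would be driven from fieldsRDD.foreachPartition, passing the schema as a string as in the first answer and creating the sink inside the partition function.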
