How can I load Avros in Spark using the schema on-board the Avro file(s)?


I am running CDH 4.4 with Spark 0.9.0 from a Cloudera parcel.

I have a bunch of Avro files that were created via Pig\'s AvroStorage UDF. I want to load these files in Sp

2 Answers
  • 2021-02-02 02:39

    To answer my own question:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.JobConf
    
    // spark-shell -usejavacp -classpath "*.jar"
    
    val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00016.avro"
    
    val jobConf = new JobConf(sc.hadoopConfiguration)
    
    // AvroInputFormat reads the writer schema embedded in each Avro file, so no
    // schema has to be supplied here. Keys are AvroWrapper[GenericRecord], values
    // are NullWritable; the final argument is the minimum number of splits.
    val rdd = sc.hadoopFile(
      input,
      classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
      classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
      classOf[org.apache.hadoop.io.NullWritable],
      10)
    
    // Grab the first record and access its Avro fields by name.
    val f1 = rdd.first
    val a = f1._1.datum
    a.get("rawLog") // Access avro fields
    
  • 2021-02-02 02:47

    This works for me:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    
    ...
    val path = "hdfs:///path/to/your/avro/folder"
    val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
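    A usage sketch, again not part of the original answer: each element of avroRDD is an (AvroWrapper[GenericRecord], NullWritable) pair, and the wrapper's datum was deserialized with the writer schema stored inside the file, so you can inspect that schema and pull fields out by name. The field name someField is a placeholder for a field from your own schema.

    // Print the schema that was read from the Avro file itself.
    println(avroRDD.first._1.datum.getSchema)
    
    // "someField" is a placeholder -- substitute a real field from your schema.
    val values = avroRDD.map { case (wrapper, _) => String.valueOf(wrapper.datum.get("someField")) }
    values.take(10).foreach(println)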
    