Zip support in Apache Spark

时光取名叫无心 2020-12-03 15:02

I have read about Spark's support for gzip-kind input files here, and I wonder if the same support exists for other kinds of compressed files, such as .zip files.

5 Answers
  • 2020-12-03 15:25

You can use sc.binaryFiles to read the zip as a binary file:

    import java.util.zip.ZipInputStream
    import org.apache.spark.input.PortableDataStream

    val rdd = sc.binaryFiles(path).map { // map, not flatMap: a ZipInputStream is not a collection
        case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
    }  // => RDD[ZipInputStream]
    

    And then you can map the ZipInputStream to a list of lines:

    import java.io.{BufferedReader, InputStreamReader}

    val zis = rdd.first
    val entry = zis.getNextEntry // position the stream at the first entry
    val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
    val res = Stream.continually(br.readLine()).takeWhile(_ != null).toList
    

    But the problem remains that the zip file is not splittable.

  • 2020-12-03 15:27

    Since Apache Spark uses Hadoop input formats, we can look at the Hadoop documentation on how to process zip files and see whether something there works.

    This site gives us an idea of how to use the ZipFileInputFormat. That said, since zip files are not splittable (see this), your request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files; a sketch of that layout follows below.

    This question is similar to this other question, but it adds the further question of whether it would be possible to have a single zip file (which, since zip is not a splittable format, is not a good idea).
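
    If you go with the many-small-zips layout, a minimal sketch could look like the following (the path is illustrative, and each archive is assumed to hold a single text entry):

    import java.util.zip.ZipInputStream
    import scala.io.Source
    import org.apache.spark.input.PortableDataStream

    // each archive becomes one task, so parallelism comes from the number
    // of .zip files rather than from splitting a single large one
    val lines = sc.binaryFiles("hdfs:///data/zips/")
      .flatMap { case (_, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open)
        zis.getNextEntry // assume one entry per archive
        Source.fromInputStream(zis, "UTF-8").getLines()
      }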

  • 2020-12-03 15:29

    Spark supports compressed files by default

    According to the Spark Programming Guide:

    All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
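
    For example, reading a gzipped text file needs no extra configuration (the path below is illustrative):

    // the .gz extension triggers transparent decompression
    val logs = sc.textFile("hdfs:///logs/app.log.gz")
    logs.take(5).foreach(println)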

    This can be expanded with information about which compression formats are supported by Hadoop, which can basically be checked by finding all classes extending CompressionCodec (docs):

    name    | ext      | codec class
    -------------------------------------------------------------
    bzip2   | .bz2     | org.apache.hadoop.io.compress.BZip2Codec 
    default | .deflate | org.apache.hadoop.io.compress.DefaultCodec 
    deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec 
    gzip    | .gz      | org.apache.hadoop.io.compress.GzipCodec 
    lz4     | .lz4     | org.apache.hadoop.io.compress.Lz4Codec 
    snappy  | .snappy  | org.apache.hadoop.io.compress.SnappyCodec
    

    Source: List the available hadoop codecs
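
    If you want to verify this at runtime, a short sketch (assuming only the Hadoop classes that ship with Spark) can list the codec classes registered in the current configuration:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.compress.CompressionCodecFactory

    // prints the registered codec classes, e.g. GzipCodec, BZip2Codec
    CompressionCodecFactory.getCodecClasses(new Configuration()).asScala
      .foreach(codecClass => println(codecClass.getName))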

    So the above formats, and many more possibilities, can be read simply by calling:

    sc.textFile(path)
    

    Reading zip files in Spark

    Unfortunately, zip is not on the supported list by default.

    I found a great article, Hadoop: Processing ZIP files in Map/Reduce, and some answers (example) explaining how to use the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API. But this did not work for me.

    My solution

    Without any external dependencies, you can load your file with sc.binaryFiles and then decompress the PortableDataStream while reading its content. This is the approach I have chosen.

    import java.io.{BufferedReader, InputStreamReader}
    import java.util.zip.ZipInputStream
    import org.apache.spark.SparkContext
    import org.apache.spark.input.PortableDataStream
    import org.apache.spark.rdd.RDD

    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        if (path.endsWith(".zip")) {
          sc.binaryFiles(path, minPartitions)
            .flatMap { case (name: String, content: PortableDataStream) =>
              val zis = new ZipInputStream(content.open)
              // this solution works only for a single file in the zip
              val entry = zis.getNextEntry // positions the stream at the first entry
              val br = new BufferedReader(new InputStreamReader(zis))
              Stream.continually(br.readLine()).takeWhile(_ != null)
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }
    

    To use this implicit class, import it and call the readFile method on SparkContext:

    import com.github.atais.spark.Implicits.ZipSparkContext
    sc.readFile(path)
    

    And the implicit class will load your zip file properly and return an RDD[String], just like sc.textFile does.

    Note: This only works for a single file in the zip archive!
    For multiple files in your zip, see this answer: https://stackoverflow.com/a/45958458/1549135
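
    For completeness, here is a rough sketch along the lines of that linked answer, where the flatMap body iterates over every entry instead of only the first one (the path is illustrative):

    import java.io.{BufferedReader, InputStreamReader}
    import java.util.zip.ZipInputStream
    import org.apache.spark.input.PortableDataStream

    // reads the lines of all entries in each archive
    val allLines = sc.binaryFiles("/data/archives/*.zip")
      .flatMap { case (_, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open)
        Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap { _ =>
            val br = new BufferedReader(new InputStreamReader(zis))
            Stream.continually(br.readLine()).takeWhile(_ != null)
          }
      }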

  • 2020-12-03 15:29

    Below is an example which searches a directory for .zip files and creates an RDD using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes those files to an output directory.

    // allzip and ProcessFile come from the linked example below;
    // allzip is the list of .zip files found in the input directory
    allzip.foreach { x =>
      val zipFileRDD = sc.newAPIHadoopFile(
        x.getPath.toString,
        classOf[ZipFileInputFormat],
        classOf[Text],
        classOf[BytesWritable], hadoopConf)

      zipFileRDD.foreach { y =>
        ProcessFile(y._1.toString, y._2)
      }
    }
    

    https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala

    The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop

  • 2020-12-03 15:44

    You can use sc.binaryFiles to open the zip file in binary format, then unzip it into text format. Unfortunately, the zip file is not splittable, so each whole archive must be decompressed by a single task; after that you may want to call repartition to balance the data across partitions.

    Here is an example in Python. More info is at http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/

     import io
     import zipfile

     file_RDD = sc.binaryFiles( HDFS_path + data_path )

     def Zip_open( binary_stream_string ) : # treat the binary stream as a zipped file
         try :
             pseudo_file = io.BytesIO( binary_stream_string )
             zf = zipfile.ZipFile( pseudo_file )
             return zf
         except Exception :
             return None

     def read_zip_lines(zipfile_object) :
         file_iter = zipfile_object.open('diff.txt')  # entry name inside the archive
         data = file_iter.readlines()
         return data

     # open each archive, drop unreadable ones, then extract the lines
     My_RDD = ( file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))
                        .filter(lambda kv: kv[1] is not None)
                        .mapValues(read_zip_lines) )
    