How to read a zip containing multiple files in Apache Spark

星月不相逢 2020-12-06 18:41

I have a zipped file containing multiple text files. I want to read each of the files and build a list of RDDs containing the contents of each file.

5 Answers
  • 2020-12-06 19:13

    Here's a working version of @Atais' solution, enhanced to close the streams:

    import java.util.zip.ZipInputStream
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        if (path.toLowerCase.contains("zip")) {
          sc.binaryFiles(path, minPartitions)
            .flatMap {
              case (zipFilePath, zipContent) =>
                val zipInputStream = new ZipInputStream(zipContent.open())
                Stream.continually(zipInputStream.getNextEntry)
                  .takeWhile(_ != null)
                  .map { _ =>
                    // read the whole current entry as one string
                    scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString("\n")
                  } #::: { zipInputStream.close(); Stream.empty[String] } // close once exhausted
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }
    

    Then all you have to do to read a zip file is:

    sc.readFile(path)
    
  • 2020-12-06 19:15

    If you are reading binary files, use sc.binaryFiles. This returns an RDD of tuples containing the file name and a PortableDataStream. You can feed the latter into a ZipInputStream.
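
    A minimal sketch of that approach, assuming every entry is a UTF-8 text file (the path is a placeholder):

    import java.util.zip.ZipInputStream
    import scala.io.Source

    // One RDD element per line of every entry in every zip under the path.
    val lines = sc.binaryFiles("/data/*.zip").flatMap { case (name, content) =>
      val zis = new ZipInputStream(content.open())
      try {
        Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap { _ =>
            // read the current entry eagerly before advancing to the next one
            Source.fromInputStream(zis, "UTF-8").getLines().toList
          }
          .toList // materialize so the stream can be closed safely
      } finally {
        zis.close()
      }
    }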

  • 2020-12-06 19:19

    Apache Spark default compression support

    I have written all the necessary theory in another answer, which you might want to refer to: https://stackoverflow.com/a/45958182/1549135

    Read zip containing multiple files

    I have followed the advice given by @Herman and used ZipInputStream. This gave me the following solution, which returns an RDD[String] of the zip contents.

    import java.io.{BufferedReader, InputStreamReader}
    import java.util.zip.ZipInputStream
    import org.apache.spark.SparkContext
    import org.apache.spark.input.PortableDataStream
    import org.apache.spark.rdd.RDD
    
    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        if (path.endsWith(".zip")) {
          sc.binaryFiles(path, minPartitions)
            .flatMap { case (name: String, content: PortableDataStream) =>
              val zis = new ZipInputStream(content.open)
              Stream.continually(zis.getNextEntry)
                    .takeWhile {
                      case null => zis.close(); false // last entry reached: close the stream
                      case _    => true
                    }
                    .flatMap { _ =>
                      // stream the lines of the current entry
                      val br = new BufferedReader(new InputStreamReader(zis))
                      Stream.continually(br.readLine()).takeWhile(_ != null)
                    }
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }
    

    Simply use it by importing the implicit class and calling the readFile method on the SparkContext:

    import com.github.atais.spark.Implicits.ZipSparkContext
    sc.readFile(path)
    
  • 2020-12-06 19:35

    This reads only the first line of each file. Can anyone share insights? I am trying to read a CSV file which is zipped and create a JavaRDD for further processing.

    JavaPairRDD<String, PortableDataStream> zipData =
            sc.binaryFiles("hdfs://temp.zip");
    JavaRDD<Record> newRDDRecord = zipData.flatMap(
        new FlatMapFunction<Tuple2<String, PortableDataStream>, Record>() {
            public Iterator<Record> call(Tuple2<String, PortableDataStream> content) throws Exception {
                List<Record> records = new ArrayList<Record>();
                ZipInputStream zin = new ZipInputStream(content._2.open());
                ZipEntry zipEntry;
                while ((zipEntry = zin.getNextEntry()) != null) {
                    if (!zipEntry.isDirectory()) {
                        InputStreamReader streamReader = new InputStreamReader(zin);
                        BufferedReader bufferedReader = new BufferedReader(streamReader);
                        // readLine() is called only once per entry, so only the
                        // first line of each file ever gets parsed
                        String line = bufferedReader.readLine();
                        String[] fields = new CSVParser().parseLineMulti(line);
                        Record sd = new Record(TimeBuilder.convertStringToTimestamp(fields[0]),
                                getDefaultValue(fields[1]),
                                getDefaultValue(fields[22]));
                        records.add(sd);
                    }
                }

                return records.iterator();
            }
    });
    
  • 2020-12-06 19:36

    Here is another working solution, which also gives out the file name so it can later be split off and used to create separate schemas.

    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        if (path.toLowerCase.contains("zip")) {
          sc.binaryFiles(path, minPartitions)
            .flatMap {
              case (zipFilePath, zipContent) =>
                val zipInputStream = new ZipInputStream(zipContent.open())
                Stream.continually(zipInputStream.getNextEntry)
                  .takeWhile(_ != null)
                  .map { entry =>
                    // suffix every line with "~<entry name>" so the source file is recoverable
                    val filename = entry.getName
                    scala.io.Source.fromInputStream(zipInputStream, "UTF-8")
                      .getLines.mkString(s"~$filename\n") + s"~$filename"
                  } #::: { zipInputStream.close(); Stream.empty[String] } // close once exhausted
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }

    The full code is here:

    https://github.com/kali786516/Spark2StructuredStreaming/blob/master/src/main/scala/com/dataframe/extraDFExamples/SparkReadZipFiles.scala
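
    For example, the embedded file name could be split back out like this (a sketch, assuming the '~' separator never occurs in the data itself; the path is a placeholder):

    // Each element above is a whole entry whose lines end with "~<file name>".
    val perLine = sc.readFile("/data/*.zip")
      .flatMap(_.split("\n"))
      .map { line =>
        val idx = line.lastIndexOf('~')
        (line.substring(idx + 1), line.substring(0, idx)) // (file name, line content)
      }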
