How to read a zip containing multiple files in Apache Spark

星月不相逢 2020-12-06 18:41

I have a zipped file containing multiple text files. I want to read each of the files and build a list of RDDs containing the contents of each file.

5 Answers
  • 2020-12-06 19:13

    Here's a working version of @Atais' solution, enhanced to close the streams:

    import java.util.zip.ZipInputStream
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        if (path.toLowerCase.contains("zip")) {
          sc.binaryFiles(path, minPartitions)
            .flatMap {
              case (zipFilePath, zipContent) =>
                val zipInputStream = new ZipInputStream(zipContent.open())
                Stream.continually(zipInputStream.getNextEntry)
                  .takeWhile(_ != null)
                  .map { _ =>
                    // read the whole current entry as one string
                    scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString("\n")
                  } #::: { zipInputStream.close(); Stream.empty[String] } // close once exhausted
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }
    

    Then all you have to do to read a zip file is:

    sc.readFile(path)
    
  • 2020-12-06 19:15

    If you are reading binary files, use sc.binaryFiles. This returns an RDD of tuples containing the file name and a PortableDataStream. You can feed the latter into a ZipInputStream.
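
    A minimal sketch of that approach, assuming every entry is a UTF-8 text file (the path is a placeholder):

    import java.util.zip.ZipInputStream
    import scala.io.Source

    // One RDD element per line of every entry in every zip under the path.
    val lines = sc.binaryFiles("/data/*.zip").flatMap { case (name, content) =>
      val zis = new ZipInputStream(content.open())
      try {
        Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap { _ =>
            // read the current entry eagerly before advancing to the next one
            Source.fromInputStream(zis, "UTF-8").getLines().toList
          }
          .toList // materialize so the stream can be closed safely
      } finally {
        zis.close()
      }
    }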

  • 2020-12-06 19:19

    Apache Spark default compression support

    I have written all the necessary theory in another answer, which you might want to refer to: https://stackoverflow.com/a/45958182/1549135

    Read zip containing multiple files

    I have followed the advice given by @Herman and used ZipInputStream. This gave me the following solution, which returns an RDD[String] of the zip contents.

    import java.io.{BufferedReader, InputStreamReader}
    import java.util.zip.ZipInputStream
    import org.apache.spark.SparkContext
    import org.apache.spark.input.PortableDataStream
    import org.apache.spark.rdd.RDD
    
    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        if (path.endsWith(".zip")) {
          sc.binaryFiles(path, minPartitions)
            .flatMap { case (name: String, content: PortableDataStream) =>
              val zis = new ZipInputStream(content.open)
              Stream.continually(zis.getNextEntry)
                    .takeWhile {
                      case null => zis.close(); false // last entry reached: close the stream
                      case _    => true
                    }
                    .flatMap { _ =>
                      // stream the lines of the current entry
                      val br = new BufferedReader(new InputStreamReader(zis))
                      Stream.continually(br.readLine()).takeWhile(_ != null)
                    }
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }
    

    Simply use it by importing the implicit class and calling the readFile method on the SparkContext:

    import com.github.atais.spark.Implicits.ZipSparkContext
    sc.readFile(path)
    
  • 2020-12-06 19:35

    This reads only the first line of each file. Can anyone share insights? I am trying to read a CSV file which is zipped and create a JavaRDD for further processing.

    JavaPairRDD<String, PortableDataStream> zipData =
            sc.binaryFiles("hdfs://temp.zip");
    JavaRDD<Record> newRDDRecord = zipData.flatMap(
        new FlatMapFunction<Tuple2<String, PortableDataStream>, Record>() {
            public Iterator<Record> call(Tuple2<String, PortableDataStream> content) throws Exception {
                List<Record> records = new ArrayList<Record>();
                ZipInputStream zin = new ZipInputStream(content._2.open());
                ZipEntry zipEntry;
                while ((zipEntry = zin.getNextEntry()) != null) {
                    if (!zipEntry.isDirectory()) {
                        InputStreamReader streamReader = new InputStreamReader(zin);
                        BufferedReader bufferedReader = new BufferedReader(streamReader);
                        // readLine() is called only once per entry, so only the
                        // first line of each file ever gets parsed
                        String line = bufferedReader.readLine();
                        String[] fields = new CSVParser().parseLineMulti(line);
                        Record sd = new Record(TimeBuilder.convertStringToTimestamp(fields[0]),
                                getDefaultValue(fields[1]),
                                getDefaultValue(fields[22]));
                        records.add(sd);
                    }
                }

                return records.iterator();
            }
    });
    
  • 2020-12-06 19:36

    Here is another working solution, which also gives out the file name so it can later be split off and used to create separate schemas.

    implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

      def readFile(path: String,
                   minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

        if (path.toLowerCase.contains("zip")) {
          sc.binaryFiles(path, minPartitions)
            .flatMap {
              case (zipFilePath, zipContent) =>
                val zipInputStream = new ZipInputStream(zipContent.open())
                Stream.continually(zipInputStream.getNextEntry)
                  .takeWhile(_ != null)
                  .map { entry =>
                    // suffix every line with "~<entry name>" so the source file is recoverable
                    val filename = entry.getName
                    scala.io.Source.fromInputStream(zipInputStream, "UTF-8")
                      .getLines.mkString(s"~$filename\n") + s"~$filename"
                  } #::: { zipInputStream.close(); Stream.empty[String] } // close once exhausted
            }
        } else {
          sc.textFile(path, minPartitions)
        }
      }
    }

    The full code is here:

    https://github.com/kali786516/Spark2StructuredStreaming/blob/master/src/main/scala/com/dataframe/extraDFExamples/SparkReadZipFiles.scala
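
    For example, the embedded file name could be split back out like this (a sketch, assuming the '~' separator never occurs in the data itself; the path is a placeholder):

    // Each element above is a whole entry whose lines end with "~<file name>".
    val perLine = sc.readFile("/data/*.zip")
      .flatMap(_.split("\n"))
      .map { line =>
        val idx = line.lastIndexOf('~')
        (line.substring(idx + 1), line.substring(0, idx)) // (file name, line content)
      }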
