How to read PDF files and xml files in Apache Spark scala?

后端未结

关注

 3  2098

My sample code for reading text file is

val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitio


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  盖世英雄少女心        
                
              
                            
                2020-12-19 21:45
              
            
            
                                                                       
You can simply use spark-shell with tika and run the below code in a sequential manner or in a distributed manner depending upon your use case

spark-shell --jars tika-app-1.8.jar
val binRDD = sc.binaryFiles("/data/")
val textRDD = binRDD.map(file => {new org.apache.tika.Tika().parseToString(file._2.open( ))})
textRDD.saveAsTextFile("/output/")
System.exit(0)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  感动是毒        
                
              
                            
                2020-12-19 21:50
              
            
            
                                                                       
PDF can be parse in pyspark as follow:

If PDF is store in HDFS then using sc.binaryFiles() as PDF is store in binary format.
Then the binary content can be send to pdfminer for parsing. 

import pdfminer
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def return_device_content(cont):
    fp = io.BytesIO(cont)
    parser = PDFParser(fp)
    document = PDFDocument(parser)

filesPath="/user/root/*.pdf"
fileData = sc.binaryFiles(filesPath)
file_content = fileData.map(lambda content : content[1])
file_content1 = file_content.map(return_device_content)


Further parsing is can be done using functionality provided by pdfminer. 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  名媛妹妹        
                
              
                            
                2020-12-19 22:02
              
            
            
                                                                       
PDF & XML can be parsed using Tika: 

look at Apache Tika - a content analysis toolkit

look at 
- https://tika.apache.org/1.9/api/org/apache/tika/parser/xml/

- http://tika.apache.org/0.7/api/org/apache/tika/parser/pdf/PDFParser.html

- https://tika.apache.org/1.9/api/org/apache/tika/parser/AutoDetectParser.html

Below is example integration of Spark with Tika : 

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._

object TikaFileParser {

  def tikaFunc (a: (String, PortableDataStream)) = {

    val file : File = new File(a._1.drop(5))
    val myparser : AutoDetectParser = new AutoDetectParser()
    val stream : InputStream = new FileInputStream(file)
    val handler : WriteOutContentHandler = new WriteOutContentHandler(-1)
    val metadata : Metadata = new Metadata()
    val context : ParseContext = new ParseContext()

    myparser.parse(stream, handler, metadata, context)

    stream.close

    println(handler.toString())
    println("------------------------------------------------")
  }


  def main(args: Array[String]) {

    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach( x => tikaFunc(x))
  }
}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复