How to read PDF files and XML files in Apache Spark (Scala)?


My sample code for reading a text file is:

val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)


        
3 Answers
  • 2020-12-19 21:45

    You can simply use spark-shell with Tika and run the code below, either sequentially or distributed, depending on your use case:

    spark-shell --jars tika-app-1.8.jar
    // Read every file under /data/ as (path, PortableDataStream) pairs
    val binRDD = sc.binaryFiles("/data/")
    // Parse each file's bytes to plain text with Tika's facade
    val textRDD = binRDD.map(file => new org.apache.tika.Tika().parseToString(file._2.open()))
    textRDD.saveAsTextFile("/output/")
    System.exit(0)
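
    If this needs to run distributed, a minimal variant of the same flow (an illustrative sketch, assuming the same tika-app jar is on the classpath) closes each stream explicitly after parsing:

    val binRDD = sc.binaryFiles("/data/")
    val textRDD = binRDD.map { case (path, pds) =>
      // Open each PortableDataStream, parse it with Tika, and always close it
      val stream = pds.open()
      try new org.apache.tika.Tika().parseToString(stream)
      finally stream.close()
    }
    textRDD.saveAsTextFile("/output/")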
    
  • 2020-12-19 21:50

    PDF can be parsed in PySpark as follows:

    If the PDF is stored in HDFS, use sc.binaryFiles(), since a PDF is stored in binary format. The binary content can then be sent to pdfminer for parsing.

    import io
    import pdfminer
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument

    def return_device_content(cont):
        # Wrap the raw bytes in a file-like object for pdfminer
        fp = io.BytesIO(cont)
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        return document

    filesPath = "/user/root/*.pdf"
    fileData = sc.binaryFiles(filesPath)
    # Keep only the binary content, dropping the file path
    file_content = fileData.map(lambda content: content[1])
    file_content1 = file_content.map(return_device_content)
    

    Further parsing can be done using the functionality provided by pdfminer.

  • 2020-12-19 22:02

    PDF & XML can be parsed using Tika. Have a look at Apache Tika - a content analysis toolkit:

    - https://tika.apache.org/1.9/api/org/apache/tika/parser/xml/
    - http://tika.apache.org/0.7/api/org/apache/tika/parser/pdf/PDFParser.html
    - https://tika.apache.org/1.9/api/org/apache/tika/parser/AutoDetectParser.html

    Below is an example integration of Spark with Tika:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf
    import org.apache.spark.input.PortableDataStream
    import org.apache.tika.metadata._
    import org.apache.tika.parser._
    import org.apache.tika.sax.WriteOutContentHandler
    import java.io._

    object TikaFileParser {

      def tikaFunc (a: (String, PortableDataStream)) = {

        // a._1 is a URI like "file:/home/user/doc.pdf"; drop(5) strips the
        // "file:" scheme, so this variant only works for local files
        val file : File = new File(a._1.drop(5))
        val myparser : AutoDetectParser = new AutoDetectParser()
        val stream : InputStream = new FileInputStream(file)
        // A write limit of -1 disables the default cap on extracted characters
        val handler : WriteOutContentHandler = new WriteOutContentHandler(-1)
        val metadata : Metadata = new Metadata()
        val context : ParseContext = new ParseContext()

        // Tika auto-detects the file type (PDF, XML, ...) and extracts its text
        myparser.parse(stream, handler, metadata, context)

        stream.close()

        println(handler.toString())
        println("------------------------------------------------")
      }


      def main(args: Array[String]) {

        val filesPath = "/home/user/documents/*"
        val conf = new SparkConf().setAppName("TikaFileParser")
        val sc = new SparkContext(conf)
        // (path, PortableDataStream) pairs, one per matched file
        val fileData = sc.binaryFiles(filesPath)
        fileData.foreach( x => tikaFunc(x))
      }
    }
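
    Note that tikaFunc reopens the path with a FileInputStream, so it only works for local files. A hedged sketch of an alternative, using the same Tika APIs but parsing the PortableDataStream directly, would also work for HDFS paths:

    def tikaFromStream(a: (String, PortableDataStream)): String = {
      val handler = new WriteOutContentHandler(-1)
      val stream = a._2.open()
      // Parse straight from the stream; no local File required
      try new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext())
      finally stream.close()
      handler.toString
    }

    // Usage: fileData.map(tikaFromStream).foreach(println)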
    