Spark reading WARC file with custom InputFormat

旧时难觅i 2021-01-22 04:33

I need to process a .warc file through Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python and to not read the whole file into an RDD.

1 Answer
  •  清歌不尽
    2021-01-22 04:40

    If the record delimiter is `\n\n\n`, you can set `textinputformat.record.delimiter`:

    rdd = sc.newAPIHadoopFile(
      path,
      'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
      'org.apache.hadoop.io.LongWritable',
      'org.apache.hadoop.io.Text',
      conf={'textinputformat.record.delimiter': '\n\n\n'}
    )
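
    Each record then arrives as the value of a `(LongWritable, Text)` pair, so you can parse WARC headers out of it with plain Python before mapping over the RDD. A minimal sketch (simplified: it assumes `\n` line endings and a blank line between the header block and the payload; real WARC files use `\r\n`, so adjust the separators for your data):

    def parse_warc_record(record):
        # Split one raw WARC record string into (headers_dict, payload).
        head, _, payload = record.partition('\n\n')
        headers = {}
        for line in head.splitlines()[1:]:  # skip the "WARC/1.0" version line
            key, _, value = line.partition(':')
            headers[key.strip()] = value.strip()
        return headers, payload

    sample = (
        "WARC/1.0\n"
        "WARC-Type: response\n"
        "Content-Length: 5\n"
        "\n"
        "hello"
    )
    headers, body = parse_warc_record(sample)
    # headers['WARC-Type'] == 'response', body == 'hello'

    In Spark you would apply it as `rdd.map(lambda kv: parse_warc_record(kv[1]))`, keeping only the value from each key/value pair.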
    
