Spark reading WARC file with custom InputFormat

旧时难觅i 2021-01-22 04:33

I need to process a .warc file through Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python and to not read the whole file into an RDD.

1 Answer
  •  清歌不尽
    2021-01-22 04:40

    If the record delimiter is `\n\n\n`, you can set `textinputformat.record.delimiter`:

    rdd = sc.newAPIHadoopFile(
      path,
      'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
      'org.apache.hadoop.io.LongWritable',
      'org.apache.hadoop.io.Text',
      conf={'textinputformat.record.delimiter': '\n\n\n'}
    )
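
    Each record then arrives as the value of a `(LongWritable, Text)` pair, so you can parse WARC headers out of it with plain Python before mapping over the RDD. A minimal sketch (simplified: it assumes `\n` line endings and a blank line between the header block and the payload; real WARC files use `\r\n`, so adjust the separators for your data):

    def parse_warc_record(record):
        # Split one raw WARC record string into (headers_dict, payload).
        head, _, payload = record.partition('\n\n')
        headers = {}
        for line in head.splitlines()[1:]:  # skip the "WARC/1.0" version line
            key, _, value = line.partition(':')
            headers[key.strip()] = value.strip()
        return headers, payload

    sample = (
        "WARC/1.0\n"
        "WARC-Type: response\n"
        "Content-Length: 5\n"
        "\n"
        "hello"
    )
    headers, body = parse_warc_record(sample)
    # headers['WARC-Type'] == 'response', body == 'hello'

    In Spark you would apply it as `rdd.map(lambda kv: parse_warc_record(kv[1]))`, keeping only the value from each key/value pair.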
    
