I need to process a .warc file through Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python and to not read the whole file into an RDD at once.
If the record delimiter is `\n\n\n`, you can set the Hadoop configuration property `textinputformat.record.delimiter` when reading the file:

```python
rdd = sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
```

Each element of the resulting RDD is an (offset, text) pair, where the text is one record's worth of content between delimiters.
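To see what that delimiter setting does to the data, here is a plain-Python sketch of the splitting and a per-record parse you might run in a subsequent `map()`. The sample WARC-style records and the `parse_headers` helper are made up for illustration; no Spark cluster is needed to follow the logic:

```python
# Mimic Hadoop's textinputformat.record.delimiter: splitting the raw
# input on "\n\n\n" yields one chunk per record (one RDD element each).
# The sample records below are fabricated for illustration.
raw = (
    "WARC/1.0\nWARC-Type: response\nWARC-Target-URI: http://example.com/\n"
    "\n\n\n"
    "WARC/1.0\nWARC-Type: request\nWARC-Target-URI: http://example.com/\n"
)

records = raw.split("\n\n\n")

def parse_headers(record):
    """Parse a record's 'Key: value' header lines into a dict
    (hypothetical helper; skips the leading WARC/1.0 version line)."""
    headers = {}
    for line in record.strip().splitlines()[1:]:
        key, _, value = line.partition(": ")
        if key:
            headers[key] = value
    return headers

# In Spark this would be something like:
#   rdd.map(lambda kv: parse_headers(kv[1]))
parsed = [parse_headers(r) for r in records]
```

Note that real WARC records are delimited by two CRLF sequences (`\r\n\r\n`), so check your file's actual separator before setting the property.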