How to parse PDF files in MapReduce programs?

佛祖请我去吃肉 2021-01-16 03:49

I want to parse PDF files in my Hadoop 2.2.0 program. I found this, followed what it says, and so far I have these three classes:

  1. PDFWordCo
2 answers
  •  暖寄归人
    2021-01-16 04:33

    Reading PDFs is not that difficult: you need to extend the class FileInputFormat and implement a matching RecordReader. The FileInputFormat subclass must not split PDF files, because they are binary and a fragment cut at an HDFS block boundary cannot be parsed on its own.

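    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    
    public class PDFInputFormat extends FileInputFormat<Text, Text> {
    
      @Override
      public RecordReader<Text, Text> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException, InterruptedException {
          return new PDFLineRecordReader();
      }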
    
      // Do not allow to ever split PDF files, even if larger than HDFS block size
      @Override
      protected boolean isSplitable(JobContext context, Path filename) {
        return false;
      }
    
    }
    

    The RecordReader then performs the reading itself (I am using PDFBox to read PDFs).

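    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PDFBox 1.8.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
    import org.apache.pdfbox.util.PDFTextStripper;
    
    public class PDFLineRecordReader extends RecordReader<Text, Text> {
    
    // Each extracted line is emitted as the key; the value stays empty
    private Text key = new Text();
    private Text value = new Text();
    private int currentLine = 0;
    private List<String> lines = null;
    
    private PDDocument doc = null;
    private PDFTextStripper textStripper = null;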
    
    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
    
        FileSplit fileSplit = (FileSplit) split;
        final Path file = fileSplit.getPath();
    
        Configuration conf = context.getConfiguration();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream filein = fs.open(fileSplit.getPath());
    
        if (filein != null) {
    
            doc = PDDocument.load(filein);
    
            // Was the PDF read successfully?
            if (doc != null) {
                textStripper = new PDFTextStripper();
                String text = textStripper.getText(doc);
    
                lines = Arrays.asList(text.split(System.lineSeparator()));
                currentLine = 0;
    
            }
    
        }
    }
    
    // Returning false ends the reading process
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
    
        if (key == null) {
            key = new Text();
        }
    
        if (value == null) {
            value = new Text();
        }
    
        if (currentLine < lines.size()) {
            String line = lines.get(currentLine);
    
            key.set(line);
    
            value.set("");
            currentLine++;
    
            return true;
        } else {
    
            // All lines are read? -> end
            key = null;
            value = null;
            return false;
        }
    }
    
    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }
    
    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }
    
    @Override
    public float getProgress() throws IOException, InterruptedException {
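        // Fraction of lines consumed so far; guard against a missing or empty line list
        if (lines == null || lines.isEmpty()) {
            return 0.0f;
        }
        return (float) currentLine / lines.size();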
    }
    
    @Override
    public void close() throws IOException {
    
        // If done close the doc
        if (doc != null) {
            doc.close();
        }
    
    }
    }
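
The iteration contract of the reader above (advance, expose the current line, report progress) can be tried out without a Hadoop cluster. The following is only a minimal sketch under stated assumptions: `LineCursor` is a hypothetical helper that is not part of Hadoop or PDFBox, it takes already extracted text instead of a PDF, and it splits on `\R` (any line break) rather than `System.lineSeparator()`, which is a variation on the code above and not what the reader itself does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * Hadoop-free sketch of the cursor logic in PDFLineRecordReader.
 * LineCursor is a hypothetical name used for illustration only.
 */
final class LineCursor {

    private final List<String> lines;
    private int next = 0;          // index of the next line to hand out
    private String current = null; // line returned by the last successful advance()

    LineCursor(String extractedText) {
        // "\\R" matches \n, \r\n and \r, so the split does not depend on the platform
        this.lines = extractedText.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(extractedText.split("\\R"));
    }

    /** Moves to the next line; returning false means the input is exhausted. */
    boolean advance() {
        if (next < lines.size()) {
            current = lines.get(next++);
            return true;
        }
        current = null;
        return false;
    }

    /** The line selected by the last advance(), or null once exhausted. */
    String current() {
        return current;
    }

    /** Fraction of lines consumed, between 0 and 1; empty input reports 0. */
    float progress() {
        return lines.isEmpty() ? 0.0f : (float) next / lines.size();
    }

    public static void main(String[] args) {
        LineCursor cursor = new LineCursor("first line\r\nsecond line\nthird line");
        while (cursor.advance()) {
            System.out.println(cursor.current());
        }
    }
}
```

In the real reader, `advance()` corresponds to `nextKeyValue()`, `current()` to `getCurrentKey()`, and `progress()` to `getProgress()`.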
    

    Hope this helps!
