How to remove headers and footers from PDF file using iText in Java

折月煮酒 提交于 2021-02-09 05:37:44

问题


I am using the PDF iText library to convert PDF to text.

Below is my code to convert PDF to text file using Java.

public class PdfConverter {

/** The original PDF that will be parsed. */
public static final String pdfFileName = "jdbc_tutorial.pdf";
/** The resulting text file. */
public static final String RESULT = "preface.txt";

/**
 * Parses a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));

    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
        System.out.println(strategy.getResultantText());
    }
    out.flush();
    out.close();
    reader.close();
}

/**
 * Main method.
 * @param    args    no arguments needed
 * @throws IOException
 */
public static void main(String[] args) throws IOException {
    new PdfConverter().parsePdf(pdfFileName, RESULT);
}
}

The above code works for extracting PDF to text. But my requirement is to ignore header and footer and extract only content from PDF file.


回答1:


Because your pdf has headers and footers, it would be marked as artifacts(if not its just a text or content placed at the position of a header or footer). If its marked as artifacts, you can extract it using ParseTaggedPdf. You can also make use of ExtractPageContentArea if ParseTaggedPdf doesn't work. You can check for a few examples related to it.

The above solution is general and depends on the file. If you really need an alternate solution, you can use apache API's like PdfBox, tika and others like PDFTextStream. The solution which i'm giving below wont work if you have to persist with iText and can't move on to other libraries. In PdfBox you can use PDFTextStripperByArea or PDFTextStripper. Look at the JavaDoc or some examples if you need to know how to use it.




回答2:


Using IText I found one example in this site http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/

In this you create a rectangle that defines the bounds of the text you are getting.

PdfReader reader = new PdfReader(pdf);
PrintWriter out= new PrintWriter(new FileOutputStream(txt));
//Creating the rectangle
Rectangle rect=new Rectangle(70,80,420,500);
//creating a filter based on the rectangle
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for(int i=1;i<=reader.getNumberOfPages();i+){
    //setting the filter on the text extraction strategy
    strategy= new FilteredTextRenderListener(
      new LocationTextExtractionStrategy(),filter);
    out.println(PdfTextExtractor.getTextFromPage(reader,i,strategy));
}
out.flush();out.close();

as the web page describes this, It should work even if the pdf is not tagged.




回答3:


You can read specific locations of a pdf file. Just mark those areas that you need to get text from and leave the areas where the header and footer are shown. I have done it and here is the complete code. itext reading specific location from pdf file runs in intellij and gives desired output but executable jar throws error



来源:https://stackoverflow.com/questions/27884283/how-to-remove-headers-and-footers-from-pdf-file-using-itext-in-java

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!