Getting text fonts from a PDF file using iText

南方客 2021-01-25 17:05

I have been trying to extract the attributes (font, font size, color, etc.) of each word in a PDF document using the iText library. I could extract the text from every page but not th

1 Answer
  • 2021-01-25 17:47

    I'm not a Java person, so I can't give you working code, but hopefully I can get you 95% of the way there.

    First, you'll need to create a class that implements the interface com.itextpdf.text.pdf.parser.TextExtractionStrategy.

    Then you can pass an instance of this class as the third parameter to:

    PdfTextExtractor.getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy)

    One of the methods of that interface is renderText, which gets called for every text block that is processed. It is passed a TextRenderInfo, which has a method called getFont that should give you what you're looking for. Store the contents of that in a buffer of some sort, and after getTextFromPage returns you can inspect that buffer to see each font.

    If you want to see an example of implementing that interface, look up the code for SimpleTextExtractionStrategy online. Otherwise, here's a C# version that pretty much does what you're looking for.
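
    The "buffer of some sort" could look something like the minimal Java sketch below. FontRunBuffer, Run and record are hypothetical names, not part of iText, and the class uses only the standard library so it can be tried without a PDF. It assumes that an iText 5 strategy would call record from renderText with info.getText() and info.getFont().getPostscriptFontName(), and that merging consecutive chunks that share a font name is the grouping you want.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/**
 * Buffer for the (text, font name) pairs seen during extraction.
 * Consecutive chunks that share a font name are merged into one run.
 * Hypothetical helper: none of these names come from iText.
 */
class FontRunBuffer {

    /** One stretch of text rendered in a single font. */
    static final class Run {
        final String fontName;
        final StringBuilder text = new StringBuilder();

        Run(String fontName) {
            this.fontName = fontName;
        }

        @Override
        public String toString() {
            return fontName + ": \"" + text + "\"";
        }
    }

    private final List<Run> runs = new ArrayList<>();

    /**
     * Records one chunk of text and the font it was drawn in.
     * Assumed wiring inside an iText 5 TextExtractionStrategy:
     *   public void renderText(TextRenderInfo info) {
     *       buffer.record(info.getText(), info.getFont().getPostscriptFontName());
     *   }
     */
    void record(String text, String fontName) {
        Run last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
        // Start a new run whenever the font differs from the previous chunk.
        if (last == null || !Objects.equals(last.fontName, fontName)) {
            last = new Run(fontName);
            runs.add(last);
        }
        last.text.append(text);
    }

    /** Read-only view of the runs, in the order they were recorded. */
    List<Run> runs() {
        return Collections.unmodifiableList(runs);
    }

    public static void main(String[] args) {
        FontRunBuffer buffer = new FontRunBuffer();
        // Made-up chunks standing in for successive renderText calls.
        buffer.record("Hello ", "Helvetica");
        buffer.record("world", "Helvetica");
        buffer.record("!", "Helvetica-Bold");
        for (Run run : buffer.runs()) {
            System.out.println(run);
        }
    }
}
```

    Once getTextFromPage has returned, iterating over runs() gives each stretch of text together with the font name it was recorded under. Font size and color are not covered by this sketch.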
