How to extract font styles of text contents using pdfbox?

青春惊慌失措 2020-12-10 18:03

I am using the PDFBox library to extract text content from a PDF file. I am able to extract all the text, but I couldn't find a method to extract the font styles.

3 answers
  • 2020-12-10 18:44
    // PDFBox 2.x: list the fonts declared in each page's resources.
    File file = new File("sample.pdf");
    try (PDDocument document = PDDocument.load(file))
    {
        for (int i = 0; i < document.getNumberOfPages(); ++i)
        {
            PDPage page = document.getPage(i);
            PDResources res = page.getResources();
            if (res == null)
            {
                continue; // page declares no resources
            }
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                System.out.println(font.getName());
            }
        }
    }
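
The names printed by `font.getName()` often encode the style, for example `ABCDEF+Arial-BoldMT`, where the six uppercase letters and the `+` are the subset tag of an embedded font subset. The following is a minimal sketch of reading a style out of such a name. It assumes the font follows the common naming convention, which a PDF is not required to do. `FontNameStyle` and its methods are hypothetical helpers, not part of PDFBox. Where a font descriptor is present, its flags (in PDFBox 2.x, `PDFontDescriptor.isItalic()`, `isForceBold()` and `getFontWeight()`) are likely a better source than the name.

```java
import java.util.Locale;

// Hypothetical helper: guesses a style from a PDF font name. Heuristic only.
final class FontNameStyle {

    /** Removes a subset tag such as "ABCDEF+" from the front of a font name. */
    static String stripSubsetTag(String name) {
        if (name == null) {
            return "";
        }
        if (name.length() > 7 && name.charAt(6) == '+') {
            boolean allUpper = true;
            for (int i = 0; i < 6; i++) {
                char c = name.charAt(i);
                if (c < 'A' || c > 'Z') {
                    allUpper = false;
                    break;
                }
            }
            if (allUpper) {
                return name.substring(7);
            }
        }
        return name;
    }

    /** True if the name contains a token that conventionally marks a bold face. */
    static boolean looksBold(String name) {
        String n = stripSubsetTag(name).toLowerCase(Locale.ROOT);
        return n.contains("bold") || n.contains("black") || n.contains("heavy");
    }

    /** True if the name contains a token that conventionally marks an italic face. */
    static boolean looksItalic(String name) {
        String n = stripSubsetTag(name).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** Returns "Bold", "Italic", "BoldItalic" or "Regular". */
    static String describe(String name) {
        boolean bold = looksBold(name);
        boolean italic = looksItalic(name);
        if (bold && italic) {
            return "BoldItalic";
        }
        if (bold) {
            return "Bold";
        }
        return italic ? "Italic" : "Regular";
    }

    public static void main(String[] args) {
        String[] samples = {
            "ABCDEF+Arial-BoldMT", "TimesNewRomanPS-BoldItalicMT", "Helvetica-Oblique", "Helvetica"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```

In the loop above, `FontNameStyle.describe(font.getName())` would be called in place of printing the raw name.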
    
  • 2020-12-10 18:47
    // PDFBox 1.x API (PDFTextStripper lives in org.apache.pdfbox.util there).
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class pdf2box {
        public static void main(String args[])
        {
            try
            {
                PDDocument pddDocument = PDDocument.load("table2.pdf");
                PDFTextStripper textStripper = new PDFTextStripper();
                System.out.println(textStripper.getText(pddDocument));
                // Font map held by the stripper after processing (resource name -> PDFont).
                System.out.println(textStripper.getFonts());
                pddDocument.close();
            }
            catch (Exception ex)
            {
                ex.printStackTrace();
            }
        }
    }
    
  • 2020-12-10 18:58

    This is not the right way to extract fonts. To read the fonts, iterate over the PDF pages and read the fonts from each page's resources, as below:

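    // PDFBox 1.x API.
    PDDocument doc = PDDocument.load("C:/mydoc3.pdf");
    List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
    for (PDPage page : pages) {
        Map<String, PDFont> pageFonts = page.getResources().getFonts();
        for (PDFont font : pageFonts.values()) {
            System.out.println(font.getBaseFont());
        }
    }
    doc.close();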
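
The answers above list the fonts a page declares, but they do not tie a font to a particular piece of text, which is what the question asks for. In PDFBox 2.x, one common approach is to subclass `PDFTextStripper` and override `writeString(String, List<TextPosition>)`, where each `TextPosition` exposes `getUnicode()`, `getFont()` and `getFontSizeInPt()`. The sketch below shows only the step that would follow: merging consecutive glyphs with the same font and size into styled runs. `Glyph`, `Run` and `StyleRuns` are hypothetical holder types introduced for illustration, not PDFBox classes, and the sample data in `main` is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: merges consecutive glyphs that share a font and size.
final class StyleRuns {

    /** One glyph with the font data reported for it. */
    static final class Glyph {
        final String text;
        final String fontName;
        final float sizePt;

        Glyph(String text, String fontName, float sizePt) {
            this.text = text;
            this.fontName = fontName;
            this.sizePt = sizePt;
        }
    }

    /** A stretch of text rendered in a single font at a single size. */
    static final class Run {
        final String fontName;
        final float sizePt;
        final StringBuilder text = new StringBuilder();

        Run(String fontName, float sizePt) {
            this.fontName = fontName;
            this.sizePt = sizePt;
        }

        @Override
        public String toString() {
            return "[" + fontName + " " + sizePt + "pt] " + text;
        }
    }

    /** Starts a new run whenever the font name or the size changes. */
    static List<Run> group(List<Glyph> glyphs) {
        List<Run> runs = new ArrayList<>();
        Run current = null;
        for (Glyph g : glyphs) {
            boolean sameStyle = current != null
                    && current.fontName.equals(g.fontName)
                    && Float.compare(current.sizePt, g.sizePt) == 0;
            if (!sameStyle) {
                current = new Run(g.fontName, g.sizePt);
                runs.add(current);
            }
            current.text.append(g.text);
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", "Helvetica-Bold", 12f));
        glyphs.add(new Glyph("i", "Helvetica-Bold", 12f));
        glyphs.add(new Glyph(" ", "Helvetica", 10f));
        glyphs.add(new Glyph("x", "Helvetica", 10f));
        for (Run run : group(glyphs)) {
            System.out.println(run);
        }
    }
}
```

Inside an overridden `writeString`, each `TextPosition` would be mapped to a `Glyph` using `getUnicode()`, `getFont().getName()` and `getFontSizeInPt()` before calling `group`.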
    