Question
I'm trying to automatically extract important keywords from a PDF file. I can already get the text out of the PDF document, but now I need to know which font size and font family these keywords have.
This is the code I have so far:
Main
public static void main(String[] args) throws IOException {
    String src = "SEM_081145.pdf";
    PdfReader reader = new PdfReader(src);
    SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
    PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
    }
    out.flush();
    out.close();
}
I have also implemented a TextExtractionStrategy, SemTextExtractionStrategy, which looks like this:
public class SemTextExtractionStrategy implements TextExtractionStrategy {
    private String text;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();
        System.out.println(renderInfo.getFont().getFontType());
        System.out.print(text);
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        return text;
    }
}
I can get the font type, but there is no method to get the font size. Is there another way to get the font size of the current text segment?
Or are there other libraries that can extract the font size from text segments? I have already looked into PDFBox and PDFTextStream. The PDF shareware library from Aspose would do the job perfectly, but it is very expensive and I need to use an open source project.
Answer 1:
You can adapt the code provided in this answer, in particular this code snippet:
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;
This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.
Answer 2:
Thanks to Alexis I could convert his C# solution into Java code:
text = renderInfo.getText();
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
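The snippet above covers the size half of the question. For the family half, one option (an assumption on my part, not something taken from the answers here) is to read the PostScript name of the font, which in iText 5 should be available as renderInfo.getFont().getPostscriptFontName(), and then strip the six-letter subset tag that embedded subset fonts carry. The helper below is a minimal sketch in plain Java; FontNames and stripSubsetPrefix are hypothetical names, and the iText call appears only in a comment so that the block runs without the library.

```java
final class FontNames {
    private FontNames() {
    }

    /**
     * Removes a leading subset tag such as "ABCDEF+" (six uppercase letters
     * followed by a plus sign) from a PostScript font name, if present.
     */
    static String stripSubsetPrefix(String postscriptName) {
        if (postscriptName != null && postscriptName.matches("[A-Z]{6}\\+.+")) {
            return postscriptName.substring(7);
        }
        return postscriptName;
    }

    public static void main(String[] args) {
        // Inside renderText(...) this would be fed from iText (assumed call):
        //   String raw = renderInfo.getFont().getPostscriptFontName();
        String raw = "ABCDEF+Arial-BoldMT";
        System.out.println(stripSubsetPrefix(raw));
    }
}
```

Note that the result is a PostScript name such as Arial-BoldMT, which usually contains the family but also style information, so further parsing may be needed if only the family is wanted.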
Answer 3:
I had some trouble using Alexis' and Prine's solution, since it doesn't deal with rotated text correctly. So this is what I do (sorry, in Scala):
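val x0 = info.getAscentLine.getEndPoint
val x1 = info.getBaseline.getStartPoint
val x2 = info.getBaseline.getEndPoint
// |(x2 - x1) x (x1 - x0)|^2 / |x2 - x1|^2 is the squared distance from x0 to the baseline
val length1 = (x2.subtract(x1)).cross((x1.subtract(x0))).lengthSquared
val length2 = x2.subtract(x1).lengthSquared
(length1, length2) match {
    case (0, 0) => 0
    // take the square root to get the height itself rather than its square
    case _ => math.sqrt(length1 / length2).toFloat
}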
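For readers following the Java answers, the same perpendicular-distance idea can be sketched without any iText types. The class below is an illustration only: TextGeometry and distanceToLine are hypothetical names, and in real code the six coordinates would come from calls like renderInfo.getBaseline().getStartPoint().get(0), as used in the Java answer above.

```java
final class TextGeometry {
    private TextGeometry() {
    }

    /**
     * Perpendicular distance from point (px, py) to the line through
     * (ax, ay) and (bx, by). Returns 0 for a degenerate (zero-length) line.
     */
    static double distanceToLine(double ax, double ay, double bx, double by,
                                 double px, double py) {
        double dx = bx - ax;
        double dy = by - ay;
        double baseLength = Math.hypot(dx, dy);
        if (baseLength == 0.0) {
            return 0.0;
        }
        // The 2D cross product gives the parallelogram area; dividing by the base yields the height.
        double cross = dx * (py - ay) - dy * (px - ax);
        return Math.abs(cross) / baseLength;
    }

    public static void main(String[] args) {
        // Horizontal text: baseline (0,0)-(10,0), ascent line ends at (10,7)
        System.out.println(distanceToLine(0, 0, 10, 0, 10, 7)); // prints 7.0
        // Text rotated by 90 degrees: baseline (0,0)-(0,10), ascent line ends at (-7,10)
        System.out.println(distanceToLine(0, 0, 0, 10, -7, 10)); // prints 7.0
    }
}
```

In the rotated case, simply subtracting the y coordinate of the baseline start from the y coordinate of the ascent end would give 10, the length of the text run, which is why the axis-aligned approach breaks down.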
Answer 4:
If you want the exact font size, use the following code in your renderText:
float fontsize = renderInfo.getAscentLine().getStartPoint().get(1)
        - renderInfo.getDescentLine().getStartPoint().get(1);
Modify this as indicated in the other answers for rotated text.
Source: https://stackoverflow.com/questions/10879336/itext-get-font-size-and-family-of-a-text-segment