Question
I have extracted text from a PDF line by line using PDFBox, so that I can process it sentence by sentence with my algorithm.
I detect sentence boundaries by looking for a period (.) followed by a word whose first letter is capitalized. The problem is that when a sentence ends with a word that carries a superscript, the extractor treats the superscript as a normal character and places it right after the period.
For example, when the expression "2 to the power of 22" is the last word of a sentence, i.e. it is followed by a period, it is extracted as 2.22, which makes it difficult to identify the end of the sentence.
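To make the failure concrete, here is a minimal sketch of the boundary rule described above. The class name SentenceSplitter and the regular expression are my own illustration, not code from my project, and the sketch assumes that at least one whitespace character separates the period from the capitalized word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "period + capitalized word" rule.
class SentenceSplitter {
    // A period, then whitespace, then (lookahead) an uppercase letter.
    private static final Pattern BOUNDARY = Pattern.compile("\\.\\s+(?=\\p{Lu})");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        int start = 0;
        while (m.find()) {
            // Keep the period with the sentence it terminates.
            sentences.add(text.substring(start, m.start() + 1));
            start = m.end();
        }
        if (start < text.length()) {
            // Whatever is left after the last boundary is the final sentence.
            sentences.add(text.substring(start));
        }
        return sentences;
    }
}
```

With this rule, SentenceSplitter.split("The value is 4. Next sentence.") yields two sentences, but SentenceSplitter.split("The value is 2.22 Next sentence.") yields only one, because the period is followed by the extracted superscript digits instead of whitespace.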
Please suggest a way to exclude superscripts, or a different approach to identifying the end of a sentence.
Thanks.
Answer 1:
I am answering my own question, as others may be directed here.
I solved this by following @mkl's suggestion. After observing the values returned by getYScale() in PDFStreamEngine.java, I concluded that the superscripts in my document had a Y scale below 8.9663. So I added a condition in PDFStreamEngine.java just before the TextPosition is created; the TextPosition is what PDFTextStripper.java goes on to process. The code is below:
// Only pass glyphs at or above the body-text scale on to the stripper;
// anything smaller (the superscripts, in my case) is skipped.
if (textXctm.getYScale() >= 8.9663) {
    processTextPosition(
        new TextPosition(
            pageRotation,
            pageWidth,
            pageHeight,
            textMatrixStart,
            endXPosition,
            endYPosition,
            totalVerticalDisplacementDisp,
            widthText,
            spaceWidthDisp,
            c,
            codePoints,
            font,
            fontSizeText,
            (int) (fontSizeText * textMatrix.getXScale())
        ));
}
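The condition can also be tried in isolation, without patching PDFBox. The following is only a sketch: Glyph and SuperscriptFilter are hypothetical stand-ins I made up for illustration, not PDFBox classes, and the yScale field merely plays the role of textXctm.getYScale() in the snippet above.

```java
import java.util.List;

// Hypothetical stand-in classes; none of this is PDFBox API.
class SuperscriptFilter {
    // One extracted glyph together with the vertical scale it was drawn at.
    static final class Glyph {
        final String text;
        final float yScale;

        Glyph(String text, float yScale) {
            this.text = text;
            this.yScale = yScale;
        }
    }

    // Concatenate only the glyphs whose scale is at least minYScale.
    static String keepBodyText(List<Glyph> glyphs, float minYScale) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            if (g.yScale >= minYScale) {
                out.append(g.text);
            }
        }
        return out.toString();
    }
}
```

Given the glyphs "2" and "." at scale 10 followed by two "2" glyphs at scale 6, keepBodyText with a threshold of 8.9663f returns "2.", so the period again ends the sentence. Note that the threshold is specific to the document: any other text drawn below it, such as subscripts or footnotes, would presumably be dropped as well.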
Please let me know if this approach has any flaws when it comes to eliminating only the superscripts. Thanks.
Source: https://stackoverflow.com/questions/22720283/excluding-super-script-when-extracting-text-from-pdf