I highlighted the word in pdf using the code in the answer to the following question: [Highlight words in a pdf using itextsharp, not displaying highlighted word in browser]2
Due to my <50 rep this is posted as an answer but really it is a comment.
Thanks mkl for pointing me in the right direction however the problem remains for me sometimes event when this is taken into account (am working in VB).
For some PDFs (from a lists of similar documents scanned) I too get the error and it turns out that the pFont.GetWidth call is returning ZERO for each character hence the result of GetStringWidth returns ZERO also.
I note that I can pass GetWidth the whole string or one character at a time and the results are the same - when it works it works great when not I get zero.
I downloaded the latest ITEXT (5-5-10-0) and the problem remains - since this problem occurs rarely I have temporarily resorted to checking the output of getstringwidth and if it is to become zero then I estimate a response based on an average with a cross check against it being too high versus the pSingleSpaceWidth value.
I know this is quite poor but my requirements are modest, it happens rarely and I haven't the time or requisite knowledge (of Itext/PDF) to investigate how to obtain an equivalent GetWidth result accurately.
I do hope this info helps someone and possibly if someone does have the required knowledge to improve/fix this idea I am all ears.
I could reproduce the issue. It is not an issue of iTextSharp but of my prior answer the code of which the OP uses; additionally there may be an issue in the myLocationTextExtractionStrategy
class provided by a third party, Jcis, in this answer, or merely an issue in the vb-to-c# translator I translated it with to not have to work with VB and C# in a mix. Additionally there is an issue in the OP's code causing excessive markings.
highlightPDFAnnotation
from hereIn that method I use the
myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();
without any further initialization, but one member needs to be initialized:
strategy.UndercontentHorizontalScaling = 100;
If one fails to do this, the strategy unfortunately uses a 0 default value. As a result all string width calculations are multiplied by 0...
I'm going to fix the code in that answer after finishing this one.
myLocationTextExtractionStrategy
, at least in my C# version of itThe myLocationTextExtractionStrategy
method GetStringWidth
in the C# translation I work with is defined as
private float GetStringWidth(string str, float curFontSize, float pSingleSpaceWidth, DocumentFont pFont)
{
char[] chars = str.ToCharArray();
float totalWidth = 0;
float w = 0;
foreach (char c in chars) {
w = pFont.GetWidth(c) / 1000;
totalWidth += (w * curFontSize + this.UndercontentCharacterSpacing) * this.UndercontentHorizontalScaling / 100;
}
return totalWidth;
}
Unfortunately pFont.GetWidth(c)
is defined to return an int
; thus, the division pFont.GetWidth(c) / 1000
is executed as an integer operation, so nearly all the time the result was 0 (because glyph widths usually are between 0 and 1000). This line had to be corrected to:
w = pFont.GetWidth(c) / 1000.0f;
The original VB version looks like this:
Private Function GetStringWidth(ByVal str As String, ByVal curFontSize As Single, ByVal pSingleSpaceWidth As Single, ByVal pFont As DocumentFont) As Single
Dim chars() As Char = str.ToCharArray()
Dim totalWidth As Single = 0
Dim w As Single = 0
For Each c As Char In chars
w = pFont.GetWidth(c) / 1000
totalWidth += (w * curFontSize + Me.UndercontentCharacterSpacing) * Me.UndercontentHorizontalScaling / 100
Next
Return totalWidth
End Function
I don't know VB good enough to tell whether here w = pFont.GetWidth(c) / 1000
is calculated as a floating point or an integer operation. In the latter case this would indicate an issue here, too.
Interestingly both issues had the same effect: GetStringWidth
always returns 0. This method is used in GetRectangleFromText
to calculate a correction factor:
'Text Width in User Space Units
Dim LineRealWidth As Single = LastChunk.PosRight - FirstChunk.PosLeft
'Text Width in Text Units
Dim LineTextWidth As Single = GetStringWidth(sTextinChunks, LastChunk.curFontSize, LastChunk.charSpaceWidth, ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
'TransformationValue value for Interpolation
Dim TransformationValue As Single = LineRealWidth / LineTextWidth
Thus, TransformationValue
in the end was at infinity.
This TransformationValue
later on was multiplied with an offset in certain special cases resulting in rectangles like this:
/Rect[259.2 679.86 -9223372036854775808 692.58]
which go out of range for Adobe Reader.
But there also is a genuine issue in the OP's code in this question here, he uses a global strategy object
myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();
int pageCount = reader.NumberOfPages;
for (int pageno = 1; pageno <= pageCount; pageno++)
{
string currentText = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);
Unfortunately the strategy remembers everything it has parsed. While processing a page, therefore, the code thought its content was the aggregated content of all pages up to the current one.
So it should instead be
int pageCount = reader.NumberOfPages;
for (int pageno = 1; pageno <= pageCount; pageno++)
{
myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();
strategy.UndercontentHorizontalScaling = 100;
string currentText = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);
Now it works for me.
myLocationTextExtractionStrategy
in C#using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
// how to highlight a text or word in a pdf file using iTextsharp?
// https://stackoverflow.com/questions/6523243/how-to-highlight-a-text-or-word-in-a-pdf-file-using-itextsharp
//
// myLocationTextExtractionStrategy has been shared by Jcis in his answer
// https://stackoverflow.com/a/11076968/1729265
// as VB class. This has been translated to C# using Telerik with some minor
// translation glitch fixes.
namespace iTextSharp_Playground.Content
{
public class myLocationTextExtractionStrategy : ITextExtractionStrategy
{
private float _UndercontentCharacterSpacing = 0;
private float _UndercontentHorizontalScaling = 0;
private SortedList<string, DocumentFont> ThisPdfDocFonts;
public static bool DUMP_STATE = false;
//* a summary of all found text
private List<TextChunk> locationalResult = new List<TextChunk>();
//*
// * Creates a new text extraction renderer.
//
public myLocationTextExtractionStrategy()
{
ThisPdfDocFonts = new SortedList<string, DocumentFont>();
}
//*
// * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
//
public virtual void BeginTextBlock()
{
}
//*
// * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
//
public virtual void EndTextBlock()
{
}
//*
// * @param str
// * @return true if the string starts with a space character, false if the string is empty or starts with a non-space character
//
private bool StartsWithSpace(String str)
{
if (str.Length == 0) {
return false;
}
return str[0] == ' ';
}
//*
// * @param str
// * @return true if the string ends with a space character, false if the string is empty or ends with a non-space character
//
private bool EndsWithSpace(String str)
{
if (str.Length == 0) {
return false;
}
return str[str.Length - 1] == ' ';
}
public float UndercontentCharacterSpacing {
get { return _UndercontentCharacterSpacing; }
set { _UndercontentCharacterSpacing = value; }
}
public float UndercontentHorizontalScaling
{
get { return _UndercontentHorizontalScaling; }
set { _UndercontentHorizontalScaling = value; }
}
public virtual String GetResultantText()
{
if (DUMP_STATE) {
DumpState();
}
locationalResult.Sort();
StringBuilder sb = new StringBuilder();
TextChunk lastChunk = null;
foreach (TextChunk chunk in locationalResult) {
if (lastChunk == null) {
sb.Append(chunk.text);
} else {
if (chunk.SameLine(lastChunk)) {
float dist = chunk.DistanceFromEndOf(lastChunk);
if (dist < -chunk.charSpaceWidth) {
sb.Append(' ');
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
} else if (dist > chunk.charSpaceWidth / 2f && !StartsWithSpace(chunk.text) && !EndsWithSpace(lastChunk.text)) {
sb.Append(' ');
}
sb.Append(chunk.text);
} else {
sb.Append((char)10);
sb.Append(chunk.text);
}
}
lastChunk = chunk;
}
return sb.ToString();
}
public List<iTextSharp.text.Rectangle> GetTextLocations(string pSearchString, System.StringComparison pStrComp)
{
List<iTextSharp.text.Rectangle> FoundMatches = new List<iTextSharp.text.Rectangle>();
StringBuilder sb = new StringBuilder();
List<TextChunk> ThisLineChunks = new List<TextChunk>();
bool bStart = false;
bool bEnd = false;
TextChunk FirstChunk = null;
TextChunk LastChunk = null;
string sTextInUsedChunks = null;
foreach (TextChunk chunk in locationalResult) {
if (ThisLineChunks.Count > 0 && !chunk.SameLine(ThisLineChunks[ThisLineChunks.Count-1]))
{
if (sb.ToString().IndexOf(pSearchString, pStrComp) > -1) {
string sLine = sb.ToString();
//Check how many times the Search String is present in this line:
int iCount = 0;
int lPos = 0;
lPos = sLine.IndexOf(pSearchString, 0, pStrComp);
while (lPos > -1) {
iCount += 1;
if (lPos + pSearchString.Length > sLine.Length)
break; // TODO: might not be correct. Was : Exit Do
else
lPos = lPos + pSearchString.Length;
lPos = sLine.IndexOf(pSearchString, lPos, pStrComp);
}
//Process each match found in this Text line:
int curPos = 0;
for (int i = 1; i <= iCount; i++) {
string sCurrentText = null;
int iFromChar = 0;
int iToChar = 0;
iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp);
curPos = iFromChar;
iToChar = iFromChar + pSearchString.Length - 1;
sCurrentText = null;
sTextInUsedChunks = null;
FirstChunk = null;
LastChunk = null;
//Get first and last Chunks corresponding to this match found, from all Chunks in this line
foreach (TextChunk chk in ThisLineChunks) {
sCurrentText = sCurrentText + chk.text;
//Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
if (!bStart && sCurrentText.Length - 1 >= iFromChar) {
FirstChunk = chk;
bStart = true;
}
//Keep getting Text from Chunks while we are in the part where the matching String had been found
if (bStart & !bEnd) {
sTextInUsedChunks = sTextInUsedChunks + chk.text;
}
//If we get out the matching String part then get this Chunk (last Chunk)
if (!bEnd && sCurrentText.Length - 1 >= iToChar) {
LastChunk = chk;
bEnd = true;
}
//If we already have first and last Chunks enclosing the Text where our String pSearchString has been found
//then it's time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
if (bStart & bEnd) {
FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp));
curPos = curPos + pSearchString.Length;
bStart = false;
bEnd = false;
break; // TODO: might not be correct. Was : Exit For
}
}
}
}
sb.Clear();
ThisLineChunks.Clear();
}
ThisLineChunks.Add(chunk);
sb.Append(chunk.text);
}
return FoundMatches;
}
private iTextSharp.text.Rectangle GetRectangleFromText(TextChunk FirstChunk, TextChunk LastChunk, string pSearchString, string sTextinChunks, int iFromChar, int iToChar, System.StringComparison pStrComp)
{
//There are cases where Chunk contains extra text at begining and end, we don't want this text locations, we need to extract the pSearchString location inside
//for these cases we need to crop this String (left and Right), and measure this excedent at left and right, at this point we don't have any direct way to make a
//Transformation from text space points to User Space units, the matrix for making this transformation is not accesible from here, so for these special cases when
//the String needs to be cropped (Left/Right) We'll interpolate between the width from Text in Chunk (we have this value in User Space units), then i'll measure Text corresponding
//to the same String but in Text Space units, finally from the relation betweeenthese 2 values I get the TransformationValue I need to use for all cases
//Text Width in User Space Units
float LineRealWidth = LastChunk.PosRight - FirstChunk.PosLeft;
//Text Width in Text Units
float LineTextWidth = GetStringWidth(sTextinChunks, LastChunk.curFontSize, LastChunk.charSpaceWidth, ThisPdfDocFonts.Values[LastChunk.FontIndex]);
//TransformationValue value for Interpolation
float TransformationValue = LineRealWidth / LineTextWidth;
//In the worst case, we'll need to crop left and right:
int iStart = sTextinChunks.IndexOf(pSearchString, pStrComp);
int iEnd = iStart + pSearchString.Length - 1;
string sLeft = null;
if (iStart == 0)
sLeft = null;
else
sLeft = sTextinChunks.Substring(0, iStart);
string sRight = null;
if (iEnd == sTextinChunks.Length - 1)
sRight = null;
else
sRight = sTextinChunks.Substring(iEnd + 1, sTextinChunks.Length - iEnd - 1);
//Measure cropped Text at left:
float LeftWidth = 0;
if (iStart > 0) {
LeftWidth = GetStringWidth(sLeft, LastChunk.curFontSize, LastChunk.charSpaceWidth, ThisPdfDocFonts.Values[LastChunk.FontIndex]);
LeftWidth = LeftWidth * TransformationValue;
}
//Measure cropped Text at right:
float RightWidth = 0;
if (iEnd < sTextinChunks.Length - 1) {
RightWidth = GetStringWidth(sRight, LastChunk.curFontSize, LastChunk.charSpaceWidth, ThisPdfDocFonts.Values[LastChunk.FontIndex]);
RightWidth = RightWidth * TransformationValue;
}
//LeftWidth is the text width at left we need to exclude, FirstChunk.distParallelStart is the distance to left margin, both together will give us this LeftOffset
float LeftOffset = FirstChunk.distParallelStart + LeftWidth;
//RightWidth is the text width at right we need to exclude, FirstChunk.distParallelEnd is the distance to right margin, we substract RightWidth from distParallelEnd to get RightOffset
float RightOffset = LastChunk.distParallelEnd - RightWidth;
//Return this Rectangle
return new iTextSharp.text.Rectangle(LeftOffset, FirstChunk.PosBottom, RightOffset, FirstChunk.PosTop);
}
private float GetStringWidth(string str, float curFontSize, float pSingleSpaceWidth, DocumentFont pFont)
{
char[] chars = str.ToCharArray();
float totalWidth = 0;
float w = 0;
foreach (char c in chars) {
w = pFont.GetWidth(c) / 1000f;
totalWidth += (w * curFontSize + this.UndercontentCharacterSpacing) * this.UndercontentHorizontalScaling / 100;
}
return totalWidth;
}
private void DumpState()
{
foreach (TextChunk location in locationalResult) {
location.PrintDiagnostics();
Console.WriteLine();
}
}
public virtual void RenderText(TextRenderInfo renderInfo)
{
LineSegment segment = renderInfo.GetBaseline();
TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
var _with1 = location;
//Chunk Location:
// Debug.Print(renderInfo.GetText);
_with1.PosLeft = renderInfo.GetDescentLine().GetStartPoint()[Vector.I1];
_with1.PosRight = renderInfo.GetAscentLine().GetEndPoint()[Vector.I1];
_with1.PosBottom = renderInfo.GetDescentLine().GetStartPoint()[Vector.I2];
_with1.PosTop = renderInfo.GetAscentLine().GetEndPoint()[Vector.I2];
//Chunk Font Size: (Height)
_with1.curFontSize = _with1.PosTop - segment.GetStartPoint()[Vector.I2];
//Use Font name and Size as Key in the SortedList
string StrKey = renderInfo.GetFont().PostscriptFontName + _with1.curFontSize.ToString();
//Add this font to ThisPdfDocFonts SortedList if it's not already present
if (!ThisPdfDocFonts.ContainsKey(StrKey))
ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont());
//Store the SortedList index in this Chunk, so we can get it later
_with1.FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey);
locationalResult.Add(location);
}
//*
// * Represents a chunk of text, it's orientation, and location relative to the orientation vector
//
public class TextChunk : IComparable<TextChunk>
{
//* the text of the chunk
internal String text;
//* the starting location of the chunk
internal Vector startLocation;
//* the ending location of the chunk
internal Vector endLocation;
//* unit vector in the orientation of the chunk
internal Vector orientationVector;
//* the orientation as a scalar for quick sorting
internal int orientationMagnitude;
//* perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system)
// * we round to the nearest integer to handle the fuzziness of comparing floats
internal int distPerpendicular;
//* distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system)
internal float distParallelStart;
//* distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system)
internal float distParallelEnd;
//* the width of a single space character in the font of the chunk
internal float charSpaceWidth;
private float _PosLeft;
private float _PosRight;
private float _PosTop;
private float _PosBottom;
private float _curFontSize;
private int _FontIndex;
public int FontIndex {
get { return _FontIndex; }
set { _FontIndex = value; }
}
public float PosLeft {
get { return _PosLeft; }
set { _PosLeft = value; }
}
public float PosRight {
get { return _PosRight; }
set { _PosRight = value; }
}
public float PosTop {
get { return _PosTop; }
set { _PosTop = value; }
}
public float PosBottom {
get { return _PosBottom; }
set { _PosBottom = value; }
}
public float curFontSize {
get { return _curFontSize; }
set { _curFontSize = value; }
}
public TextChunk(String str, Vector startLocation, Vector endLocation, float charSpaceWidth)
{
this.text = str;
this.startLocation = startLocation;
this.endLocation = endLocation;
this.charSpaceWidth = charSpaceWidth;
Vector oVector = endLocation.Subtract(startLocation);
if (oVector.Length == 0) {
oVector = new Vector(1, 0, 0);
}
orientationVector = oVector.Normalize();
orientationMagnitude = Convert.ToInt32(Math.Truncate(Math.Atan2(orientationVector[Vector.I2], orientationVector[Vector.I1]) * 1000));
Vector origin = new Vector(0, 0, 1);
distPerpendicular = Convert.ToInt32((startLocation.Subtract(origin)).Cross(orientationVector)[Vector.I3]);
distParallelStart = orientationVector.Dot(startLocation);
distParallelEnd = orientationVector.Dot(endLocation);
}
public void PrintDiagnostics()
{
Console.WriteLine("Text (@" + Convert.ToString(startLocation) + " -> " + Convert.ToString(endLocation) + "): " + text);
Console.WriteLine("orientationMagnitude: " + orientationMagnitude);
Console.WriteLine("distPerpendicular: " + distPerpendicular);
Console.WriteLine("distParallel: " + distParallelStart);
}
//*
// * @param as the location to compare to
// * @return true is this location is on the the same line as the other
//
public bool SameLine(TextChunk a)
{
if (orientationMagnitude != a.orientationMagnitude) {
return false;
}
if (distPerpendicular != a.distPerpendicular) {
return false;
}
return true;
}
//*
// * Computes the distance between the end of 'other' and the beginning of this chunk
// * in the direction of this chunk's orientation vector. Note that it's a bad idea
// * to call this for chunks that aren't on the same line and orientation, but we don't
// * explicitly check for that condition for performance reasons.
// * @param other
// * @return the number of spaces between the end of 'other' and the beginning of this chunk
//
public float DistanceFromEndOf(TextChunk other)
{
float distance = distParallelStart - other.distParallelEnd;
return distance;
}
//*
// * Compares based on orientation, perpendicular distance, then parallel distance
// * @see java.lang.Comparable#compareTo(java.lang.Object)
//
public int CompareTo(TextChunk rhs)
{
if (object.ReferenceEquals(this, rhs)) {
return 0;
}
// not really needed, but just in case
int rslt = 0;
rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude);
if (rslt != 0) {
return rslt;
}
rslt = CompareInts(distPerpendicular, rhs.distPerpendicular);
if (rslt != 0) {
return rslt;
}
// note: it's never safe to check floating point numbers for equality, and if two chunks
// are truly right on top of each other, which one comes first or second just doesn't matter
// so we arbitrarily choose this way.
rslt = distParallelStart < rhs.distParallelStart ? -1 : 1;
return rslt;
}
//*
// *
// * @param int1
// * @param int2
// * @return comparison of the two integers
//
private static int CompareInts(int int1, int int2)
{
return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
}
}
//*
// * no-op method - this renderer isn't interested in image events
// * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(com.itextpdf.text.pdf.parser.ImageRenderInfo)
// * @since 5.0.1
//
public void RenderImage(ImageRenderInfo renderInfo)
{
// do nothing
}
}
}
//=======================================================
//Service provided by Telerik (www.telerik.com)
//Conversion powered by NRefactory.
//Twitter: @telerik
//Facebook: facebook.com/telerik
//=======================================================