I need to search a word in a existing pdf file and i want to highlight the text or word
and save the pdf file
I have an idea using PdfAnnotation.CreateMarku
Thanks Jcis!
After a couple of hours of research and thinking, i found your solution, which helped me to solve my Problems.
there were 2 little bugs.
first: the stamper needs to be closed before the reader, otherwise it throws an exception.
Public Sub PDFTextGetter(ByVal pSearch As String, ByVal SC As StringComparison, ByVal SourceFile As String, ByVal DestinationFile As String)
Dim stamper As iTextSharp.text.pdf.PdfStamper = Nothing
Dim cb As iTextSharp.text.pdf.PdfContentByte = Nothing
Me.Cursor = Cursors.WaitCursor
If File.Exists(SourceFile) Then
Dim pReader As New PdfReader(SourceFile)
stamper = New iTextSharp.text.pdf.PdfStamper(pReader, New System.IO.FileStream(DestinationFile, FileMode.Create))
PB.Value = 0 : PB.Maximum = pReader.NumberOfPages
For page As Integer = 1 To pReader.NumberOfPages
Dim strategy As myLocationTextExtractionStrategy = New myLocationTextExtractionStrategy
'cb = stamper.GetUnderContent(page)
cb = stamper.GetOverContent(page)
Dim state As New PdfGState()
state.FillOpacity = 0.3F
cb.SetGState(state)
'Send some data contained in PdfContentByte, looks like the first is always cero for me and the second 100, but i'm not sure if this could change in some cases
strategy.UndercontentCharacterSpacing = cb.CharacterSpacing
strategy.UndercontentHorizontalScaling = cb.HorizontalScaling
'It's not really needed to get the text back, but we have to call this line ALWAYS,
'because it triggers the process that will get all chunks from PDF into our strategy Object
Dim currentText As String = PdfTextExtractor.GetTextFromPage(pReader, page, strategy)
'The real getter process starts in the following line
Dim MatchesFound As List(Of iTextSharp.text.Rectangle) = strategy.GetTextLocations(pSearch, SC)
'Set the fill color of the shapes, I don't use a border because it would make the rect bigger
'but maybe using a thin border could be a solution if you see the currect rect is not big enough to cover all the text it should cover
cb.SetColorFill(BaseColor.PINK)
'MatchesFound contains all text with locations, so do whatever you want with it, this highlights them using PINK color:
For Each rect As iTextSharp.text.Rectangle In MatchesFound
' cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
cb.SaveState()
cb.SetColorFill(BaseColor.YELLOW)
cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
cb.Fill()
cb.RestoreState()
Next
'cb.Fill()
PB.Value = PB.Value + 1
Next
stamper.Close()
pReader.Close()
End If
Me.Cursor = Cursors.Default
End Sub
second: your solution dont work, when the searched text is in the last line of the extraced text.
Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
Dim sb As New StringBuilder()
Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
Dim bStart As Boolean, bEnd As Boolean
Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
Dim sTextInUsedChunks As String = vbNullString
' For Each chunk As TextChunk In locationalResult
For j As Integer = 0 To locationalResult.Count - 1
Dim chunk As TextChunk = locationalResult(j)
If chunk.text.Contains(pSearchString) Then
Thread.Sleep(1)
End If
If ThisLineChunks.Count > 0 AndAlso (Not chunk.SameLine(ThisLineChunks.Last) Or j = locationalResult.Count - 1) Then
If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
Dim sLine As String = sb.ToString
'Check how many times the Search String is present in this line:
Dim iCount As Integer = 0
Dim lPos As Integer
lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
Do While lPos > -1
iCount += 1
If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
Loop
'Process each match found in this Text line:
Dim curPos As Integer = 0
For i As Integer = 1 To iCount
Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer
iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
curPos = iFromChar
iToChar = iFromChar + pSearchString.Length - 1
sCurrentText = vbNullString
sTextInUsedChunks = vbNullString
FirstChunk = Nothing
LastChunk = Nothing
'Get first and last Chunks corresponding to this match found, from all Chunks in this line
For Each chk As TextChunk In ThisLineChunks
sCurrentText = sCurrentText & chk.text
'Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
FirstChunk = chk
bStart = True
End If
'Keep getting Text from Chunks while we are in the part where the matching String had been found
If bStart And Not bEnd Then
sTextInUsedChunks = sTextInUsedChunks & chk.text
End If
'If we get out the matching String part then get this Chunk (last Chunk)
If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
LastChunk = chk
bEnd = True
End If
'If we already have first and last Chunks enclosing the Text where our String pSearchString has been found
'then it's time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
If bStart And bEnd Then
FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
curPos = curPos + pSearchString.Length
bStart = False : bEnd = False
Exit For
End If
Next
Next
End If
sb.Clear()
ThisLineChunks.Clear()
End If
ThisLineChunks.Add(chunk)
sb.Append(chunk.text)
Next
Return FoundMatches
End Function
I convert Jcis's VB project to WpfApplication C#(file in google drive) , and even apply Boris's bugfixes , but the project doesn`t run. It is very appreciated if someone who understands the algorithm of the program, fix it.
I've found how to do this, just in case someone needs to get words or sentences with locations (coordinates) from a PDF document you'll find this example Project HERE , I used VB.NET 2010 for this. Remember to add a reference to your iTextSharp DLL in this Project.
I added my own TextExtraction Strategy Class, based on Class LocationTextExtractionStrategy. I focused on TextChunks, because they already have these coordinates.
There are some known limitations like:
@Jcis, I actually managed a workaround for handling multiple searches using your example as a starting point. I use your project as a reference in a c# project, and altered what it does. Instead of just highlighting I actually have it drawing a white rectangle around the search term, and then using the rectangle coordinates, place a form field. I also had to swap the contentbyte writing mode to getovercontent so that I block out the searched text entirely. What I actually did was to create a string array of search terms, and then using a for loop, I create as many different text fields as I need.
Test.Form1 formBuilder = new Test.Form1();
string[] fields = new string[] { "%AccountNumber%", "%MeterNumber%", "%EmailFieldHolder%", "%AddressFieldHolder%", "%EmptyFieldHolder%", "%CityStateZipFieldHolder%", "%emptyFieldHolder1%", "%emptyFieldHolder2%", "%emptyFieldHolder3%", "%emptyFieldHolder4%", "%emptyFieldHolder5%", "%emptyFieldHolder6%", "%emptyFieldHolder7%", "%emptyFieldHolder8%", "%SiteNameFieldHolder%", "%SiteNameFieldHolderWithExtraSpace%" };
//int a = 0;
for (int a = 0; a < fields.Length; )
{
string[] fieldNames = fields[a].Split('%');
string[] fieldName = Regex.Split(fieldNames[1], "Field");
formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, htmlToPdf, finalhtmlToPdf, fieldName[0]);
File.Delete(htmlToPdf);
System.Array.Clear(fieldNames, 0, 2);
System.Array.Clear(fieldName, 0, 1);
a++;
if (a == fields.Length)
{
break;
}
string[] fieldNames1 = fields[a].Split('%');
string[] fieldName1 = Regex.Split(fieldNames1[1], "Field");
formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, finalhtmlToPdf, htmlToPdf, fieldName1[0]);
File.Delete(finalhtmlToPdf);
System.Array.Clear(fieldNames1, 0, 2);
System.Array.Clear(fieldName1, 0, 1);
a++;
}
It bounces the PDFTextGetter function in your example back and forth between two files until I achieve the finished product. It works really well, and it would not have been possible without your initial project, so thank you for that. I also altered your VB to do the text field mapping like so;
For Each rect As iTextSharp.text.Rectangle In MatchesFound
cb.Rectangle(rect.Left, rect.Bottom + 1, rect.Width, rect.Height + 4)
Dim field As New TextField(stamper.Writer, rect, FieldName & Fields)
Dim form = stamper.AcroFields
Dim fieldKeys = form.Fields.Keys
stamper.AddAnnotation(field.GetTextField(), page)
Fields += 1
Next
Just figured I would share what I managed to do with your project as a backbone. It even increments the field names as I need them to. I also had to add a new parameter to your function, but that's not worth listing here. Thank you again for this great head start.
This is one of those "sounds easy but is actually really complicated" things. See Mark's posts here and here. Ultimately you'll probably be pointed to LocationTextExtractionStrategy. Good luck! If you actually find out how to do it post it here, there several people wondering exactly what you are wondering!