Edit an existing PDF file using iTextSharp

死守一世寂寞 2021-01-17 06:59

I have a PDF file which I am processing by converting it to text using the following code:

ITextExtractionStrategy strategy = new SimpleTextExtractionSt         


        
2 answers
  •  滥情空心
    2021-01-17 07:29

    As already mentioned in the comments: what you essentially need is a SimpleTextExtractionStrategy replacement which returns not just text but text with positions. The LocationTextExtractionStrategy would be a good starting point for that, as it already collects the text with positions (in order to put it in the right order).

    If you look into the source of LocationTextExtractionStrategy, you'll see that it keeps its text pieces in a member List<TextChunk> locationalResult. A TextChunk (an inner class of LocationTextExtractionStrategy) represents a text piece (originally drawn by a single text drawing operation) with location information. In GetResultantText this list is sorted (top to bottom, left to right, all relative to the text base line) and reduced to a string.

    What you need, is something like this LocationTextExtractionStrategy with the difference that you retrieve the (sorted) text pieces including their positions.

    Unfortunately, the locationalResult member is private. If it were at least protected, you could simply derive your new strategy from LocationTextExtractionStrategy. Instead you now have to copy its source and add to the copy (or do some introspection/reflection magic).

    Your addition would be a new method similar to GetResultantText. This method might recognize all the text on the same line (just like GetResultantText does) and either

    • do the analysis / search for ambiguities itself and return a list of the locations (start and end) of any found ambiguities; or

    • put the text found for the current line into a single TextChunk instance together with the effective start and end locations of that line, and eventually return a List<TextChunk>, each element of which represents a text line; if you do this, the calling code does the analysis to find ambiguities, and if it finds one, it has the start and end location of the line the ambiguity is on. Beware: TextChunk in the original strategy is protected, but you need to make it public for this approach to work.
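The second option above can be sketched without any iTextSharp types. In the sketch below, PositionedChunk and LineGrouper are hypothetical names standing in for your public copy of TextChunk and the new method. It assumes horizontal text, takes a tolerance of 1 unit for "same line", and always joins chunks with a single space, which is simpler than what GetResultantText does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text chunk with position data; not an iTextSharp type.
public sealed class PositionedChunk
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }
    public float BaselineY { get; }

    public PositionedChunk(string text, float startX, float endX, float baselineY)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        BaselineY = baselineY;
    }
}

public static class LineGrouper
{
    // Merges chunks whose base lines lie within 'tolerance' of the first chunk
    // of a line into one chunk per line, ordered top to bottom.
    public static List<PositionedChunk> GroupIntoLines(
        IEnumerable<PositionedChunk> chunks, float tolerance = 1f)
    {
        var lines = new List<List<PositionedChunk>>();

        // PDF y coordinates grow upwards, so descending y means top to bottom.
        foreach (var chunk in chunks.OrderByDescending(c => c.BaselineY))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null
                && Math.Abs(current[0].BaselineY - chunk.BaselineY) <= tolerance)
            {
                current.Add(chunk);
            }
            else
            {
                lines.Add(new List<PositionedChunk> { chunk });
            }
        }

        // Within each line, order left to right and collapse to a single chunk
        // that spans from the leftmost start to the rightmost end.
        return lines.Select(line =>
        {
            var ordered = line.OrderBy(c => c.StartX).ToList();
            return new PositionedChunk(
                string.Join(" ", ordered.Select(c => c.Text)),
                ordered[0].StartX,
                ordered.Max(c => c.EndX),
                line[0].BaselineY);
        }).ToList();
    }
}
```

The calling code can then run its ambiguity check on each returned Text and, on a hit, use StartX, EndX and BaselineY of that element as the position of the line to mark.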

    Either way, you eventually have the start and end locations of the ambiguities, or at least of the lines the ambiguities are on. Now you have to highlight the line in question (as you say, you have to "mark the entire line of the pdf (Color that line with Red)").

    To manipulate a given PDF you use a PdfStamper. You can mark a line on a page by either

    • getting the UnderContent for that page from the PdfStamper and filling a rectangle in red there using your position data; the disadvantage of this approach is that if the original PDF has already underlaid the line with filled areas, your mark will be hidden beneath them; or by

    • getting the OverContent for that page from the PdfStamper and filling a somewhat transparent rectangle in red; or by

    • adding a highlight annotation to the page.
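As an illustration of the OverContent option, here is a minimal sketch assuming iTextSharp 5.x. LineMarker and MarkArea are hypothetical names, the 0.3 opacity is an arbitrary choice, and the rectangle is assumed to be given in the page's user space coordinates as delivered by the text extraction.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Hypothetical helper; only the iTextSharp types it calls are library API.
public static class LineMarker
{
    // Fills a semi-transparent red rectangle over 'area' on one page of the
    // source PDF and writes the result to targetPath.
    public static void MarkArea(string sourcePath, string targetPath,
                                int pageNumber, Rectangle area)
    {
        PdfReader reader = new PdfReader(sourcePath);
        try
        {
            using (FileStream output = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
            {
                PdfStamper stamper = new PdfStamper(reader, output);
                PdfContentByte canvas = stamper.GetOverContent(pageNumber);

                canvas.SaveState();
                // Transparency keeps the text under the mark readable.
                PdfGState state = new PdfGState();
                state.FillOpacity = 0.3f; // assumed value, tune as needed
                canvas.SetGState(state);
                canvas.SetColorFill(BaseColor.RED);
                canvas.Rectangle(area.Left, area.Bottom, area.Width, area.Height);
                canvas.Fill();
                canvas.RestoreState();

                // Closing the stamper writes the manipulated PDF to the stream.
                stamper.Close();
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call might look like LineMarker.MarkArea("input.pdf", "marked.pdf", 1, new Rectangle(36f, 700f, 559f, 712f)), where the file names and coordinates are placeholders. Swapping GetOverContent for GetUnderContent and dropping the PdfGState lines would give the first option instead.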

    To make things even smoother, you might want to extend your copy of TextChunk (the inner class in your copy of LocationTextExtractionStrategy) to keep not only the base line coordinates but also the maximal ascent and descent of the glyphs used. Obviously you'd have to fill in that information in RenderText...

    Doing so you know exactly the height required for your marking rectangle.
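To show where the ascent and descent come from, here is a sketch assuming iTextSharp 5.x, in which RenderText is taken to be overridable and TextRenderInfo offers GetAscentLine and GetDescentLine. Note that it swaps in a simpler technique than the one described above: instead of extending a copy of TextChunk, it derives from the original strategy and records a parallel, unsorted list of per-chunk boxes. ChunkBox and BoxCollectingStrategy are hypothetical names, and unrotated text is assumed.

```csharp
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf.parser;

// Hypothetical holder for one text piece and its bounding box.
public class ChunkBox
{
    public string Text;
    public Rectangle Box;
}

// Hypothetical strategy: behaves like LocationTextExtractionStrategy but
// additionally records a box for every text drawing operation.
public class BoxCollectingStrategy : LocationTextExtractionStrategy
{
    public readonly List<ChunkBox> Boxes = new List<ChunkBox>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class keep collecting its own (private) chunks.
        base.RenderText(renderInfo);

        // For unrotated text, the start of the descent line is the lower left
        // corner and the end of the ascent line is the upper right corner.
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

        Boxes.Add(new ChunkBox
        {
            Text = renderInfo.GetText(),
            Box = new Rectangle(
                bottomLeft[Vector.I1], bottomLeft[Vector.I2],
                topRight[Vector.I1], topRight[Vector.I2])
        });
    }
}
```

An instance would be passed to PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy). The lowest Bottom and highest Top among the boxes on one line then give the vertical extent of the marking rectangle for that line.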
