Removing Watermark from a PDF using iTextSharp

栀梦 2020-12-03 06:34

I added a watermark to a PDF using PdfStamper. Here is the code:

for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++)
{
    iTextSharp.text.Rectangl         


        
2 Answers
  • 2020-12-03 06:47

    I'm going to give you the benefit of the doubt based on the statement "I even tried to add watermark as layer" and assume that you are working on content that you are creating and not trying to unwatermark someone else's content.

    PDFs use Optional Content Groups (OCG) to store objects as layers. If you add your watermark text to a layer you can fairly easily remove it later.

    The code below is a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It uses code based on Bruno's original Java code found here. The code is in three sections. Section 1 creates a sample PDF for us to work with. Section 2 creates a new PDF from the first and applies a watermark to each page on a separate layer. Section 3 creates a final PDF from the second but removes the layer with our watermark text. See the code comments for additional details.

    When you create a PdfLayer object you can assign it a name that appears within a PDF reader. Unfortunately, I can't find a way to access this name, so the code below looks for the actual watermark text within the layer. If you aren't using additional PDF layers, I would recommend looking only for /OC within the content stream and not wasting time looking for your actual watermark text. If you find a way to look for /OC groups by name, please let me know!

    using System;
    using System.Windows.Forms;
    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;
    
    namespace WindowsFormsApplication1 {
        public partial class Form1 : Form {
            public Form1() {
                InitializeComponent();
            }
    
            private void Form1_Load(object sender, EventArgs e) {
                string workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
                string startFile = Path.Combine(workingFolder, "StartFile.pdf");
                string watermarkedFile = Path.Combine(workingFolder, "Watermarked.pdf");
                string unwatermarkedFile = Path.Combine(workingFolder, "Un-watermarked.pdf");
                string watermarkText = "This is a test";
    
                //SECTION 1
                //Create a 5 page PDF, nothing special here
                using (FileStream fs = new FileStream(startFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
                    using (Document doc = new Document(PageSize.LETTER)) {
                        using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                            doc.Open();
    
                            for (int i = 1; i <= 5; i++) {
                                doc.NewPage();
                                doc.Add(new Paragraph(String.Format("This is page {0}", i)));
                            }
    
                            doc.Close();
                        }
                    }
                }
    
                //SECTION 2
                //Create our watermark on a separate layer. The only difference here is that we are adding the watermark to a PdfLayer, which is an OCG or Optional Content Group
                PdfReader reader1 = new PdfReader(startFile);
                using (FileStream fs = new FileStream(watermarkedFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
                    using (PdfStamper stamper = new PdfStamper(reader1, fs)) {
                        int pageCount1 = reader1.NumberOfPages;
                        //Create a new layer
                        PdfLayer layer = new PdfLayer("WatermarkLayer", stamper.Writer);
                        for (int i = 1; i <= pageCount1; i++) {
                            iTextSharp.text.Rectangle rect = reader1.GetPageSize(i);
                            //Get the ContentByte object
                            PdfContentByte cb = stamper.GetUnderContent(i);
                            //Tell the CB that the next commands should be "bound" to this new layer
                            cb.BeginLayer(layer);
                            cb.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 50);
                            PdfGState gState = new PdfGState();
                            gState.FillOpacity = 0.25f;
                            cb.SetGState(gState);
                            cb.SetColorFill(BaseColor.BLACK);
                            cb.BeginText();
                            cb.ShowTextAligned(PdfContentByte.ALIGN_CENTER, watermarkText, rect.Width / 2, rect.Height / 2, 45f);
                            cb.EndText();
                            //"Close" the layer
                            cb.EndLayer();
                        }
                    }
                }
    
                //SECTION 3
                //Remove the layer created above
                //First we bind a reader to the watermarked file, then strip out a bunch of things, and finally use a simple stamper to write out the edited reader
                PdfReader reader2 = new PdfReader(watermarkedFile);
    
                //NOTE, This will destroy all layers in the document, only use if you don't have additional layers
                //Remove the OCG group completely from the document.
                //reader2.Catalog.Remove(PdfName.OCPROPERTIES);
    
                //Clean up the reader, optional
                reader2.RemoveUnusedObjects();
    
                //Placeholder variables
                PRStream stream;
                String content;
                PdfDictionary page;
                PdfArray contentarray;
    
                //Get the page count
                int pageCount2 = reader2.NumberOfPages;
                //Loop through each page
                for (int i = 1; i <= pageCount2; i++) {
                    //Get the page
                    page = reader2.GetPageN(i);
                    //Get the raw content
                    contentarray = page.GetAsArray(PdfName.CONTENTS);
                    if (contentarray != null) {
                        //Loop through content
                        for (int j = 0; j < contentarray.Size; j++) {
                            //Get the raw byte stream
                            stream = (PRStream)contentarray.GetAsStream(j);
                            //Convert to a string. NOTE, you might need a different encoding here
                            content = System.Text.Encoding.ASCII.GetString(PdfReader.GetStreamBytes(stream));
                            //Look for the OCG token in the stream as well as our watermarked text
                            if (content.IndexOf("/OC") >= 0 && content.IndexOf(watermarkText) >= 0) {
                                //Remove it by giving it zero length and zero data
                                stream.Put(PdfName.LENGTH, new PdfNumber(0));
                                stream.SetData(new byte[0]);
                            }
                        }
                    }
                }
    
                //Write the content out
                using (FileStream fs = new FileStream(unwatermarkedFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
                    using (PdfStamper stamper = new PdfStamper(reader2, fs)) {
    
                    }
                }
                this.Close();
            }
        }
    }
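
    For anyone who wants to look layers up by name rather than by watermark text, here is a minimal sketch. It assumes the names are stored in the /Name entry of each dictionary referenced from the OCGs array under OCProperties in the catalog. LayerNames and List are hypothetical names chosen for illustration, not part of iTextSharp, and this has not been verified against 5.1.1.0 specifically.

    ```csharp
    using System.Collections.Generic;
    using iTextSharp.text.pdf;

    public static class LayerNames {
        // Maps the object number of each OCG to its layer name.
        // Returns an empty dictionary when the document has no layers.
        public static Dictionary<int, string> List(PdfReader reader) {
            var result = new Dictionary<int, string>();
            PdfDictionary ocProps = reader.Catalog.GetAsDict(PdfName.OCPROPERTIES);
            if (ocProps == null) return result;
            PdfArray ocgs = ocProps.GetAsArray(PdfName.OCGS);
            if (ocgs == null) return result;

            for (int i = 0; i < ocgs.Size; i++) {
                // The array entry itself is an indirect reference; GetAsDict resolves it
                PdfIndirectReference reference = ocgs.GetAsIndirectObject(i);
                PdfDictionary ocg = ocgs.GetAsDict(i);
                if (reference == null || ocg == null) continue;
                PdfString name = ocg.GetAsString(PdfName.NAME);
                if (name != null) result[reference.Number] = name.ToUnicodeString();
            }
            return result;
        }
    }
    ```

    The object numbers can then be matched against the entries in each page's /Properties resource dictionary to find which /OC property name in the content stream belongs to which layer.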
    
  • 2020-12-03 07:05

    As an extension to Chris's answer, a VB.Net class for removing a layer is included at the bottom of this post which should be a bit more precise.

    1. It goes through the PDF's list of layers (stored in the OCGs array in the OCProperties dictionary in the file's catalog). This array contains indirect references to the objects in the PDF file that hold each layer's name.
    2. It goes through the properties of each page (also stored in a dictionary) to find the properties that point to the layer objects (via indirect references).
    3. It does an actual parse of the content stream to find instances of the pattern /OC /{PagePropertyReference} BDC {Actual Content} EMC so it can remove just these segments as appropriate.

    The code then cleans up as many of the remaining references as it can. It can be called like this:

    Public Shared Sub RemoveWatermark(path As String, savePath As String)
      Using reader = New PdfReader(path)
        Using fs As New FileStream(savePath, FileMode.Create, FileAccess.Write, FileShare.None)
          Using stamper As New PdfStamper(reader, fs)
            Using remover As New PdfLayerRemover(reader)
              remover.RemoveByName("WatermarkLayer")
            End Using
          End Using
        End Using
      End Using
    End Sub
    

    Full class:

    Imports System.Collections.Generic
    Imports System.Linq
    Imports iTextSharp.text
    Imports iTextSharp.text.io
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser
    
    Public Class PdfLayerRemover
      Implements IDisposable
    
      Private _reader As PdfReader
      Private _layerNames As New List(Of String)
    
      Public Sub New(reader As PdfReader)
        _reader = reader
      End Sub
    
      Public Sub RemoveByName(name As String)
        _layerNames.Add(name)
      End Sub
    
      Private Sub RemoveLayers()
        Dim ocProps = _reader.Catalog.GetAsDict(PdfName.OCPROPERTIES)
        If ocProps Is Nothing Then Return
        Dim ocgs = ocProps.GetAsArray(PdfName.OCGS)
        If ocgs Is Nothing Then Return
    
        'Get a list of indirect references to the layer information
        Dim layerRefs = (From l In (From i In ocgs
                                    Select Obj = DirectCast(PdfReader.GetPdfObject(i), PdfDictionary),
                                           Ref = DirectCast(i, PdfIndirectReference))
                         Where _layerNames.Contains(l.Obj.GetAsString(PdfName.NAME).ToString)
                         Select l.Ref).ToList
        'Get a list of numbers for these layer references
        Dim layerRefNumbers = (From l In layerRefs Select l.Number).ToList
    
        'Loop through the pages
        Dim page As PdfDictionary
        Dim propsToRemove As IEnumerable(Of PdfName)
        For i As Integer = 1 To _reader.NumberOfPages
          'Get the page
          page = _reader.GetPageN(i)
    
          'Get the page properties which reference the layers to remove
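          Dim props = _reader.GetPageResources(i).GetAsDict(PdfName.PROPERTIES)
          'Pages that use no layers have no Properties dictionary, so skip them
          If props Is Nothing Then Continue For
          propsToRemove = (From k In props.Keys
                           Let propRef = props.GetAsIndirectObject(k)
                           Where propRef IsNot Nothing AndAlso layerRefNumbers.Contains(propRef.Number)
                           Select k).ToList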
    
          'Get the raw content
          Dim contentarray = page.GetAsArray(PdfName.CONTENTS)
          If contentarray IsNot Nothing Then
            For j As Integer = 0 To contentarray.Size - 1
              'Parse the stream data looking for references to a property pointing to the layer.
              Dim stream = DirectCast(contentarray.GetAsStream(j), PRStream)
              Dim streamData = PdfReader.GetStreamBytes(stream)
              Dim newData = GetNewStream(streamData, (From p In propsToRemove Select p.ToString.Substring(1)))
    
              'Store data without the stream references in the stream
              If newData.Length <> streamData.Length Then
                stream.SetData(newData)
                stream.Put(PdfName.LENGTH, New PdfNumber(newData.Length))
              End If
            Next
          End If
    
          'Remove the properties from the page data
          For Each prop In propsToRemove
            props.Remove(prop)
          Next
        Next
    
        'Remove references to the layer in the master catalog
        RemoveIndirectReferences(ocProps, layerRefNumbers)
    
        'Clean up unused objects
        _reader.RemoveUnusedObjects()
      End Sub
    
      Private Shared Function GetNewStream(data As Byte(), propsToRemove As IEnumerable(Of String)) As Byte()
        Dim positions As New List(Of Integer)
        positions.Add(0)
    
        Dim pos As Integer
        Dim inGroup As Boolean = False
        Dim tokenizer As New PRTokeniser(New RandomAccessFileOrArray(New RandomAccessSourceFactory().CreateSource(data)))
        While tokenizer.NextToken
          If tokenizer.TokenType = PRTokeniser.TokType.NAME AndAlso tokenizer.StringValue = "OC" Then
            pos = CInt(tokenizer.FilePointer - 3)
            If tokenizer.NextToken() AndAlso tokenizer.TokenType = PRTokeniser.TokType.NAME Then
              If Not inGroup AndAlso propsToRemove.Contains(tokenizer.StringValue) Then
                inGroup = True
                positions.Add(pos)
              End If
            End If
          ElseIf tokenizer.TokenType = PRTokeniser.TokType.OTHER AndAlso tokenizer.StringValue = "EMC" AndAlso inGroup Then
            positions.Add(CInt(tokenizer.FilePointer))
            inGroup = False
          End If
        End While
        positions.Add(data.Length)
    
        If positions.Count > 2 Then
          Dim length As Integer = 0
          For i As Integer = 0 To positions.Count - 1 Step 2
            length += positions(i + 1) - positions(i)
          Next
    
          'VB array declarations take the upper bound, so use length - 1 to get exactly length bytes
          Dim newData(length - 1) As Byte
          length = 0
          For i As Integer = 0 To positions.Count - 1 Step 2
            Array.Copy(data, positions(i), newData, length, positions(i + 1) - positions(i))
            length += positions(i + 1) - positions(i)
          Next
    
          Return newData
        Else
          Return data
        End If
      End Function
    
      Private Shared Sub RemoveIndirectReferences(dict As PdfDictionary, refNumbers As IEnumerable(Of Integer))
        Dim newDict As PdfDictionary
        Dim arrayData As PdfArray
        Dim indirect As PdfIndirectReference
        Dim i As Integer
    
        For Each key In dict.Keys
          newDict = dict.GetAsDict(key)
          arrayData = dict.GetAsArray(key)
          If newDict IsNot Nothing Then
            RemoveIndirectReferences(newDict, refNumbers)
          ElseIf arrayData IsNot Nothing Then
            i = 0
            While i < arrayData.Size
              indirect = arrayData.GetAsIndirectObject(i)
              'Entries that are not indirect references come back as Nothing
              If indirect IsNot Nothing AndAlso refNumbers.Contains(indirect.Number) Then
                arrayData.Remove(i)
              Else
                i += 1
              End If
            End While
          End If
        Next
      End Sub
    
    #Region "IDisposable Support"
      Private disposedValue As Boolean ' To detect redundant calls
    
      ' IDisposable
      Protected Overridable Sub Dispose(disposing As Boolean)
        If Not Me.disposedValue Then
          If disposing Then
            RemoveLayers()
          End If
    
          ' TODO: free unmanaged resources (unmanaged objects) and override Finalize() below.
          ' TODO: set large fields to null.
        End If
        Me.disposedValue = True
      End Sub
    
      ' TODO: override Finalize() only if Dispose(ByVal disposing As Boolean) above has code to free unmanaged resources.
      'Protected Overrides Sub Finalize()
      '    ' Do not change this code.  Put cleanup code in Dispose(ByVal disposing As Boolean) above.
      '    Dispose(False)
      '    MyBase.Finalize()
      'End Sub
    
      ' This code added by Visual Basic to correctly implement the disposable pattern.
      Public Sub Dispose() Implements IDisposable.Dispose
        ' Do not change this code.  Put cleanup code in Dispose(ByVal disposing As Boolean) above.
        Dispose(True)
        GC.SuppressFinalize(Me)
      End Sub
    #End Region
    
    End Class
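
    Step 3 of the list above can also be illustrated without iTextSharp. The following is a rough C# sketch of the idea only, not a translation of GetNewStream: it works on a content stream that has already been decoded to a string, treats whitespace-separated chunks as tokens, and assumes the words BDC, BMC and EMC never appear as standalone words inside a string literal. MarkedContentStripper and Strip are hypothetical names used for illustration.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    public static class MarkedContentStripper {
        // Removes every "/OC /<name> BDC ... EMC" span whose <name> is listed in
        // namesToRemove (given without the leading slash). Everything outside the
        // removed spans is copied through unchanged.
        public static string Strip(string content, ICollection<string> namesToRemove) {
            MatchCollection tokens = Regex.Matches(content, @"\S+");
            var result = new StringBuilder();
            int copyFrom = 0; // start of the next span to keep
            int depth = 0;    // marked-content nesting depth inside a span being removed

            for (int i = 0; i < tokens.Count; i++) {
                string token = tokens[i].Value;
                if (depth == 0) {
                    bool startsTarget = token == "/OC"
                        && i + 2 < tokens.Count
                        && tokens[i + 2].Value == "BDC"
                        && tokens[i + 1].Value.StartsWith("/", StringComparison.Ordinal)
                        && namesToRemove.Contains(tokens[i + 1].Value.Substring(1));
                    if (startsTarget) {
                        // Keep everything up to the "/OC" token, then start skipping
                        result.Append(content, copyFrom, tokens[i].Index - copyFrom);
                        depth = 1;
                        i += 2; // step over the property name and BDC
                    }
                } else if (token == "BDC" || token == "BMC") {
                    depth++; // nested marked content inside the layer
                } else if (token == "EMC") {
                    depth--;
                    if (depth == 0) copyFrom = tokens[i].Index + tokens[i].Length;
                }
            }

            // An unterminated span suggests the assumptions do not hold; change nothing
            if (depth != 0) return content;
            result.Append(content, copyFrom, content.Length - copyFrom);
            return result.ToString();
        }
    }
    ```

    Unlike a simple search for /OC, this tracks nesting, so a layer that contains further BDC or BMC sections is removed up to its own matching EMC. For real files a proper tokenizer such as PRTokeniser is the safer choice, since it understands string literals and inline images.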
    