How can I extract a JavaScript from a PDF file with a command line tool?

清歌不尽 2021-02-05 23:30

How can I extract a JavaScript object from a PDF file using a command line tool?

I am trying to make a GUI using Python with this function.

I found th

1 answer
  •  一向
    2021-02-05 23:55

    When you deal with JavaScript in PDFs, you have to be aware of two cases (which you cannot necessarily distinguish in advance, before closely investigating the file in question).

    1. "Harmless" JavaScript
    2. Malicious JavaScript

    Case 1: Harmless, "useful", "open" JavaScript

    The OP gave a link to a sample JavaScript-loaded PDF from PlanetPDF:

    • http://www.planetpdf.com/planetpdf/pdfs/ppjslc_commonex_3.pdf

    That one is easy to handle. Just use pdfinfo -js (but be sure that you use one of the most recent, Poppler-based releases -- the XPDF-based pdfinfo does not know about -js!)

    Here is the result:

    $ pdfinfo -js ppjslc_commonex_3.pdf
    
     Title:          Planet PDF JavaScript Learning Center Example #2
     Author:         Chris Dahl, ARTS PDF Global Services
     Creator:        PScript5.dll Version 5.2.2
     Producer:       Acrobat Distiller 6.0.1 (Windows)
     CreationDate:   Thu Oct 28 18:13:38 2004
     ModDate:        Thu Oct 28 18:17:46 2004
     Tagged:         no
     UserProperties: no
     Suspects:       no
     Form:           AcroForm
     JavaScript:     yes
     Pages:          1
     Encrypted:      no
     Page size:      612 x 792 pts (letter)
     Page rot:       0
     File size:      84720 bytes
     Optimized:      no
     PDF version:    1.5
    
     Name Dictionary "docOpened":
     // variable to store whether document has been opened already or not
     var bAlreadyOpened;
    
     function docOpened()
     {
    
        if(bAlreadyOpened != "true")
        {
            // document has just been opened
            var d = new Date();
            var sDate = util.printd("mm/dd/yyyy", d);
    
                     // set date now
                     app.alert("About to insert date into field now");
            this.getField("todaysDate").value = sDate;
    
            // now set bAlreadyOpened to true so it doesn’t
            // run again
     bAlreadyOpened = "true";
        }
        else
        {
            // document has already been opened
        }
     }
    
     // call the docOpened() function
     docOpened();
    

    As you can see, -js attempts to extract all JavaScript from the PDF automatically and prints it to standard output (stdout).

    This one was a harmless JavaScript, not trying to hide itself, not obfuscated, inserting the current date into a form field, after popping up an info message about what it is going to do.
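
    Since the question mentions a Python GUI, here is a minimal sketch of how the pdfinfo -js call could be wrapped from Python. It assumes a Poppler-based pdfinfo is on the PATH and that every named script is introduced by a header line of the form Name Dictionary "...": as in the output above; other header formats that pdfinfo may print are not handled. The function names split_scripts and extract_javascript are made up for this illustration.

```python
import re
import subprocess

# Header line that precedes each named script in the pdfinfo output shown above.
# Assumption: other kinds of script headers (if any) are ignored by this sketch.
_HEADER = re.compile(r'^\s*Name Dictionary "(?P<name>[^"]*)":\s*$')


def split_scripts(pdfinfo_output: str) -> dict:
    """Group the lines following each 'Name Dictionary "...":' header."""
    scripts = {}
    current = None
    for line in pdfinfo_output.splitlines():
        match = _HEADER.match(line)
        if match:
            # A new script starts here; the metadata block before the
            # first header is skipped because current is still None.
            current = match.group("name")
            scripts[current] = []
        elif current is not None:
            scripts[current].append(line)
    return {name: "\n".join(body).strip() for name, body in scripts.items()}


def extract_javascript(pdf_path: str) -> dict:
    """Run `pdfinfo -js` and return a {script name: source code} mapping."""
    result = subprocess.run(
        ["pdfinfo", "-js", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfinfo fails
    )
    return split_scripts(result.stdout)
```

    Calling extract_javascript("ppjslc_commonex_3.pdf") should then give a dictionary with a single "docOpened" entry for the sample file above. If two scripts share a name, the later one overwrites the earlier one in this sketch.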

    Case 2: Malicious, damaging, hidden and obfuscated JavaScript

    There are numerous examples of PDFs in the wild containing JavaScript that is not as harmless as the above, written by malware authors who are after your money, or just after the "fun" it gives them if they succeed.

    The JavaScripts in these cases are very frequently hidden and obfuscated.

    For example, in order to hide the fact that there is even JavaScript contained, they do not use the 'clear' /JavaScript and /JS names in the respective PDF object dictionaries. These names must be present for the PDF readers to know what they should do with the object.

    Instead, they use another method to express the same names:

    /#4Aava#53cript
    /J#61vaScrip#74
    /#4a#61#76#61#53#63#72#69#70#74
    [...]
    

    This method, unfortunately, was even made "legal" by the official PDF specification. It allows some or even all characters in a PDF name token to be replaced by their two-digit hexadecimal character code, each preceded by a hash sign.

    This can fool some of the more naive attempts to find the /JavaScript string inside a PDF (such as using a simple grep -a).
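
    To illustrate why a plain text search misses these names, here is a small Python sketch that decodes the #xx escapes before comparing. It only scans the raw bytes of the file, so names stored inside compressed streams or object streams stay invisible to it; the function names are made up for this example, and this is not how any of the tools mentioned below are implemented.

```python
import re

# A PDF name starts with "/" and runs until whitespace or a delimiter character.
_NAME = re.compile(rb"/[^\s/\[\]()<>{}%]+")
# A "#" followed by two hex digits stands for the byte with that value.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name(name: bytes) -> bytes:
    """Replace every #xx escape in a name token with the byte it encodes."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), name)


def find_javascript_names(data: bytes) -> list:
    """Return (raw token, decoded token) pairs for /JS and /JavaScript names."""
    hits = []
    for match in _NAME.finditer(data):
        raw = match.group(0)
        decoded = decode_pdf_name(raw)
        if decoded in (b"/JS", b"/JavaScript"):
            hits.append((raw, decoded))
    return hits


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for raw, decoded in find_javascript_names(handle.read()):
            # A token that changes when decoded was written with #xx escapes.
            note = " (obfuscated)" if raw != decoded else ""
            print(decoded.decode("ascii"), "written as", raw.decode("ascii") + note)
```

    All three obfuscated spellings listed above decode to /JavaScript with this function.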

    There are a few Free Software tools available, which can be used to dissect and analyze such cases:

    • Didier Stevens' Python scripts pdfid.py and pdf-parser.py are very useful for a first look (and even for a complete analysis) of these cases.

    • Jose Miguel Esparza's Python framework peepdf is even more powerful. It can even de-obfuscate, beautify and make readable again any obfuscated JavaScript contents inside the PDF.

    • Origami is Ruby-based, and also quite powerful. And there are a few more...

    But all these tools are only useful if you already have (at least some basic) knowledge about PDF syntax (and about JavaScript, of course).

    Here are three short examples using pdfid.py against three different PDFs:

    1. the first does not contain any JavaScript that is discovered by pdfid.py:

      $ pdfid.py nojavascript.pdf
      
       PDFiD 0.2.1  nojavascript.pdf
        PDF Header: %PDF-1.5
        obj                  193
        endobj               193
        stream                54
        endstream             54
        xref                   1
        trailer                1
        startxref              1
        /Page                  1
        /Encrypt               0
        /ObjStm                0
        /JS                    0 
        /JavaScript            0
        /AA                   12
        /OpenAction            0
        /AcroForm              1
        /JBIG2Decode           0
        /RichMedia             0
        /Launch                0
        /EmbeddedFile          0
        /XFA                   0
        /Colors > 2^24         0
      
    2. the second contains JavaScript, and the name /JavaScript appears in clear text inside the PDF:

      $ pdfid.py javascript1.pdf | grep -E '(/JS|/JavaScript)'
      
        /JS                   30
        /JavaScript           30
      
    3. the last contains JavaScript, and the name tokens /JavaScript and /JS both are obfuscated:

      $ pdfid.py javascript2.pdf | grep -E '(/JS|/JavaScript)'
      
        /JS                   30(30)
        /JavaScript           30(30)
      

      The second number in parentheses shows that pdfid.py discovered the obfuscation: 30 out of 30 /JavaScript name tokens are obscured. This makes the PDF file highly suspicious and warrants further investigation, because no "normal" PDF generating tool (that is known to me) uses this obfuscation...
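
    If you want to act on these counts from a script, the report can be parsed with a few lines of Python. This is only a sketch that assumes the plain-text layout shown in the three examples above (name, count, optional obfuscated count in parentheses); the helper names are invented for this illustration and are not part of pdfid.py.

```python
import re

# Matches lines such as "/JavaScript   30(30)": name, total count, and an
# optional count of obfuscated occurrences in parentheses.
_COUNT_LINE = re.compile(
    r"^\s*(?P<name>/\S+)\s+(?P<total>\d+)(?:\((?P<obfuscated>\d+)\))?\s*$"
)


def parse_pdfid_counts(report: str) -> dict:
    """Map each name to a (total, obfuscated) tuple; other lines are skipped."""
    counts = {}
    for line in report.splitlines():
        match = _COUNT_LINE.match(line)
        if match:
            obfuscated = int(match.group("obfuscated") or 0)
            counts[match.group("name")] = (int(match.group("total")), obfuscated)
    return counts


def is_suspicious(counts: dict) -> bool:
    """True if any /JS or /JavaScript name was written in obfuscated form."""
    return any(counts.get(name, (0, 0))[1] > 0 for name in ("/JS", "/JavaScript"))
```

    Lines that do not follow the name/count pattern, such as the header or the "/Colors > 2^24" row, are simply ignored by this parser.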


    Update

    A list of different methods (including command line tools) is available in another answer of mine here:

    • "Extract JavaScript from malicious PDF"

    The best tool currently is peepdf.py, because it can handle even heavily obfuscated JavaScript. This is a Python framework to explore (and change) the source code of PDF files, specialized in analysing malicious PDFs.

    Its author(s) recently added the extract sub-command, which extracts and prints the source code of JavaScripts contained in the PDF.

    Short usage info:

    1. Check out the sources from GitHub:
      git clone https://github.com/jesparza/peepdf.git git.peepdf
    2. Create a symlink to the script in a directory that is in your $PATH:
      cd git.peepdf ;
      ln -s $(pwd)/peepdf.py ${HOME}/bin/peepdf.py
    3. Create a script file with the PeePDF subcommand to extract the JavaScript:
      echo 'extract js > all-javascripts-from-my.pdf' > xtract.txt
    4. Run PeePDF (setting loose parsing mode, -l, and force mode to ignore errors, -f) to execute non-interactively the sub-command line(s) contained in the newly created script file, -s:
      peepdf.py -l -f -s xtract.txt my.pdf
    5. Investigate the contents of the extracted JavaScript:
      cat all-javascripts-from-my.pdf
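
    Steps 3 to 5 can also be driven from Python, which may be handy for the GUI mentioned in the question. The sketch below uses only the flags and the extract js > ... sub-command from the steps above; it assumes peepdf.py is reachable via $PATH and that the output path contains no spaces. The function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def build_peepdf_command(pdf_path: str, script_path: str, peepdf: str = "peepdf.py") -> list:
    """Same flags as step 4: loose parsing (-l), force mode (-f), script file (-s)."""
    return [peepdf, "-l", "-f", "-s", script_path, pdf_path]


def extract_with_peepdf(pdf_path: str, output_path: str) -> str:
    """Write a one-line peepdf script, run it, and return the extracted code."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "xtract.txt"
        # Step 3: the sub-command that redirects all JavaScript to a file.
        script.write_text(f"extract js > {output_path}\n")
        # Step 4: run peepdf non-interactively on the script file.
        subprocess.run(build_peepdf_command(pdf_path, str(script)), check=True)
    # Step 5: read back whatever peepdf wrote.
    return Path(output_path).read_text(errors="replace")
```

    For example, extract_with_peepdf("my.pdf", "all-javascripts-from-my.pdf") mirrors the shell commands above.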
