How to save a .pdf from a browser?

I tried to save a .pdf file using different methods I found on Stack Overflow, including FileUtils from Apache Commons IO, but the file always came out damaged. When I opened the damaged file in Notepad, I saw the following:

<HEAD>

    <TITLE>
        09010b129fasdf558a-
    </TITLE>

</HEAD>


<HTML>

<SCRIPT language="javascript" src="./js/windowClose.js"></SCRIPT>

<LINK href="./theme/default.css" rel="stylesheet" type="text/css">
<LINK href="./theme/additions.css" rel="stylesheet" type="text/css">

<BODY leftmargin="0" topmargin="0">

<TABLE cellpadding="0" cellspacing="0" width="100%">
    <TR>
        <TD class="mainSectionHeader">
            <A href="javascript:windowClose()" class="allLinks">
                CLOSE
            </A>
        </TD>

    </TR>

</TABLE>

                <script language='javaScript'>
                    alert('Session timed out. Please login again.\n');
                    window.close();
                </script>



</BODY>

</HTML>

Later, I tried to save the .pdf file from the browser using the answer provided by @BalusC. That solution was very helpful: it got rid of the session issues. However, it still produces a damaged .pdf. When I open it in Notepad, the content is completely different, and the session timeout message is gone:

<HTML>

    <HEAD>

        <TITLE>
            Evidence System
        </TITLE>

    </HEAD>

<LINK href="./theme/default.css" rel="stylesheet" type="text/css">

<TABLE cellpadding="0" cellspacing="0" class="tableWidth760" align="center">
    <TR>
        <TD class="headerTextCtr">
            Evidence System
        </TD>
    </TR>
    <TR>
        <TD colspan="2">
            <HR size="1" noshade>
        </TD>
    </TR>
    <TR>
        <TD colspan="2">



<HTML>
<HEAD>
<link href="./theme/default.css" rel="stylesheet" type="text/css">
<script language="JavaScript">

function trim(str)
{
    var trmd_str

    if(str != "")
    {
        trmd_str = str.replace(/\s*/, "")
        if (trmd_str != ""){

            trmd_str = trmd_str.replace(/\s*$/, "")
        }

    }else{
        trmd_str = str
    }
    return trmd_str
}  

function validate(frm){
    //check for User name 
    var msg="";
    if(trim(frm.userName.value)==""){
        msg += "Please enter your user id.\n";
        frm.userName.focus();
    }

    if(trim(frm.password.value)==""){
        msg += "Please enter your password.\n";
        frm.userName.focus();
    }

    if (trim(msg)==""){
        frm.submit();
    }else{
        alert(msg);
    }
}

function numCheck(event,frm){
    if( event.keyCode == 13){
            validate(frm);  
    }
}

</script>
</HEAD>

<BODY onLoad="document.frmLogin.userName.focus();">

<FORM name='frmLogin' method='post' action='./ServletVerify'>
    <TABLE width="100%" cellspacing="20">
        <tr>
            <td class="mainTextRt">
                Username
                <input type="text" name="userName" maxlength="32" tabindex="1" value="" 
                onKeyPress="numCheck(event,this.form)" class="formTextField120">
            </TD>
            <td class="mainTextLt">
                Password
                <input type="password" name="password" maxlength="32" tabindex="2" value="" 
                onKeyPress="numCheck(event,this.form)" class="formTextField120">
            </TD>
        </TR>

        <tr>                    
            <td colspan="2" class="mainTextCtr" style="color:red">
                Unknown Error
            </td>
        </tr>

        <tr>
            <td colspan="2" class="mainTextCtr">
                <input type="button" tabindex="3" value="Submit" onclick="validate(this.form)" >
            </TD>
        </TR>
    </TABLE>

    <INPUT TYPE="hidden" NAME="actionFlag" VALUE="inbox">
</FORM>

</BODY>
</HTML>

        </TD>
    </TR>
    <TR>
        <TD height="2"></TD>
    </TR>
    <TR>
        <TD colspan="2">
            <HR size="1" noshade>
        </TD>
    </TR>
    <TR>
        <TD colspan="2">
            <LINK href="./theme/default.css" rel="stylesheet" type="text/css">

<TABLE width="80%" align="center" cellspacing="0" cellpadding="0">
    <TR>
        <TD class="footerSubtext">
            Evidence Management System
        </TD>
    </TR>

    <!-- For development builds, change the date accordingly when sending EAR files out to Wal-Mart -->
    <TR>
        <TD class="footerSubtext">
            Build:&nbsp;&nbsp;v3.1
        </TD>
    </TR>

</TABLE>
        </TD>
    </TR>
</TABLE>

</HTML>

What other options do I have?

PS: When I save the file manually using Ctrl+Shift+S, the file is saved correctly.

Judging from the erroneous response, which appears to be just an HTML error page:

alert('Session timed out. Please login again.\n');

It thus appears that the PDF download has to take place within a valid HTTP session. The HTTP session is backed by a cookie. On the server side, the HTTP session in turn usually holds information about the currently active and/or logged-in user.

The Selenium web driver manages cookies all by itself fully transparently. You can obtain them programmatically as follows:

Set<Cookie> cookies = driver.manage().getCookies();

When manually fiddling with java.net.URL outside control of Selenium, you should be making sure yourself that the URL connection is using the same cookies (and thus also maintaining the same HTTP session). You can set cookies on the URL connection as follows:

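URLConnection connection = new URL(driver.getCurrentUrl()).openConnection();

// Send all cookies in a single Cookie header, separated by "; ".
StringBuilder cookieHeader = new StringBuilder();
for (Cookie cookie : driver.manage().getCookies()) {
    if (cookieHeader.length() > 0) {
        cookieHeader.append("; ");
    }
    cookieHeader.append(cookie.getName()).append('=').append(cookie.getValue());
}
connection.setRequestProperty("Cookie", cookieHeader.toString());

InputStream input = connection.getInputStream(); // Write this to file.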
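Since both damaged files in the question turned out to be HTML pages, it may help to check the response before writing it to disk. The following is only a sketch of that idea: the class name PdfSaver and the method savePdf are made up for illustration, it assumes Java 9+ (for readNBytes), and it assumes a well-formed PDF whose first bytes are "%PDF-".

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class PdfSaver {

    // Assumption: a well-formed PDF starts with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Copies the stream to target only if it starts with the PDF signature.
     * Returns false and writes nothing when the response is something else,
     * for example an HTML login or error page.
     */
    public static boolean savePdf(InputStream input, Path target) throws IOException {
        try (BufferedInputStream in = new BufferedInputStream(input)) {
            // Peek at the first bytes, then rewind so the copy starts at byte 0.
            in.mark(PDF_MAGIC.length);
            byte[] head = new byte[PDF_MAGIC.length];
            int read = in.readNBytes(head, 0, head.length);
            in.reset();

            if (read != PDF_MAGIC.length || !Arrays.equals(head, PDF_MAGIC)) {
                return false;
            }
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return true;
        }
    }
}
```

With the connection from above, this could be called as PdfSaver.savePdf(connection.getInputStream(), Paths.get("my.pdf")). A false return value means the server sent something other than a PDF, which usually points back at the session or URL rather than at the file-writing code.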

ddavison

A PDF is a binary file, and it gets corrupted because of the way copyURLToFile() works. By the way, this looks like a duplicate of "JAVA - Download Binary File (e.g. PDF) file from Webserver".

Try this custom binary download method out -

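import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public void downloadBinaryFile(String path) throws IOException {
    URL u = new URL(path);
    URLConnection uc = u.openConnection();
    String contentType = uc.getContentType();
    int contentLength = uc.getContentLength();
    // Reject text responses (e.g. an HTML error page) and responses of unknown length.
    if (contentType == null || contentType.startsWith("text/") || contentLength == -1) {
      throw new IOException("This is not a binary file.");
    }
    InputStream raw = uc.getInputStream();
    InputStream in = new BufferedInputStream(raw);
    byte[] data = new byte[contentLength];
    int bytesRead = 0;
    int offset = 0;
    // read() may return fewer bytes than requested, so loop until the buffer is full.
    while (offset < contentLength) {
      bytesRead = in.read(data, offset, data.length - offset);
      if (bytesRead == -1)
        break;
      offset += bytesRead;
    }
    in.close();

    if (offset != contentLength) {
      throw new IOException("Only read " + offset + " bytes; Expected " + contentLength + " bytes");
    }

    // Use the last path segment of the URL as the local file name.
    String file = u.getFile();
    String filename = file.substring(file.lastIndexOf('/') + 1);
    FileOutputStream out = new FileOutputStream(filename);
    out.write(data);
    out.flush();
    out.close();
}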

EDIT: It actually sounds as if you are not on the page that you think you are. Instead of calling driver.getCurrentUrl(), have your script take the URL from the link to the PDF.

Assuming there is a link like <a href='http://mysite.com/my.pdf' />, don't click it and then read the current URL. Just take the href from that link and download it.

String pdfPath = driver.findElement(By.id("someId")).getAttribute("href");
downloadBinaryFile(pdfPath);

sbridges

The server may be compressing the PDF. You can use this code, stolen from this answer, to detect and decompress the response from the server:

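import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.net.URL;
import java.util.zip.GZIPInputStream;
import org.apache.commons.io.IOUtils;

// getCurrentUrl() returns a String, so wrap it in a URL before opening the stream.
InputStream is = new URL(driver.getCurrentUrl()).openStream();
try {
   InputStream decoded = decompressStream(is);
   FileOutputStream output = new FileOutputStream(
       new File("C:\\Users\\myDocs\\myfolder\\myFile.pdf"));
   try {
       IOUtils.copy(decoded, output);
   }
   finally {
       output.close();
   }
} finally {
   is.close();
}

public static InputStream decompressStream(InputStream input) throws IOException {
     PushbackInputStream pb = new PushbackInputStream( input, 2 ); // we need a pushback stream to look ahead
     byte [] signature = new byte[2];
     int len = pb.read( signature ); // read the signature
     if( len > 0 )
       pb.unread( signature, 0, len ); // push back only the bytes that were actually read
     if( len == 2 && signature[ 0 ] == (byte) 0x1f && signature[ 1 ] == (byte) 0x8b ) // check if it matches the standard gzip magic number
       return new GZIPInputStream( pb );
     else 
       return pb;
}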
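As a complement to sniffing the gzip magic number, the decision can also be based on the Content-Encoding header that the server declares. This is a rough sketch only: the class EncodingAwareStream and its methods are hypothetical names, and it handles just the "gzip" encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

public class EncodingAwareStream {

    /** Wraps the body in a GZIPInputStream when the declared encoding is gzip. */
    public static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        // No encoding declared (or one this sketch does not handle): return the body as-is.
        return body;
    }

    /** Convenience overload for an already opened connection. */
    public static InputStream open(URLConnection connection) throws IOException {
        return decode(connection.getInputStream(), connection.getContentEncoding());
    }
}
```

The header-based check trusts what the server declares, while the magic-number check above inspects the actual bytes. Using only one of them avoids decompressing the same stream twice.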

Petr Janeček

When I try to save the file manually using CTRL+Shift+S , the file gets saved OK.

While I advocate using Java to download the file, there is a not-so-recommended workaround that presses Ctrl+Shift+S programmatically: the Robot class.

It sucks to use a workaround, but it works reliably as far as I can tell in the browsers and OSes I tried. This code should not get into any serious application. But it's OK for tests if you won't be able to solve your issue the right way.

Robot robot = new Robot();

Press Ctrl+Shift+S:

robot.keyPress(KeyEvent.VK_CONTROL);
robot.keyPress(KeyEvent.VK_SHIFT);
robot.keyPress(KeyEvent.VK_S);
robot.keyRelease(KeyEvent.VK_S);
robot.keyRelease(KeyEvent.VK_SHIFT);
robot.keyRelease(KeyEvent.VK_CONTROL);

In browsers and OSes I know, you should be in the Save file dialogue in the File name input. You can type in your absolute path:

robot.keyPress(KeyEvent.VK_C);        // C
robot.keyRelease(KeyEvent.VK_C);
robot.keyPress(KeyEvent.VK_COLON);    // : (colon)
robot.keyRelease(KeyEvent.VK_COLON);
robot.keyPress(KeyEvent.VK_SLASH);    // / (slash)
robot.keyRelease(KeyEvent.VK_SLASH);
// etc. for the whole file path

robot.keyPress(KeyEvent.VK_ENTER);    // confirm by pressing Enter in the end
robot.keyRelease(KeyEvent.VK_ENTER);

To get the key codes, you can use KeyEvent#getExtendedKeyCodeForChar() (Java 7+ only), or see the questions "How can I make Robot type a `:`?" and "Convert String to KeyEvents".
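A minimal sketch of how the per-character typing could be wrapped in a helper using KeyEvent.getExtendedKeyCodeForChar(). RobotTyper is a made-up name. The sketch does not hold Shift, so letters are typed in lower case, and characters that need a modifier on the active keyboard layout (a colon on many layouts, for instance) may be rejected by Robot and need special handling.

```java
import java.awt.Robot;
import java.awt.event.KeyEvent;

public class RobotTyper {

    /** Maps each character to a key code; throws if no key code is known for it. */
    public static int[] keyCodesFor(String text) {
        int[] codes = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            int code = KeyEvent.getExtendedKeyCodeForChar(text.charAt(i));
            if (code == KeyEvent.VK_UNDEFINED) {
                throw new IllegalArgumentException("No key code for: " + text.charAt(i));
            }
            codes[i] = code;
        }
        return codes;
    }

    /** Presses and releases one key per character, without any modifier keys. */
    public static void type(Robot robot, String text) {
        for (int code : keyCodesFor(text)) {
            robot.keyPress(code);
            robot.keyRelease(code);
        }
    }
}
```

With the robot instance created earlier, RobotTyper.type(robot, "myfile") would replace the repeated keyPress/keyRelease pairs for those characters.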

Source: https://stackoverflow.com/questions/19059769/how-to-save-a-pdf-from-a-browser

Tags: java, selenium, fileutils