I am desperately trying to output a PDF generated by PhantomJS to stdout like here
What I am getting is an empty PDF file, although it is not 0 in size, it displays a bl
When writing output to /dev/stdout/ or /dev/stderr/ on Windows, PhantomJS goes through the following steps (as seen in the render method in \phantomjs\src\webpage.cpp):
1. Because /dev/stdout/ and /dev/stderr/ do not exist there, a temporary file path is allocated.
2. renderPdf is called with the temporary file path.
3. The contents of the temporary file are read into a QByteArray.
4. QString::fromAscii is called on the byte array, and the result is written to stdout or stderr.
5. The temporary file is deleted.
To begin with, I built PhantomJS from source, but commented out the file deletion. On the next run, I was able to examine the temporary file it had rendered, which turned out to be completely fine. I also tried running
phantomjs.exe rasterize.js http://google.com > test.png
and got the same corrupt output. This immediately ruled out a rendering issue, or anything specifically to do with PDFs, meaning that the problem had to be related to the way data is written to stdout.
By this stage I suspected that some text-encoding shenanigans were going on. From previous runs, I had both a valid and an invalid version of the same file (a PNG in this case).
Using some C# code, I ran the following experiment:
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

// Read the contents of the known good file.
byte[] bytesFromGoodFile = File.ReadAllBytes("valid_file.png");
// Read the contents of the known bad file.
byte[] bytesFromBadFile = File.ReadAllBytes("invalid_file.png");
// Take the bytes from the valid file and convert them to a string
// using the Latin-1 encoding.
string iso88591String = Encoding.GetEncoding("iso-8859-1").GetString(bytesFromGoodFile);
// Take the Latin-1 decoded string and retrieve its bytes using the UTF-8 encoding.
byte[] bytesFromIso88591String = Encoding.UTF8.GetBytes(iso88591String);
// If the re-encoded bytes are identical to the bytes of the known bad file,
// we have an encoding problem. SequenceEqual also catches a length mismatch.
Debug.Assert(bytesFromBadFile.SequenceEqual(bytesFromIso88591String));
Note that I used the ISO-8859-1 encoding because Qt uses it as the default encoding for C strings. As it turned out, all those bytes were the same: decoding the valid file as Latin-1 and re-encoding it as UTF-8 reproduced the bytes of the invalid file. The point of that exercise was to see whether I could mimic the encoding steps that caused valid data to become invalid.
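The same experiment can be sketched in plain JavaScript without any file I/O. This is only an illustration: the helper name latin1ToUtf8Bytes is made up for this example, and the input is the standard eight-byte PNG signature rather than a real PhantomJS output file.

```javascript
// Hypothetical helper (not part of PhantomJS): treat every input byte as a
// Latin-1 character, then encode the resulting text as UTF-8.
function latin1ToUtf8Bytes(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      // ASCII bytes are identical in Latin-1 and UTF-8.
      out.push(b);
    } else {
      // Code points U+0080..U+00FF take two bytes in UTF-8.
      out.push(0xC0 | (b >> 6), 0x80 | (b & 0x3F));
    }
  }
  return out;
}

// The standard PNG signature; its first byte (0x89) is outside ASCII.
var pngSignature = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];
var corrupted = latin1ToUtf8Bytes(pngSignature);

// The 8-byte signature grows to 9 bytes and now starts with 0xC2 0x89.
console.log(corrupted.length, corrupted[0].toString(16), corrupted[1].toString(16));
```

A file damaged this way no longer starts with a valid PNG signature, which is consistent with viewers rejecting it even though its size is not zero.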
For further evidence, I investigated \phantomjs\src\system.cpp and \phantomjs\src\filesystem.cpp.
- In system.cpp, the System class holds references to, among other things, File objects for stdout, stdin and stderr, which are set up to use UTF-8 encoding.
- When writing to stdout, the write function of the File object is called. This function supports writing to both text and binary files, but because of the way the System class initializes them, all writing will be treated as though it were going to a text file.
So the problem boils down to this: we need to be performing a binary write to stdout, yet our writes end up being treated as text and having an encoding applied to them that causes the resulting file to be invalid.
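One way to check this diagnosis against a corrupted file you already have is to undo the suspected re-encoding and see whether the original bytes come back. The following is a minimal sketch under that assumption; undoLatin1AsUtf8 is a hypothetical name, and the function simply gives up (returns null) if the input could not have been produced by encoding Latin-1 characters as UTF-8.

```javascript
// Hypothetical helper: reverse a "Latin-1 text written as UTF-8" corruption.
// Returns the recovered bytes, or null if the input does not fit that pattern.
function undoLatin1AsUtf8(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      // ASCII passes through unchanged.
      out.push(b);
    } else if ((b === 0xC2 || b === 0xC3) &&
               i + 1 < bytes.length &&
               (bytes[i + 1] & 0xC0) === 0x80) {
      // Two-byte UTF-8 sequence for U+0080..U+00FF: rebuild the single byte.
      out.push(((b & 0x03) << 6) | (bytes[i + 1] & 0x3F));
      i++; // skip the continuation byte
    } else {
      // Any other non-ASCII byte means this was not the corruption we assumed.
      return null;
    }
  }
  return out;
}

// Start of a PNG that went through the text-mode write described above.
var damaged = [0xC2, 0x89, 0x50, 0x4E, 0x47];
var recovered = undoLatin1AsUtf8(damaged);
console.log(recovered); // the leading 0xC2 0x89 pair collapses back to 0x89
```

If the function returns null for your file, the damage has some other cause and the explanation above does not apply as-is.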
Given the problem described above, I can't see any way to get this working the way you want on Windows without making changes to the PhantomJS code. So here they are:
This first change will provide a function we can call on File objects to explicitly perform a binary write.
Add the following function prototype in \phantomjs\src\filesystem.h:
bool binaryWrite(const QString &data);
And place its definition in \phantomjs\src\filesystem.cpp (the code for this method comes from the write method in this file):
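bool File::binaryWrite(const QString &data)
{
    // Report failure if the underlying file cannot be written to.
    if ( !m_file->isWritable() ) {
        qDebug() << "File::binaryWrite - " << "Couldn't write:" << m_file->fileName();
        return false;
    }
    // Convert each character to a single byte and write the bytes as-is,
    // bypassing the text encoding configured on the File object.
    QByteArray bytes(data.size(), Qt::Uninitialized);
    for(int i = 0; i < data.size(); ++i) {
        bytes[i] = data.at(i).toAscii();
    }
    // QIODevice::write returns -1 on error.
    return m_file->write(bytes) != -1;
}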
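To see why a per-character copy like this preserves the data, here is a rough JavaScript analogue. It assumes that every character of the string was produced from exactly one byte, which is how the Latin-1 default described earlier behaves; bytesToString and stringToBytes are invented names and do not exist in PhantomJS or Qt.

```javascript
// Hypothetical stand-in for the fromAscii step under a Latin-1 default:
// one character per input byte.
function bytesToString(bytes) {
  var s = '';
  for (var i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

// Hypothetical stand-in for the loop in binaryWrite: one byte per character,
// with no multi-byte encoding applied.
function stringToBytes(s) {
  var out = [];
  for (var i = 0; i < s.length; i++) {
    out.push(s.charCodeAt(i) & 0xFF);
  }
  return out;
}

var original = [0x00, 0x7F, 0x80, 0x89, 0xFF];
var roundTripped = stringToBytes(bytesToString(original));

// The round trip returns the same five bytes, including those above 0x7F.
console.log(roundTripped.join(',') === original.join(','));
```

The text-mode write path differs only in the second step, where each character above 0x7F is expanded into a multi-byte UTF-8 sequence.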
At around line 920 of \phantomjs\src\webpage.cpp you'll see a block of code that looks like this:
if( fileName == STDOUT_FILENAME ){
#ifdef Q_OS_WIN32
    _setmode(_fileno(stdout), O_BINARY);
#endif
    ((File *)system->_stdout())->write(QString::fromAscii(ba.constData(), ba.size()));
#ifdef Q_OS_WIN32
    _setmode(_fileno(stdout), O_TEXT);
#endif
}
Change it to this:
if( fileName == STDOUT_FILENAME ){
#ifdef Q_OS_WIN32
    // Windows: switch stdout to binary mode and bypass the text encoding.
    _setmode(_fileno(stdout), O_BINARY);
    ((File *)system->_stdout())->binaryWrite(QString::fromAscii(ba.constData(), ba.size()));
    _setmode(_fileno(stdout), O_TEXT);
#else
    // Other platforms: keep the original behavior.
    ((File *)system->_stdout())->write(QString::fromAscii(ba.constData(), ba.size()));
#endif
}
The replacement calls our new binaryWrite function, but only inside a #ifdef Q_OS_WIN32 block. I did it this way to preserve the old functionality on non-Windows systems, which don't seem to exhibit this problem (or do they?). Note that this fix only applies to writing to stdout. If you want to, you could apply it to stderr as well, but it may not matter quite so much in that case.
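The _setmode(..., O_BINARY) call stays in place because a Windows text-mode stream expands every line feed into a carriage return plus line feed on output, which would damage binary data independently of the encoding issue. The sketch below simulates that translation in JavaScript; textModeWrite is a made-up name used only for illustration and is not a real API.

```javascript
// Hypothetical simulation of a Windows text-mode write:
// every LF (0x0A) is emitted as CR LF (0x0D 0x0A).
function textModeWrite(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0x0A) {
      out.push(0x0D);
    }
    out.push(bytes[i]);
  }
  return out;
}

// The last four bytes of the PNG signature contain two line feeds.
var tail = [0x0D, 0x0A, 0x1A, 0x0A];
var translated = textModeWrite(tail);

// Each 0x0A gains a preceding 0x0D, so four bytes become six.
console.log(translated.length);
```

Binary mode switches this translation off, so the bytes handed to the stream are the bytes that reach the file or pipe.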
In case you just want a pre-built binary (who wouldn't?), you can find phantomjs.exe with these fixes on my SkyDrive. My version is around 19MB whereas the one I downloaded earlier was only about 6MB, though I followed the instructions here, so it should be fine.
Yes, that's right: ISO-8859-1 is the default encoding for Qt, so you will need to add the --output-encoding=ISO-8859-1 parameter to the command line so that the PDF output won't be corrupted, i.e.
phantomjs.exe rasterize.js --output-encoding=ISO-8859-1 < input.html > output.pdf
and rasterize.js looks like this (tested, works on both Unix and Windows):
var page = require('webpage').create(),
    system = require('system');

page.viewportSize = {width: 600, height: 600};
page.paperSize = {format: 'A4', orientation: system.args[1], margin: '1cm'};
// Read the HTML to render from stdin.
page.content = system.stdin.read();

// Give the page a moment to load before rendering.
window.setTimeout(function () {
    try {
        page.render('/dev/stdout', {format: 'pdf'});
    }
    catch (e) {
        // Report errors on stderr so they do not end up inside the PDF on stdout.
        system.stderr.write(e.message + '\n');
    }
    phantom.exit();
}, 1000);
Alternatively, you can set the encoding on stdout from within the script. If you are reading from a UTF-8 stream, you might have to set the encoding for stdin as well:
system.stdout.setEncoding('ISO-8859-1');
system.stdin.setEncoding('UTF-8');
page.content = system.stdin.read();