问题
I am attempting to add non-interlaced GIF images other than 8-bit to a PDF document without having to fully decode the bitstream using PDF::Create for Perl.
The LZWDecode
algorithm that is part of the PDF standard requires all images to have a minimum LZW code size of 8-bits, and PDF::Create
is hard-coded to only embed 8-bit images.
So far, I have adapted the image loader from PDF::Create
to read a 5-bit image and to fully decode the LZW stream. I am then able to use the encoder algorithm from PDF::Create
to re-pack the image as 8-bit.
What I'd like to do is to eliminate the memory-intensive decode/encode step. This thread suggests that this is possible by "widening or shifting bits" to make LZW codes the proper length for LZWDecode
.
I contacted the thread author and he provided some additional details, in particular that codes for color indices remain the same but are padded with zeros (e.g., [10000]
becomes [000010000]
), that <Clear>
and <End>
codes are changed to <256>
and <257>
respectively, and that all other codes are offset by 256 - original <Clear>
code.
However, he was unable to elaborate further due to restrictions by his employer. In particular, I am uncertain how to handle a code when its modified value exceeds <4095>
(the maximum index for the LZW code table). I am also unsure how to re-pack the revised codes into a bitstream.
The algorithms I am currently using are below.
# Read 5-bit data stream
sub ReadData5 {
my $data = shift;
my $c_size = 6; # minimium LZW code size
my $t_size = 33; # initial code table size
my ($i_buff,$i_bits) = (0,0); # input buffer
my ($o_buff,$o_bits) = (0,0); # output buffer
my $stream = ''; # bitstream
my $pos = 0;
SUB_BLOCK: while (1){
my $s = substr($data, $pos++, 1);
# get sub-block size
my $n_bytes = unpack('C', $s) or last SUB_BLOCK;
my $c_mask = (1 << $c_size) - 1;
BYTES: while (1){
# read c_size bits
while ($i_bits < $c_size){
# end of sub-block
!$n_bytes-- and next SUB_BLOCK;
$s = substr($data, $pos++, 1);
my $c = unpack('C', $s);
$i_buff |= $c << $i_bits;
$i_bits += 8;
}
# write c_size bits
my $code = $i_buff & $c_mask;
my $w_bits = $c_size;
$i_buff >>= $c_size;
$i_bits -= $c_size;
$t_size++;
if ($o_bits > 0){
$o_buff |= $code >> ($c_size - 8 + $o_bits);
$w_bits -= 8 - $o_bits;
$stream .= pack('C', $o_buff & 0xFF);
}
if ($w_bits >= 8){
$w_bits -= 8;
$stream .= pack('C', ($code >> $w_bits) & 0xFF);
}
if (($o_bits = $w_bits) > 0){
$o_buff = $code << (8 - $o_bits);
}
# clear code
if ($code == 32){
$c_size = 6;
$t_size = 33;
$c_mask = (1 << $c_size) - 1;
}
# end code
if ($code == 33){
$stream .= pack('C', $o_buff & 0xFF);
last SUB_BLOCK;
}
if ($t_size == (1 << $c_size)){
if (++$c_size > 12){
$c_size--;
} else {
$c_mask = (1 << $c_size) - 1;
}
}
}
}
# Pad with zeros to byte boundary
$stream .= '0' x (8 - length($stream) % 8);
return $stream;
}
#---------------------------------------------------------------------------
# Decode 5-bit data stream
sub UnLZW5 {
my $data = shift;
my $c_size = 6; # minimium LZW code size
my $t_size = 33; # initial code table size
my ($i_buff,$i_bits) = (0,0); # input buffer
my $stream = ''; # bitstream
my $pos = 0;
# initialize code table
my @table = map { chr($_) } 0..$t_size-2;
$table[32] = '';
my $prefix = '';
my $suffix = '';
# get first code word
while ($i_bits < $c_size){
my $d = unpack('C', substr($data, $pos++, 1));
$i_buff = ($i_buff << 8) + $d;
$i_bits += 8;
}
my $c2 = $i_buff >> ($i_bits - $c_size);
$i_bits -= $c_size;
my $c_mask = (1 << $i_bits) - 1;
$i_buff &= $c_mask;
# get remaining code words
DECOMPRESS: while ($pos < length($data)){
my $c1 = $c2;
while ($i_bits < $c_size){
my $d = unpack('C', substr($data, $pos++, 1));
$i_buff = ($i_buff << 8) + $d;
$i_bits += 8;
}
$c2 = $i_buff >> ($i_bits - $c_size);
$i_bits -= $c_size;
$c_mask = (1 << $i_bits) - 1;
$i_buff &= $c_mask;
# clear code
if ($c2 == 32){
$stream .= $table[$c1];
$#table = 32;
$c_size = 6;
$t_size = 33;
next DECOMPRESS;
}
# end code
if ($c2 == 33){
$stream .= $table[$c1];
last DECOMPRESS;
}
# get prefix and suffix
$prefix = $table[$c1] if $c1 < $t_size;
$suffix = $c2 < $t_size ? substr($table[$c2], 0, 1) : substr($prefix, 0, 1);
# write prefix
$stream .= $prefix;
# write multiple-character sequence
$table[$t_size++] = $prefix . $suffix;
# increase code size
if ($t_size == 2 ** $c_size){
if (++$c_size > 12){
$c_size--;
}
}
}
return $stream;
}
回答1:
Doing one at a time is slow. Doing them all at once takes too much memory for you. Do them a chunk at a time.
my $BUFFER_SIZE = 5 * 50_000; # Must be a multiple of 5.
my $in_bytes = ...;
my $out_bytes = '';
while (my ($bytes) = $in_bytes =~ s/^(.{1,$BUFFER_SIZE})//s) {
# Unpack from 5 bit fields.
my @vals = map { pack('B*', "000$_") } unpack('B*', $bytes) =~ /(.{5})/g;
# Transform @vals into 8 bit values here.
# Pack to 8 bit fields.
$out_bytes .= pack('C*', @vals);
}
Since you're not transforming the values at all (just how they are stored), that simplifies to:
my $BUFFER_SIZE = 5 * 50_000; # Must be a multiple of 40.
my $in_bytes = ...;
my $out_bytes = '';
while (my ($bytes) = $in_bytes =~ s/^(.{1,$BUFFER_SIZE})//s) {
# Unpack from 5 bit fields.
my $bits = unpack('B*', $bytes);
$bits =~ s/(.{5})/000$1/g;
$out_bytes .= pack('B*', $bits);
}
You didn't say what to do with the extra bits. I simply ignored them.
Alternative approach with no bit string creation:
my $in_bytes = ...;
my $out_bytes = '';
while (my ($bytes) = $in_bytes =~ s/^(.{1,5})//s) {
my @bytes = map ord, split //, $bytes;
# 00000111 11222223 33334444 45555566 66677777
$out_bytes .= chr( (($bytes[0] >> 3) & 0x1F));
last if @bytes == 1;
$out_bytes .= chr((($bytes[0] << 2) & 0x1C) | (($bytes[1] >> 6) & 0x03));
$out_bytes .= chr( (($bytes[1] >> 1) & 0x1F));
last if @bytes == 2;
$out_bytes .= chr((($bytes[1] << 4) & 0x10) | (($bytes[2] >> 4) & 0x0F));
last if @bytes == 3;
$out_bytes .= chr((($bytes[2] << 1) & 0x1E) | (($bytes[3] >> 7) & 0x01));
$out_bytes .= chr( (($bytes[3] >> 2) & 0x1F));
last if @bytes == 4;
$out_bytes .= chr((($bytes[3] << 3) & 0x18) | (($bytes[4] >> 5) & 0x07));
$out_bytes .= chr( ( $bytes[4] & 0x1F));
}
The advantage of the above solution is that it's particularly efficient in C.
STRLEN in_len;
const char* in = SvPVbyte(sv, in_len);
STRLEN out_len = (in_len * 8 / 5) * 8;
char* out = (char*)malloc(out_len);
char* out_cur = out;
char* in_end = in + in_len;
while (in != in_end) {
*(out_cur++) = ((*in >> 3) & 0x1F));
if (++in == in_end) break;
*(out_cur++) = ((in[-1] << 2) & 0x1C) | ((*in >> 6) & 0x03));
*(out_cur++) = ((*in >> 1) & 0x1F));
if (++in == in_end) break;
*(out_cur++) = ((in[-1] << 4) & 0x10) | ((*in >> 4) & 0x0F));
if (++in == in_end) break;
*(out_cur++) = ((in[-1] << 1) & 0x1E) | ((*in >> 7) & 0x01));
*(out_cur++) = ((*in >> 2) & 0x1F));
if (++in == in_end) break;
*(out_cur++) = ((in[-1] << 3) & 0x18) | ((*in >> 5) & 0x07));
*(out_cur++) = ( *in & 0x1F));
}
return newSVpvn(out, out_len);
来源:https://stackoverflow.com/questions/11108579/use-perl-to-add-gif-image-other-than-8-bit-to-pdf