Use Perl to Add GIF Image Other Than 8-bit to PDF

I am attempting to add non-interlaced GIF images other than 8-bit to a PDF document without having to fully decode the bitstream using PDF::Create for Perl.

The LZWDecode algorithm that is part of the PDF standard requires all images to have a minimum LZW code size of 8-bits, and PDF::Create is hard-coded to only embed 8-bit images.

So far, I have adapted the image loader from PDF::Create to read a 5-bit image and to fully decode the LZW stream. I am then able to use the encoder algorithm from PDF::Create to re-pack the image as 8-bit.

What I'd like to do is to eliminate the memory-intensive decode/encode step. This thread suggests that this is possible by "widening or shifting bits" to make LZW codes the proper length for LZWDecode.

I contacted the thread author and he provided some additional details, in particular that codes for color indices remain the same but are padded with zeros (e.g., [10000] becomes [000010000]), that <Clear> and <End> codes are changed to <256> and <257> respectively, and that all other codes are offset by 256 - original <Clear> code.

However, he was unable to elaborate further due to restrictions by his employer. In particular, I am uncertain how to handle a code when its modified value exceeds <4095> (the maximum index for the LZW code table). I am also unsure how to re-pack the revised codes into a bitstream.

The algorithms I am currently using are below.

# Read 5-bit data stream

sub ReadData5 {

    my $data = shift;

    my $c_size = 6;                # minimium LZW code size
    my $t_size = 33;               # initial code table size
    my ($i_buff,$i_bits) = (0,0);  # input buffer
    my ($o_buff,$o_bits) = (0,0);  # output buffer

    my $stream = '';               # bitstream
    my $pos    = 0;

    SUB_BLOCK: while (1){
        my $s = substr($data, $pos++, 1);

        # get sub-block size
        my $n_bytes  = unpack('C', $s) or last SUB_BLOCK;
        my $c_mask   = (1 << $c_size) - 1;

        BYTES: while (1){
            # read c_size bits
            while ($i_bits < $c_size){

                # end of sub-block
                !$n_bytes-- and next SUB_BLOCK;

                $s = substr($data, $pos++, 1);
                my $c = unpack('C', $s);

                $i_buff |= $c << $i_bits;
                $i_bits += 8;
            }

            # write c_size bits
            my $code   = $i_buff & $c_mask;

            my $w_bits = $c_size;
            $i_buff  >>= $c_size;
            $i_bits   -= $c_size;
            $t_size++;

            if ($o_bits > 0){
                $o_buff |= $code >> ($c_size - 8 + $o_bits);
                $w_bits -= 8 - $o_bits;
                $stream .= pack('C', $o_buff & 0xFF);
            }

            if ($w_bits >= 8){
                $w_bits -= 8;
                $stream .= pack('C', ($code >> $w_bits) & 0xFF);
            }

            if (($o_bits = $w_bits) > 0){
                $o_buff = $code << (8 - $o_bits);
            }

            # clear code
            if ($code == 32){
                $c_size   = 6;
                $t_size   = 33;
                $c_mask   = (1 << $c_size) - 1;
            }

            # end code
            if ($code == 33){
                $stream .= pack('C', $o_buff & 0xFF);
                last SUB_BLOCK;
            }

            if ($t_size == (1 << $c_size)){
                if (++$c_size > 12){
                    $c_size--;
                } else {
                    $c_mask = (1 << $c_size) - 1;
                }
            }
        }
    }

    # Pad with zeros to byte boundary
    $stream .= '0' x (8 - length($stream) % 8);

    return $stream;
}

#---------------------------------------------------------------------------

# Decode 5-bit data stream

sub UnLZW5 {
    my $data = shift;

    my $c_size = 6;                 # minimium LZW code size
    my $t_size = 33;                # initial code table size
    my ($i_buff,$i_bits) = (0,0);   # input buffer

    my $stream = '';                # bitstream
    my $pos    = 0;

    # initialize code table
    my @table  = map { chr($_) } 0..$t_size-2;
    $table[32] = '';
    my $prefix = '';
    my $suffix = '';

    # get first code word
    while ($i_bits < $c_size){
        my $d     = unpack('C', substr($data, $pos++, 1));
        $i_buff   = ($i_buff << 8) + $d;
        $i_bits += 8;
    }

    my $c2     = $i_buff >> ($i_bits - $c_size);
    $i_bits   -= $c_size;
    my $c_mask = (1 << $i_bits) - 1;
    $i_buff   &= $c_mask;

    # get remaining code words
    DECOMPRESS: while ($pos < length($data)){
        my $c1 = $c2;

        while ($i_bits < $c_size){
            my $d     = unpack('C', substr($data, $pos++, 1));
            $i_buff   = ($i_buff << 8) + $d;
            $i_bits  += 8;
        }

        $c2      = $i_buff >> ($i_bits - $c_size);
        $i_bits -= $c_size;
        $c_mask  = (1 << $i_bits) - 1;
        $i_buff &= $c_mask;

        # clear code
        if ($c2 == 32){
            $stream  .= $table[$c1];
            $#table   = 32;
            $c_size   = 6;
            $t_size   = 33;
            next DECOMPRESS;
        }

        # end code
        if ($c2 == 33){
            $stream .= $table[$c1];
            last DECOMPRESS;
        }

        # get prefix and suffix
        $prefix = $table[$c1] if $c1 < $t_size;
        $suffix = $c2 < $t_size ? substr($table[$c2], 0, 1) : substr($prefix, 0, 1);

        # write prefix
        $stream .= $prefix;

        # write multiple-character sequence
        $table[$t_size++] = $prefix . $suffix;

        # increase code size
        if ($t_size == 2 ** $c_size){
            if (++$c_size > 12){
                $c_size--;
            }
        }
    }

    return $stream;
}

Doing one at a time is slow. Doing them all at once takes too much memory for you. Do them a chunk at a time.

my $BUFFER_SIZE = 5 * 50_000;  # Must be a multiple of 5.

my $in_bytes = ...;
my $out_bytes = '';
while (my ($bytes) = $in_bytes =~ s/^(.{1,$BUFFER_SIZE})//s) {
   # Unpack from 5 bit fields.
   my @vals = map { pack('B*', "000$_") } unpack('B*', $bytes) =~ /(.{5})/g;

   # Transform @vals into 8 bit values here.

   # Pack to 8 bit fields.
   $out_bytes .= pack('C*', @vals);

}

Since you're not transforming the values at all (just how they are stored), that simplifies to:

my $BUFFER_SIZE = 5 * 50_000;  # Must be a multiple of 40.

my $in_bytes = ...;
my $out_bytes = '';
while (my ($bytes) = $in_bytes =~ s/^(.{1,$BUFFER_SIZE})//s) {
   # Unpack from 5 bit fields.
   my $bits = unpack('B*', $bytes);
   $bits =~ s/(.{5})/000$1/g;
   $out_bytes .= pack('B*', $bits);

}

You didn't say what to do with the extra bits. I simply ignored them.

Alternative approach with no bit string creation:

my $in_bytes = ...;
my $out_bytes = '';
while (my ($bytes) = $in_bytes =~ s/^(.{1,5})//s) {
    my @bytes = map ord, split //, $bytes;

    # 00000111 11222223 33334444 45555566 66677777

    $out_bytes .= chr(                            (($bytes[0] >> 3) & 0x1F));
    last if @bytes == 1;
    $out_bytes .= chr((($bytes[0] << 2) & 0x1C) | (($bytes[1] >> 6) & 0x03));
    $out_bytes .= chr(                            (($bytes[1] >> 1) & 0x1F));
    last if @bytes == 2;
    $out_bytes .= chr((($bytes[1] << 4) & 0x10) | (($bytes[2] >> 4) & 0x0F));
    last if @bytes == 3;
    $out_bytes .= chr((($bytes[2] << 1) & 0x1E) | (($bytes[3] >> 7) & 0x01));
    $out_bytes .= chr(                            (($bytes[3] >> 2) & 0x1F));
    last if @bytes == 4;
    $out_bytes .= chr((($bytes[3] << 3) & 0x18) | (($bytes[4] >> 5) & 0x07));
    $out_bytes .= chr(                            ( $bytes[4]       & 0x1F));
}

The advantage of the above solution is that it's particularly efficient in C.

STRLEN in_len;
const char* in = SvPVbyte(sv, in_len);

STRLEN out_len = (in_len * 8 / 5) * 8;
char* out = (char*)malloc(out_len);

char* out_cur = out;
char* in_end = in + in_len;

while (in != in_end) {
    *(out_cur++) =                          ((*in >> 3) & 0x1F));
    if (++in == in_end) break;
    *(out_cur++) = ((in[-1] << 2) & 0x1C) | ((*in >> 6) & 0x03));
    *(out_cur++) =                          ((*in >> 1) & 0x1F));
    if (++in == in_end) break;
    *(out_cur++) = ((in[-1] << 4) & 0x10) | ((*in >> 4) & 0x0F));
    if (++in == in_end) break;
    *(out_cur++) = ((in[-1] << 1) & 0x1E) | ((*in >> 7) & 0x01));
    *(out_cur++) =                          ((*in >> 2) & 0x1F));
    if (++in == in_end) break;
    *(out_cur++) = ((in[-1] << 3) & 0x18) | ((*in >> 5) & 0x07));
    *(out_cur++) =                          ( *in       & 0x1F));
}

return newSVpvn(out, out_len);

来源：https://stackoverflow.com/questions/11108579/use-perl-to-add-gif-image-other-than-8-bit-to-pdf

标签

perl

pdf

gif

bit-shift

lzw