Decoding ima4 audio format

囚心锁ツ 2021-02-08 12:24

To reduce the download size of an iPhone application I'm compressing some audio files. Specifically I'm using afconvert on the command line to change .wav format to .caf form

2 answers
  • 2021-02-08 12:35

    After gathering the data from Wooji-Juice, the Multimedia Wiki and Apple, here is my proposal (it may need some experimentation):

    File structure

    • An Apple IMA4 file is made of 34-byte packets. The packet is the unit the file is built from.
    • Each 34-byte packet has two parts:
      • the first 2 bytes contain the preamble: an initial predictor and a step index
      • the remaining 32 bytes contain the sound nibbles (each 4-bit nibble is used to reconstruct one 16-bit sample)
    • The 32 bytes of compressed data in each packet therefore represent 64 16-bit samples.
    • If the sound file is stereo, the packets are interleaved (one for the left channel, one for the right), so there must be an even number of packets.
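    The layout above can be sketched in C as follows. This is a minimal sketch, not a reference implementation: the names `Ima4Preamble`, `ima4_read_preamble`, `ima4_packet_offset` and `ima4_packet_channel` are hypothetical, and the 9-bit/7-bit split of the preamble is taken from the Multimedia Wiki description used later in this answer.

```c
#include <stddef.h>
#include <stdint.h>

enum { IMA4_PACKET_BYTES = 34, IMA4_SAMPLES_PER_PACKET = 64 };

/* Hypothetical holder for the two values stored in the 2-byte preamble. */
typedef struct {
    int predictor;   /* initial predictor, sign-extended from 16 bits */
    int step_index;  /* initial index into the step table */
} Ima4Preamble;

/* Split the big-endian preamble: upper 9 bits = predictor, lower 7 = step index. */
static Ima4Preamble ima4_read_preamble(const uint8_t *packet)
{
    int raw = (packet[0] << 8) | packet[1];
    Ima4Preamble p;
    p.predictor = raw & 0xFF80;
    if (p.predictor >= 0x8000)      /* interpret as a signed 16-bit value */
        p.predictor -= 0x10000;
    p.step_index = raw & 0x007F;
    return p;
}

/* Byte offset of packet number packet_index within the audio data. */
static size_t ima4_packet_offset(size_t packet_index)
{
    return packet_index * IMA4_PACKET_BYTES;
}

/* With interleaved packets, packet k belongs to channel k % channels
   (0 = left, 1 = right for stereo). */
static size_t ima4_packet_channel(size_t packet_index, size_t channels)
{
    return packet_index % channels;
}
```

    For a stereo file this means packets 0, 2, 4, ... feed the left channel and packets 1, 3, 5, ... feed the right one, each decoded with its own preamble.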

    Decoding

    Each 34-byte packet decompresses to 64 16-bit samples, so the uncompressed data takes 128 bytes per packet.

    The decoding pseudo code looks like:

    int[] ima_index_table = ... // Index table (16 entries) from the Multimedia Wiki
    int[] ima_step_table = ... // Step table (89 entries) from the Multimedia Wiki
    byte[] packet = ... // One 34-byte compressed packet
    short[] output = ... // Output buffer: 64 samples (128 bytes)
    int preamble = ((packet[0] & 0xFF) << 8) | (packet[1] & 0xFF); // Big-endian 16-bit value
    int predictor = (short) (preamble & 0xFF80); // Upper 9 bits, sign-extended
    int step_index = preamble & 0x007F; // Lower 7 bits
    if (step_index > 88) step_index = 88; // Stay inside the step table
    int j = 0;
    for (int i = 2; i < 34; i++) {
        int data = packet[i] & 0xFF; // Treat the byte as unsigned
        // The lower nibble is decoded first, then the upper nibble
        int[] nibbles = { data & 0x0F, data >> 4 };
        for (int nibble : nibbles) {
            int step = ima_step_table[step_index];
    
            // diff = (magnitude + 0.5) * step / 4, in integer arithmetic
            int diff = step >> 3;
            if ((nibble & 4) != 0) diff += step;
            if ((nibble & 2) != 0) diff += step >> 1;
            if ((nibble & 1) != 0) diff += step >> 2;
    
            // Bit 3 of the nibble is the sign
            if ((nibble & 8) != 0) predictor -= diff;
            else predictor += diff;
    
            // Clamp the predictor to the signed 16-bit range
            if (predictor > 32767) predictor = 32767;
            else if (predictor < -32768) predictor = -32768;
            output[j++] = (short) predictor;
    
            // Update the step index for the next nibble and keep it in range
            step_index += ima_index_table[nibble];
            if (step_index < 0) step_index = 0;
            else if (step_index > 88) step_index = 88;
        }
    }
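    For readers who want something compilable, here is a C sketch of the same per-packet decode with the two tables filled in. The function names are hypothetical. It assumes the standard IMA ADPCM behaviour as described on the Multimedia Wiki: the diff is computed with integer shifts, bit 3 of each nibble is the sign, samples are clamped to the signed 16-bit range, and the step index is clamped to 0..88. It has not been checked against Apple's own decoder.

```c
#include <stdint.h>

enum { IMA4_PACKET_BYTES = 34, IMA4_SAMPLES_PER_PACKET = 64 };

/* Standard IMA ADPCM index adjustment table, indexed by the full 4-bit nibble. */
static const int ima_index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8,
    -1, -1, -1, -1, 2, 4, 6, 8
};

/* Standard IMA ADPCM step size table (89 entries). */
static const int ima_step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};

/* Decode one nibble, updating the running predictor and step index. */
static int16_t ima4_decode_nibble(unsigned nibble, int *predictor, int *step_index)
{
    int step = ima_step_table[*step_index];

    /* Integer form of (magnitude + 0.5) * step / 4. */
    int diff = step >> 3;
    if (nibble & 4) diff += step;
    if (nibble & 2) diff += step >> 1;
    if (nibble & 1) diff += step >> 2;

    /* Bit 3 is the sign bit. */
    if (nibble & 8) *predictor -= diff;
    else            *predictor += diff;

    /* Clamp to the signed 16-bit sample range. */
    if (*predictor > 32767)       *predictor = 32767;
    else if (*predictor < -32768) *predictor = -32768;

    /* Adapt the step index for the next nibble, staying inside the table. */
    *step_index += ima_index_table[nibble];
    if (*step_index < 0)       *step_index = 0;
    else if (*step_index > 88) *step_index = 88;

    return (int16_t)*predictor;
}

/* Decode one 34-byte packet into 64 signed 16-bit samples. */
static void ima4_decode_packet(const uint8_t packet[IMA4_PACKET_BYTES],
                               int16_t out[IMA4_SAMPLES_PER_PACKET])
{
    int preamble = (packet[0] << 8) | packet[1];   /* big-endian */
    int predictor = preamble & 0xFF80;             /* upper 9 bits */
    if (predictor >= 0x8000) predictor -= 0x10000; /* sign-extend */
    int step_index = preamble & 0x007F;            /* lower 7 bits */
    if (step_index > 88) step_index = 88;

    int j = 0;
    for (int i = 2; i < IMA4_PACKET_BYTES; i++) {
        /* Lower nibble first, then upper nibble. */
        out[j++] = ima4_decode_nibble(packet[i] & 0x0F, &predictor, &step_index);
        out[j++] = ima4_decode_nibble(packet[i] >> 4, &predictor, &step_index);
    }
}
```

    For a stereo file, call `ima4_decode_packet` on alternating packets and interleave the two 64-sample outputs yourself.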
    
  • 2021-02-08 13:02

    The term "packet" refers to a group of compressed audio samples with a header. You need the header to decode the data immediately following. If you consider your ima4 file to be a book, then each packet is a page. At the top are the values needed to decode that page, followed by the compressed audio.

    That's why you need to calculate the size of the unpacked data (and then make space for it) -- since it's compressed, you need to convert data from compressed audio to uncompressed audio before you can output it. In order to allocate an output buffer, you need to know how big it has to be (note: you may need to output in chunks that are larger than a single packet at a time).

    It looks like the typical structure, per the earlier "Overview" section, is that sets of 64 samples, each 16 bits (so 128 bytes) are translated to a 2-byte header and a 32-byte set of compressed samples (34 bytes in all). So, in the typical case, you can produce your expected output datasize by taking the input data size, dividing by 34 to get the number of packets, then multiplying by 128 bytes for the uncompressed audio per packet.

    You shouldn't hard-code those values, though. It looks like you should instead query kAudioFilePropertyDataFormat to get mBytesPerPacket (the "34" above) and mFramesPerPacket (the 64 above, which gets multiplied by 2 bytes per 16-bit sample to make 128 bytes of output).
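    The size calculation can be sketched like this. `PacketFormat` is a hypothetical stand-in for the two format fields mentioned above (in a real program they would come from the structure returned for kAudioFilePropertyDataFormat), and the sketch assumes mono 16-bit output as in the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two format fields used in the calculation. */
typedef struct {
    uint32_t bytes_per_packet;   /* mBytesPerPacket, typically 34 */
    uint32_t frames_per_packet;  /* mFramesPerPacket, typically 64 */
} PacketFormat;

/* Number of bytes needed to hold the decoded 16-bit samples (mono assumed).
   Trailing bytes that do not form a whole packet are ignored.
   Returns 0 if the format reports a zero packet size. */
static size_t decoded_size_bytes(size_t compressed_bytes, PacketFormat fmt)
{
    if (fmt.bytes_per_packet == 0)
        return 0;
    size_t packets = compressed_bytes / fmt.bytes_per_packet;
    size_t frames = packets * fmt.frames_per_packet;
    return frames * sizeof(int16_t);
}
```

    With the typical values, 3400 bytes of compressed data are 100 packets, which decode to 6400 samples, or 12800 bytes.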

    Then, for each packet, you will need to run through the decoding described in the post. In somewhat longer pseudo C-code, assuming you are getting arrays of bytes, to handle the header:

    packet = GetPacket();
    Header = (packet[0] << 8) | packet[1]; // Big-endian 16-bit value
    step_index = Header & 0x007f; // Lower seven bits
    predictor = Header & 0xff80; // Upper nine bits
    for (i = 2; i < mBytesPerPacket; i++)
    {
        nibble = packet[i] & 0x0f; // Low nibble
        // process that nibble, per the blogpost -- be careful on sign-extension!
        nibble = (packet[i] & 0xf0) >> 4; // High nibble
        // process that nibble, per the blogpost -- be careful on sign-extension!
    }
    

    The sign-extension above refers to the fact that the post involves handling each nibble both in an unsigned and a signed way. If the high bit of a nibble (bit 3) is a 1, then it is negative; additionally the bit-shift may do sign-extension. This is not handled in the above pseudocode.
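    One way to sidestep both problems is sketched below. The helper names are hypothetical. The byte is converted to an unsigned type before shifting, so the shift cannot sign-extend, and bit 3 of the nibble is treated as an explicit sign flag on a magnitude built from the lower three bits, which is the usual IMA ADPCM interpretation.

```c
#include <stdint.h>

/* Extract nibbles from a byte that may have arrived as a signed char.
   Converting to uint8_t first keeps the right shift from sign-extending. */
static unsigned low_nibble(signed char raw)
{
    return (uint8_t)raw & 0x0F;
}

static unsigned high_nibble(signed char raw)
{
    return ((uint8_t)raw >> 4) & 0x0F;
}

/* Signed change to the predictor for one nibble at the given step size.
   Bits 0..2 are the magnitude, bit 3 is the sign. */
static int ima4_signed_delta(unsigned nibble, int step)
{
    int diff = step >> 3;
    if (nibble & 4) diff += step;
    if (nibble & 2) diff += step >> 1;
    if (nibble & 1) diff += step >> 2;
    return (nibble & 8) ? -diff : diff;
}
```

    With a step of 7, nibble 0x7 gives a delta of 11 and nibble 0xF gives -11: same magnitude, opposite sign.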
