C Library for compressing sequential positive integers

难免孤独 2021-02-05 19:39

I have the very common problem of creating an index for an in-disk array of strings. In short, I need to store the position of each string in the in-disk representation. For exa

6 Answers
  • 2021-02-05 20:05

    I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record which consisted of a record number (document id) and a word number (it could just as easily have stored word offsets) which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all. And the word offset delta would often fit within one or two bytes. Here is the code I used.

    Since it's in C++, the code may not be useful to you as is, but it can be a good starting point for writing compression routines.

    Please excuse the Hungarian notation and the magic numbers strewn within the code. Like I said, I wrote this many years ago :-)

    IndexCompressor.h

    //
    // index compressor class
    //
    
    #pragma once
    
    #include "File.h"
    
    const int IC_BUFFER_SIZE = 8192;
    
    //
    // index compressor
    //
    class IndexCompressor
    {
    private :
       File        *m_pFile;
       WA_DWORD    m_dwRecNo;
       WA_DWORD    m_dwWordNo;
       WA_DWORD    m_dwRecordCount;
       WA_DWORD    m_dwHitCount;
    
       WA_BYTE     m_byBuffer[IC_BUFFER_SIZE];
       WA_DWORD    m_dwBytes;
    
       bool        m_bDebugDump;
    
       void FlushBuffer(void);
    
    public :
       IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
       ~IndexCompressor(void) {}
    
       void Attach(File& File) { m_pFile = &File; }
    
       void Begin(void);
       void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
       void End(void);
    
       WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
       WA_DWORD GetHitCount(void) { return m_dwHitCount; }
    
       void DebugDump(void) { m_bDebugDump = true; }
    };
    

    IndexCompressor.cpp

    //
    // index compressor class
    //
    
    #include "stdafx.h"
    #include "IndexCompressor.h"
    
    void IndexCompressor::FlushBuffer(void)
    {
       ASSERT(m_pFile != 0);
    
       if (m_dwBytes > 0)
       {
          m_pFile->Write(m_byBuffer, m_dwBytes);
          m_dwBytes = 0;
       }
    }
    
    void IndexCompressor::Begin(void)
    {
       ASSERT(m_pFile != 0);
       m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
       m_dwBytes = 0;
    }
    
    void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
    {
       ASSERT(m_pFile != 0);
       WA_BYTE buffer[16];
       int nbytes = 1;
    
       ASSERT(dwRecNo >= m_dwRecNo);
    
       if (dwRecNo != m_dwRecNo)
          m_dwWordNo = 0;
       if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
          ++m_dwRecordCount;
       ++m_dwHitCount;
    
       WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
       WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;
    
       if (m_bDebugDump)
       {
          TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
       }
    
       // 1WWWWWWW
       if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
       {
          buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
       }
       // 01WWWWWW WWWWWWWW
       else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
       {
          buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
          buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
          nbytes += sizeof(WA_BYTE);
       }
       // 001RRRRR WWWWWWWW WWWWWWWW
       else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
       {
          buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
          WA_WORD *p = (WA_WORD *) (buffer+1);
          *p = WA_WORD(dwWordNoDelta);
          nbytes += sizeof(WA_WORD);
       }
       else
       {
          // 0001rrww
          buffer[0] = 0x10;
    
          // encode recno
          if (dwRecNoDelta < 256)
          {
             buffer[nbytes] = WA_BYTE(dwRecNoDelta);
             nbytes += sizeof(WA_BYTE);
          }
          else if (dwRecNoDelta < 65536)
          {
             buffer[0] |= 0x04;
             WA_WORD *p = (WA_WORD *) (buffer+nbytes);
             *p = WA_WORD(dwRecNoDelta);
             nbytes += sizeof(WA_WORD);
          }
          else
          {
             buffer[0] |= 0x08;
             WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
             *p = dwRecNoDelta;
             nbytes += sizeof(WA_DWORD);
          }
    
          // encode wordno
          if (dwWordNoDelta < 256)
          {
             buffer[nbytes] = WA_BYTE(dwWordNoDelta);
             nbytes += sizeof(WA_BYTE);
          }
          else if (dwWordNoDelta < 65536)
          {
             buffer[0] |= 0x01;
             WA_WORD *p = (WA_WORD *) (buffer+nbytes);
             *p = WA_WORD(dwWordNoDelta);
             nbytes += sizeof(WA_WORD);
          }
          else
          {
             buffer[0] |= 0x02;
             WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
             *p = dwWordNoDelta;
             nbytes += sizeof(WA_DWORD);
          }
       }
    
       // update current setting
       m_dwRecNo = dwRecNo;
       m_dwWordNo = dwWordNo;
    
       // add compressed data to buffer
       ASSERT(buffer[0] != 0);
       ASSERT(nbytes > 0 && nbytes < 10);
       if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
          FlushBuffer();
       CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
       m_dwBytes += nbytes;
    
       if (m_bDebugDump)
       {
          for (int i = 0; i < nbytes; ++i)
             TRACE("%02X ", buffer[i]);
          TRACE("\n");
       }
    }
    
    void IndexCompressor::End(void)
    {
       FlushBuffer();
       m_pFile->Write(WA_BYTE(0));
    }
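
    For reading the data back, here is a minimal sketch of a decoder for the byte layout described in the comments of Add() above. It is not part of the original code: DecodeHits, ReadLE and Hit are hypothetical names, and the sketch assumes the 16- and 32-bit fields were written on a little-endian machine (the encoder stores them through pointer casts in native byte order).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (record number, word number); hypothetical alias for illustration.
using Hit = std::pair<std::uint32_t, std::uint32_t>;

// Read an n-byte little-endian unsigned integer and advance pos.
// Assumption: the encoder ran on a little-endian machine.
static std::uint32_t ReadLE(const std::vector<std::uint8_t>& in,
                            std::size_t& pos, std::size_t n)
{
    assert(pos + n <= in.size());
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < n; ++i)
        v |= std::uint32_t(in[pos++]) << (8 * i);
    return v;
}

// Decode until the 0 terminator byte written by End(), or end of input.
std::vector<Hit> DecodeHits(const std::vector<std::uint8_t>& in)
{
    std::vector<Hit> hits;
    std::uint32_t recNo = 0, wordNo = 0;
    std::size_t pos = 0;
    while (pos < in.size() && in[pos] != 0) {
        const std::uint8_t head = in[pos++];
        std::uint32_t recDelta = 0, wordDelta = 0;
        if (head & 0x80) {                 // 1WWWWWWW
            wordDelta = head & 0x7F;
        } else if (head & 0x40) {          // 01WWWWWW WWWWWWWW (high bits first)
            wordDelta = std::uint32_t(head & 0x3F) << 8;
            wordDelta |= ReadLE(in, pos, 1);
        } else if (head & 0x20) {          // 001RRRRR + 16-bit word delta
            recDelta = head & 0x1F;
            wordDelta = ReadLE(in, pos, 2);
        } else {                           // 0001rrww + sized fields
            const std::size_t recBytes = (head & 0x08) ? 4 : (head & 0x04) ? 2 : 1;
            const std::size_t wordBytes = (head & 0x02) ? 4 : (head & 0x01) ? 2 : 1;
            recDelta = ReadLE(in, pos, recBytes);
            wordDelta = ReadLE(in, pos, wordBytes);
        }
        if (recDelta != 0)
            wordNo = 0;                    // word numbers restart in a new record
        recNo += recDelta;
        wordNo += wordDelta;
        hits.emplace_back(recNo, wordNo);
    }
    return hits;
}
```

    The decoder mirrors the encoder's state handling: the word number is reset whenever the record number changes, so word deltas are always relative to the previous hit in the same record.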
    
  • 2021-02-05 20:07

    Are you running on Windows? If so, I recommend creating the mmap file using the naive solution you originally proposed, and then compressing the file using NTFS compression. Your application code never knows the file is compressed, and the OS does the file compression for you. You might not think this would be very performant or get good compression, but I think you'll be surprised if you try it.

  • 2021-02-05 20:11

    You've omitted critical information about the number of strings you intend to index.

    But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64-bit integers incurs at most about 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur about 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.

    If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how well you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.
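
    The overhead figures can be checked with a line of arithmetic. The helper below is a hypothetical illustration (IndexOverhead is not from any library); it assumes one fixed-width index entry per string and uses the stated minimum string length of 256 bytes as the worst case.

```cpp
#include <cassert>
#include <cstddef>

// Worst-case fraction of extra space spent on the index:
// one fixed-width entry per string, divided by the shortest string length.
constexpr double IndexOverhead(std::size_t indexBytes, std::size_t minStringBytes)
{
    return double(indexBytes) / double(minStringBytes);
}

// 64-bit entries: 8 / 256 = 0.03125, roughly 3%.
// 32-bit entries: 4 / 256 = 0.015625, roughly 1.5%.
```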

    More data on the problem would be welcome:

    • Number of strings
    • Average length of a string
    • Maximum length of a string
    • Median length of strings
    • Degree to which the strings file compresses with gzip
    • Whether you are allowed to change the order of strings to improve compression

    EDIT

    The comment and revised question make the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding, group the deltas, and use a variable-length code within each group. I wouldn't wire in 64 as the group size; I think you will probably want to determine that empirically.
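
    A minimal sketch of that scheme, under stated assumptions: offsets are sorted in ascending order, each group keeps the absolute offset of its first entry so a lookup never decodes more than one group, and the variable-length code is a 7-bits-per-byte varint (LEB128 style) chosen here only for illustration. All names (GroupedOffsets, CompressOffsets, LookupOffset, PutVarint, GetVarint) are hypothetical, and the group size is a parameter so it can be tuned empirically.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GroupedOffsets {
    std::size_t groupSize = 0;
    std::size_t count = 0;
    std::vector<std::uint64_t> firstOffset;  // absolute offset of each group's first entry
    std::vector<std::size_t> groupStart;     // where each group's deltas begin in `deltas`
    std::vector<std::uint8_t> deltas;        // varint-coded deltas for all other entries
};

// Append v as a varint: 7 payload bits per byte, high bit set on all but the last byte.
static void PutVarint(std::vector<std::uint8_t>& out, std::uint64_t v)
{
    while (v >= 0x80) {
        out.push_back(std::uint8_t((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(std::uint8_t(v));
}

// Read one varint starting at pos and advance pos past it.
static std::uint64_t GetVarint(const std::vector<std::uint8_t>& in, std::size_t& pos)
{
    std::uint64_t v = 0;
    int shift = 0;
    while (in[pos] & 0x80) {
        v |= std::uint64_t(in[pos++] & 0x7F) << shift;
        shift += 7;
    }
    v |= std::uint64_t(in[pos++]) << shift;
    return v;
}

GroupedOffsets CompressOffsets(const std::vector<std::uint64_t>& offsets,
                               std::size_t groupSize)
{
    assert(groupSize > 0);
    GroupedOffsets g;
    g.groupSize = groupSize;
    g.count = offsets.size();
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        if (i % groupSize == 0) {
            g.firstOffset.push_back(offsets[i]);      // group head stored in full
            g.groupStart.push_back(g.deltas.size());
        } else {
            assert(offsets[i] >= offsets[i - 1]);     // input must be sorted
            PutVarint(g.deltas, offsets[i] - offsets[i - 1]);
        }
    }
    return g;
}

// Random access: jump to the group, then sum at most groupSize - 1 deltas.
std::uint64_t LookupOffset(const GroupedOffsets& g, std::size_t i)
{
    assert(i < g.count);
    const std::size_t group = i / g.groupSize;
    std::uint64_t value = g.firstOffset[group];
    std::size_t pos = g.groupStart[group];
    for (std::size_t k = 0; k < i % g.groupSize; ++k)
        value += GetVarint(g.deltas, pos);
    return value;
}
```

    Larger groups spend less on group heads but make each lookup decode more deltas, which is the trade-off to measure when picking the group size.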

    You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try: in its original six-byte form it can code unsigned integers up to 31 bits, and you can grab C code from Plan 9 and I'm sure many other sources.
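
    To illustrate the UTF-8 idea, here is a minimal sketch of an encoder that follows the original six-byte UTF-8 layout, which covers values below 2^31 (current UTF-8 for text stops at four bytes). EncodeUtf8Style is a hypothetical name; this is not the Plan 9 code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encode v (must be < 2^31) into out, returning the number of bytes written (1..6).
std::size_t EncodeUtf8Style(std::uint32_t v, std::uint8_t out[6])
{
    assert(v < 0x80000000u);
    if (v < 0x80) {                       // 0xxxxxxx
        out[0] = std::uint8_t(v);
        return 1;
    }
    const std::size_t n = v < 0x800     ? 2
                        : v < 0x10000   ? 3
                        : v < 0x200000  ? 4
                        : v < 0x4000000 ? 5
                                        : 6;
    // Continuation bytes 10xxxxxx, filled from last to first, 6 bits each.
    for (std::size_t i = n - 1; i > 0; --i) {
        out[i] = std::uint8_t(0x80 | (v & 0x3F));
        v >>= 6;
    }
    // Lead byte: n one-bits, a zero bit, then the remaining high payload bits.
    out[0] = std::uint8_t((0xFF00u >> n) | v);
    return n;
}
```

    Small deltas below 128 cost one byte and deltas below 2048 cost two, so this fits an index whose neighbouring offsets are close together.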

  • 2021-02-05 20:15

    You have two conflicting requirements:

    1. You want to compress very small items (8 bytes each).
    2. You need efficient random access for each item.

    The second requirement is very likely to impose a fixed length for each item.

  • 2021-02-05 20:24

    I use FastBit (Kesheng Wu, LBL.GOV). It seems you need something good, fast and NOW, and FastBit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, berkeleydb). It's easy to set up and very good generally.

    However, given more time, you may want to look at a Gray code solution; it seems optimal for your purposes.

    Daniel Lemire has a number of libraries for C/C++/Java released on code.google. I've read over some of his papers and they are quite nice, with several advancements on FastBit and alternative approaches for column re-ordering with permuted Gray codes.

    Almost forgot: I also came across Tokyo Cabinet. Though I do not think it is well suited to my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:

    Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.

    As you referred to CDB, the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) where it surpassed CDB by 10 times for read performance and 2 times for write.

    With respect to your delta encoding requirement, I am quite confident in bsdiff and its ability to out-perform any file.exe content patching system; it may also have some fundamental interfaces for your general needs.

    Google's new binary compression application, Courgette, may be worth checking out in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.

  • 2021-02-05 20:24

    What exactly are you trying to compress? If you are thinking about the total space of the index, is it really worth the effort to save the space?

    If so, one thing you could try is to chop the space in half and store it in two tables. The first stores (upper uint, start index, length, pointer to second table) and the second stores (index, lower uint).

    For fast searching, indices would be implemented using something like B+ Tree.
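
    A minimal in-memory sketch of the two-table idea, assuming sorted 64-bit offsets split into upper and lower 32-bit halves. The names (UpperRun, SplitIndex, BuildSplitIndex, LookupSplit) are hypothetical, the position in the lower table doubles as the "pointer to second table", and a binary search over a sorted vector stands in for the B+ tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// First table: one row per run of entries sharing the same upper 32 bits.
struct UpperRun {
    std::uint32_t upper;     // shared high half
    std::size_t startIndex;  // index of the run's first entry (also its slot in `lower`)
    std::size_t length;      // number of entries in the run
};

struct SplitIndex {
    std::vector<UpperRun> runs;        // first table
    std::vector<std::uint32_t> lower;  // second table: low half of every entry
};

SplitIndex BuildSplitIndex(const std::vector<std::uint64_t>& offsets)
{
    SplitIndex s;
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        const std::uint32_t upper = std::uint32_t(offsets[i] >> 32);
        if (s.runs.empty() || s.runs.back().upper != upper)
            s.runs.push_back({upper, i, 0});   // sorted input: a new upper half starts a run
        ++s.runs.back().length;
        s.lower.push_back(std::uint32_t(offsets[i]));  // keeps the low 32 bits
    }
    return s;
}

std::uint64_t LookupSplit(const SplitIndex& s, std::size_t i)
{
    assert(i < s.lower.size());
    // Find the first run starting after i, then step back to the run containing i.
    auto it = std::upper_bound(
        s.runs.begin(), s.runs.end(), i,
        [](std::size_t idx, const UpperRun& r) { return idx < r.startIndex; });
    --it;
    return (std::uint64_t(it->upper) << 32) | s.lower[i];
}
```

    With this layout each entry costs 4 bytes plus one small row per 4GB of string data, so the saving over plain 64-bit offsets approaches one half.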
