I have the very common problem of creating an index for an in-disk array of strings. In short, I need to store the position of each string in the in-disk representation. For exa
You've omitted critical information about the number of strings you intend to index.
But given that you say you expect the minimum length of an indexed string to be 256, storing the indices as 64-bit integers incurs at most 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur 1.5% overhead. These numbers suggest to me that if compression matters, you're better off compressing the strings, not the indices. For that problem a variation on LZ77 seems in order.
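To make the arithmetic behind those percentages explicit, here is a minimal sketch. The function names are mine, and it assumes one fixed-width index entry per string and that every string is at least 256 bytes long, as stated in the question.

```python
def index_overhead(min_string_bytes, index_bytes):
    """Worst-case index size as a fraction of the string data.

    Assumes one fixed-width index entry per string and that every
    string is at least min_string_bytes long (hypothetical helper).
    """
    return index_bytes / min_string_bytes


def max_addressable_bytes(index_bytes):
    """Largest file size whose offsets fit in an unsigned index of this width."""
    return 2 ** (8 * index_bytes)


print(index_overhead(256, 8))      # 64-bit indices: 0.03125, about 3%
print(index_overhead(256, 4))      # 32-bit indices: 0.015625, about 1.5%
print(max_addressable_bytes(4))    # 4294967296 bytes, i.e. 4GB
```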
If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how well you can do with zziplib. This probably won't be great, but it's nearly zero work on your part.
More data on the problem would be welcome:
gzip
The comment and revised question make the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding, group the deltas, and use a variable-length code within each group. I wouldn't wire in 64 as the group size; I think you will probably want to determine that empirically.
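A rough sketch of that scheme follows, not a tuned implementation. It assumes the offsets are sorted in increasing order, stores the first offset of each group uncompressed, and encodes the remaining deltas with a LEB128-style byte code (7 payload bits per byte), which is just one possible variable-length code. All function names here are hypothetical, and `group_size` is a parameter so it can be tuned empirically.

```python
def encode_varint(n):
    """LEB128-style code: 7 payload bits per byte, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # last byte
            return bytes(out)


def decode_varint(buf, pos):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def build_index(offsets, group_size):
    """Return (group_starts, blob).

    group_starts[g] is (first offset of group g, position of its deltas in blob);
    blob holds the varint-coded deltas of all groups back to back.
    """
    group_starts = []
    blob = bytearray()
    for g in range(0, len(offsets), group_size):
        group = offsets[g:g + group_size]
        group_starts.append((group[0], len(blob)))
        for prev, cur in zip(group, group[1:]):
            blob += encode_varint(cur - prev)
    return group_starts, bytes(blob)


def lookup(group_starts, blob, group_size, i):
    """Recover the offset of string i by summing deltas within its group."""
    offset, pos = group_starts[i // group_size]
    for _ in range(i % group_size):
        delta, pos = decode_varint(blob, pos)
        offset += delta
    return offset


offsets = [0, 300, 700, 1500, 1800, 5000, 5256]
starts, blob = build_index(offsets, group_size=3)
assert [lookup(starts, blob, 3, i) for i in range(len(offsets))] == offsets
```

A lookup decodes at most `group_size - 1` deltas, so a larger group saves space on group headers but costs more decoding per access; measuring both on real data is the way to pick the size.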
You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in Perl and Python. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try: in its original form it can code unsigned integers up to 31 bits (the current standard stops at U+10FFFF), and you can grab C code from Plan 9 and I'm sure many other sources.
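To show what using UTF-8 as an integer code might look like, here is a minimal sketch of the original multi-byte layout (1 to 6 bytes, values up to 0x7FFFFFFF). This is not the Plan 9 code and the function names are hypothetical. It also skips the validation a real decoder needs, such as rejecting overlong forms and checking continuation bytes.

```python
# (byte count, exclusive upper limit, lead-byte prefix) for multi-byte forms
_FORMS = (
    (2, 0x800, 0xC0),
    (3, 0x10000, 0xE0),
    (4, 0x200000, 0xF0),
    (5, 0x4000000, 0xF8),
    (6, 0x80000000, 0xFC),
)


def utf8_style_encode(n):
    """Encode 0 <= n <= 0x7FFFFFFF using the original UTF-8 byte layout."""
    if n < 0 or n > 0x7FFFFFFF:
        raise ValueError("value out of range for this code")
    if n < 0x80:
        return bytes([n])
    for nbytes, limit, prefix in _FORMS:
        if n < limit:
            out = bytearray(nbytes)
            # Fill continuation bytes from the end, 6 payload bits each.
            for i in range(nbytes - 1, 0, -1):
                out[i] = 0x80 | (n & 0x3F)
                n >>= 6
            out[0] = prefix | n
            return bytes(out)


def utf8_style_decode(buf, pos=0):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0 or first >= 0xFE:
        raise ValueError("not a valid lead byte")
    # Count the leading 1 bits after the first two to find the length.
    nbytes, mask = 2, 0x20
    while first & mask:
        nbytes += 1
        mask >>= 1
    value = first & (mask - 1)
    for i in range(1, nbytes):
        value = (value << 6) | (buf[pos + i] & 0x3F)
    return value, pos + nbytes


# For values that are valid code points, this agrees with Python's own encoder.
assert utf8_style_encode(0x20AC) == "\u20ac".encode("utf-8")
assert utf8_style_decode(utf8_style_encode(0x7FFFFFFF)) == (0x7FFFFFFF, 6)
```

One property worth noting for the index use case: the lead byte alone tells you the length of the code, so a decoder can skip over entries without examining every byte.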