Fast search in compressed text files

旧巷少年郎 2021-02-04 18:56

I need to be able to search for text in a large number of files (.txt) that are zipped. The compression may be changed to something else or may even become proprietary. I want to avoid …

5 Answers
  •  一向 (OP)
     2021-02-04 19:48

    Most text files are compressed with one of the LZ family of algorithms, which combine a dictionary coder with an entropy coder such as Huffman coding.

    Because the dictionary coder relies on a continuously updated dictionary, its output depends on the history of the stream: the dictionary is built from all of the input data up to the current symbol. It is therefore not possible to jump to an arbitrary location and start decoding without first decoding all of the preceding data.

    In my opinion, you can simply use a zlib stream decoder, which returns decompressed data incrementally instead of waiting for the entire file to be decompressed. This will not save execution time, but it will save memory.
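
    A minimal sketch of that streaming approach in Python, using the standard library's zlib.decompressobj (the function name, file path, search string, and chunk size are my own illustrative choices; for .gz files you could use the gzip module or pass wbits=zlib.MAX_WBITS | 16 instead):

    ```python
    import zlib

    def search_zlib_stream(path, needle, chunk_size=64 * 1024):
        """Scan a zlib-compressed file for `needle` chunk by chunk,
        never holding the whole decompressed text in memory.
        Returns the byte offset of the first match, or -1."""
        d = zlib.decompressobj()
        tail = b""   # overlap so matches spanning chunk edges are found
        offset = 0   # decompressed-stream offset of the start of `tail`
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                data = tail + d.decompress(chunk)
                idx = data.find(needle)
                if idx != -1:
                    return offset + idx
                keep = min(len(needle) - 1, len(data))
                offset += len(data) - keep
                tail = data[-keep:] if keep else b""
        # check any bytes still buffered inside the decompressor
        data = tail + d.flush()
        idx = data.find(needle)
        return offset + idx if idx != -1 else -1
    ```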

    A second suggestion is to apply Huffman coding to English words directly and drop the dictionary coder part: each English word gets mapped to a unique prefix-free code.
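
    A toy sketch of word-level Huffman coding, assuming whitespace-separated input (the function name is mine, and a real encoder would also have to encode punctuation and whitespace):

    ```python
    import heapq
    from collections import Counter
    from itertools import count

    def word_huffman_codes(text):
        """Assign a prefix-free bit string to each distinct word,
        giving shorter codes to more frequent words."""
        freq = Counter(text.split())
        tie = count()  # tie-breaker so heapq never compares subtrees
        heap = [(n, next(tie), word) for word, n in freq.items()]
        heapq.heapify(heap)
        if len(heap) == 1:  # degenerate single-word input
            return {heap[0][2]: "0"}
        while len(heap) > 1:
            n1, _, left = heapq.heappop(heap)
            n2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (n1 + n2, next(tie), (left, right)))
        codes = {}
        def walk(node, prefix=""):
            if isinstance(node, tuple):   # internal node
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                         # leaf word
                codes[node] = prefix
        walk(heap[0][2])
        return codes
    ```

    Frequent words end up with the shortest codes, so common English words like "the" cost only a few bits each.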

    Finally, @SHODAN gave the most sensible suggestion: index the files, compress the index, and bundle it with the compressed text files. To do a search, decompress just the index file and look up the words in it. This is in fact an improvement over doing Huffman coding on words: once you have counted the word frequencies (in order to assign the prefix codes optimally), you have already built the index, so you can keep it around for searching.
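
    A bare-bones version of that index-then-compress scheme might look like this (the JSON-in-gzip format, file names, and function names are assumptions for illustration, not anything from the answer):

    ```python
    import gzip
    import json
    import os

    def build_index(txt_paths, index_path="index.json.gz"):
        """Map each word to the list of files containing it, then store
        the index gzip-compressed alongside the compressed text files."""
        index = {}
        for path in txt_paths:
            with open(path, encoding="utf-8") as f:
                for word in set(f.read().split()):
                    index.setdefault(word, []).append(os.path.basename(path))
        with gzip.open(index_path, "wt", encoding="utf-8") as out:
            json.dump(index, out)

    def lookup(word, index_path="index.json.gz"):
        """Decompress only the index and return the files containing `word`."""
        with gzip.open(index_path, "rt", encoding="utf-8") as f:
            return json.load(f).get(word, [])
    ```

    Because only the index is decompressed at query time, the per-search cost is independent of the number and size of the compressed text files.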
