Fast search in compressed text files

前端 未结 5 634
旧巷少年郎
旧巷少年郎 2021-02-04 18:56

I need to be able to search for text in a large number of files (.txt) that are zipped. Compression may be changed to something else or even became proprietary. I want to avoid

5条回答
  •  一向
    一向 (楼主)
    2021-02-04 19:48

    Most text files are compressed with one of the LZ-family of algorithms, which combine a Dictionary Coder together with an Entropy Coder such as Huffman.

    Because the Dictionary Coder relies on a continuously-updated "dictionary", its coding result is dependent on the history (all codes in the dictionary that is derived from the input data up to the current symbol), so it is not possible to jump into a certain location and start decoding, without first decoding all of the previous data.

    In my opinion, you can just use a zlib stream decoder which returns decompressed data as it goes without waiting for the entire file to be decompressed. This will not save execution time but will save memory.

    A second suggestion is to do Huffman coding on English words, and forget about the Dictionary Coder part. Each English word gets mapped to a unique prefix-free code.

    Finally, @SHODAN gave the most sensible suggestion, which is to index the files, compress the index and bundle with the compressed text files. To do a search, decompress just the index file and look up the words. This is in fact an improvement over doing the Huffman coding on words - once you found the frequency of words (in order to assign the prefix code optimally), you have already built the index, so you can keep the index for searching.

提交回复
热议问题