Fast search in compressed text files

旧巷少年郎 2021-02-04 18:56

I need to be able to search for text in a large number of files (.txt) that are zipped. The compression may be changed to something else or may even become proprietary. I want to avoid

5 Answers
  •  面向向阳花
    2021-02-04 19:37

    Searching for text in compressed files can be faster than searching for the same thing in uncompressed text files.

    One compression technique I've seen that sacrifices some space in order to do fast searches:

    • maintain a dictionary of up to 2^16 entries, one for each distinct word in the text. Reserve the first 256 entries for literal bytes, in case you come across a word that isn't in the dictionary -- though many large texts have fewer than 32,000 unique words, so they never need those literal bytes.
    • Compress the original text by substituting the 16-bit dictionary index for each word.
    • (optional) In the common case where two words are separated by a single space character, discard that space character; otherwise, put the whole byte string between two words into the dictionary as a special "word" (for example, ". " and ", " and "\n"), tagged with a "no default space" attribute, and then "compress" those strings by replacing them with the corresponding dictionary index.
    • Search for a word or phrase by compressing it in the same way, then searching for the resulting compressed byte string in the compressed text exactly as you would search for the original string in the original text (see the sketch after this list).
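
    Here is a minimal sketch of that scheme in Python, assuming a simple whitespace tokenizer (so punctuation stays attached to words) and big-endian 2-byte codes; all names here (build_dictionary, compress, LITERAL_BASE) are hypothetical, not from any particular library:

        import re
        import struct

        LITERAL_BASE = 256  # codes 0..255 are reserved for literal bytes

        def build_dictionary(text):
            # One 16-bit code per distinct token: words, plus any separator
            # run that is not a single space (". ", "\n", ...).
            dictionary = {}
            for tok in re.findall(r'\S+|\s+', text):
                if tok != ' ' and tok not in dictionary:
                    dictionary[tok] = LITERAL_BASE + len(dictionary)
            assert len(dictionary) <= 0x10000 - LITERAL_BASE, "too many distinct tokens"
            return dictionary

        def compress(text, dictionary):
            # Replace each token with its 2-byte code; single spaces are
            # implied between words and simply dropped.
            out = bytearray()
            for tok in re.findall(r'\S+|\s+', text):
                if tok == ' ':
                    continue
                code = dictionary.get(tok)
                if code is None:
                    # Unknown token: fall back to literal bytes (codes 0..255).
                    for b in tok.encode('utf-8'):
                        out += struct.pack('>H', b)
                else:
                    out += struct.pack('>H', code)
            return bytes(out)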

    In particular, searching for a single word usually reduces to scanning for a 16-bit index in the compressed text, which is faster than searching for that word in the original text (a search sketch follows the list below), because

    • each comparison requires comparing fewer bytes -- 2, rather than however many bytes were in that word, and
    • we're doing fewer comparisons, because the compressed file is shorter.
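
    Continuing the sketch above, the search side compresses the query with the same dictionary and then does a plain byte search, rejecting matches that don't fall on a 2-byte code boundary; find_word is again a hypothetical name:

        import struct

        def find_word(word, dictionary, compressed):
            # Compress the query the same way, then search for the raw bytes.
            code = dictionary.get(word)
            if code is None:
                return -1  # absent from the dictionary, so never encoded as one code
            needle = struct.pack('>H', code)
            i = compressed.find(needle)
            while i != -1 and i % 2 != 0:
                # Odd offsets straddle two codes; keep looking for an aligned hit.
                i = compressed.find(needle, i + 1)
            return i // 2 if i != -1 else -1  # token position, or -1 if not found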

    Some kinds of regular expressions can be translated into another regular expression that directly finds matches in the compressed file (perhaps along with a few false positives). Such a search also does fewer comparisons than running the original regular expression on the original text file, because the compressed file is shorter; but each regular-expression comparison typically requires more work, so it may or may not be faster than the original regex operating on the original text.
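
    As one hedged illustration of such a translation (reusing the hypothetical dictionary from the sketch above, not a general regex compiler): a simple alternation of literal words such as cat|dog can be rewritten as an alternation of their 2-byte codes and run directly over the compressed bytes, with the alignment check supplying the extra per-match work mentioned above:

        import re
        import struct

        def compile_word_alternation(words, dictionary):
            # Translate "w1|w2|..." into an alternation of 2-byte codes.
            parts = [re.escape(struct.pack('>H', dictionary[w]))
                     for w in words if w in dictionary]
            return re.compile(b'|'.join(parts)) if parts else None

        # pattern = compile_word_alternation(['cat', 'dog'], dictionary)
        # hits = [m.start() // 2 for m in pattern.finditer(compressed)
        #         if m.start() % 2 == 0]  # drop misaligned false positives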

    (In principle you could replace the fixed-length 16-bit codes with variable-length Huffman prefix codes, as rwong mentioned -- the resulting compressed file would be smaller, but the software to deal with those files would be a little slower and more complicated.)

    For more sophisticated techniques, you might look at

    • MG4J: Managing Gigabytes for Java
    • "Managing Gigabytes: Compressing and Indexing Documents and Images" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
