Text packing algorithm

j_random_hacker

This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
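
To make that first step concrete, here is a minimal Java sketch of the substring-elimination pass. It uses a naive quadratic contains() check purely to show the idea; the generalised suffix tree mentioned above is what you would use to do this efficiently, and the class and method names are just placeholders.

import java.util.ArrayList;
import java.util.List;

// Sketch of the substring-elimination step. A generalised suffix tree makes
// this efficient; a naive O(n^2 * m) contains() check is used here only to
// illustrate the idea.
public class RemoveContained {
    static List<String> removeContained(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            boolean contained = false;
            for (int j = 0; j < words.size(); j++) {
                if (i == j) continue;
                String other = words.get(j);
                // Treat equal strings as contained only once (keep the first copy).
                if (other.contains(w) && (!other.equals(w) || j < i)) {
                    contained = true;
                    break;
                }
            }
            if (!contained) kept.add(w); // record w's position inside 'other' here if needed
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeContained(List.of("doll", "ragdoll", "dollhouse")));
        // prints [ragdoll, dollhouse]
    }
}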

Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
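
Below is a rough Java sketch of that greedy merging loop, assuming the fully-contained strings have already been removed. It recomputes overlaps naively on each iteration, so it only illustrates the heuristic itself, not the radix-tree or suffix-tree speedups, and the names are made up for the example.

import java.util.ArrayList;
import java.util.List;

// Greedy superstring heuristic: repeatedly merge the pair with the largest
// suffix/prefix overlap.
public class GreedySuperstring {
    // Length of the longest suffix of a that is a prefix of b.
    static int overlap(String a, String b) {
        int max = Math.min(a.length(), b.length());
        for (int len = max; len > 0; len--) {
            if (a.regionMatches(a.length() - len, b, 0, len)) return len;
        }
        return 0;
    }

    static String merge(List<String> strings) {
        List<String> s = new ArrayList<>(strings);
        while (s.size() > 1) {
            int bestI = 0, bestJ = 1, bestOv = -1;
            for (int i = 0; i < s.size(); i++) {
                for (int j = 0; j < s.size(); j++) {
                    if (i == j) continue;
                    int ov = overlap(s.get(i), s.get(j));
                    if (ov > bestOv) { bestOv = ov; bestI = i; bestJ = j; }
                }
            }
            String merged = s.get(bestI) + s.get(bestJ).substring(bestOv);
            // Remove the higher index first so the lower index stays valid.
            s.remove(Math.max(bestI, bestJ));
            s.remove(Math.min(bestI, bestJ));
            s.add(merged);
        }
        return s.get(0);
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of("ragdoll", "dollhouse", "housecat")));
        // prints "ragdollhousecat"
    }
}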

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.

Qubeuc

I think you can use a Radix Tree. It costs some memory because of pointers to leaves and parents, but it makes matching strings easy, in O(k) time, where k is the length of the longest string.

My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would produce your desired 'ragdollhouse'.

Looks similar to the Knapsack problem, which is NP-complete, so there is no "definitive" efficient algorithm.

I did a lab back in college where we were tasked with implementing a simple compression program.

What we did was sequentially apply these techniques to text:

  • BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical shortcuts for obtaining the transformed letters instead of actually performing all the rotations)
  • MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
  • Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols

Here, I found the assignment page.

To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
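
As an illustration of the middle stage, here is a small Java sketch of the move-to-front transform (the BWT and Huffman stages are omitted). It shows how MTF turns the runs of identical letters produced by the BWT into runs of zeros, which the entropy coder then compresses well; the 256-symbol byte alphabet is an assumption for the example.

import java.util.ArrayList;
import java.util.List;

// Move-to-front (MTF) encode/decode over a 256-symbol alphabet.
public class MoveToFront {
    static int[] encode(String text) {
        List<Character> alphabet = new ArrayList<>();
        for (char c = 0; c < 256; c++) alphabet.add(c);
        int[] out = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int idx = alphabet.indexOf(c);
            out[i] = idx;
            alphabet.remove(idx);   // move the symbol to the front of the list
            alphabet.add(0, c);
        }
        return out;
    }

    static String decode(int[] codes) {
        List<Character> alphabet = new ArrayList<>();
        for (char c = 0; c < 256; c++) alphabet.add(c);
        StringBuilder sb = new StringBuilder();
        for (int idx : codes) {
            char c = alphabet.get(idx);
            sb.append(c);
            alphabet.remove(idx);
            alphabet.add(0, c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = encode("aaabbb");  // runs of identical letters -> mostly zeros
        System.out.println(java.util.Arrays.toString(codes)); // [97, 0, 0, 98, 0, 0]
        System.out.println(decode(codes));                    // aaabbb
    }
}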

Refine step 3.

  • Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
  • If yes, then prepend the non-overlapping prefix of the current word to that existing word, and adjust all existing references appropriately (slow!)
  • If no, add word to end of list as in current step 3.

This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
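
A rough Java sketch of this refined step 3 might look like the following; the MIN_OVERLAP constant and the omitted reference adjustment are assumptions for illustration.

import java.util.ArrayList;
import java.util.List;

// When an existing entry starts with a suffix of the incoming word, prepend
// the non-overlapping part of the word to that entry instead of storing the
// word separately. Shifting the existing references is not shown.
public class PrefixMergeList {
    static final int MIN_OVERLAP = 2; // ignore overlaps shorter than this

    static void addWord(List<String> stored, String word) {
        for (int i = 0; i < stored.size(); i++) {
            String entry = stored.get(i);
            // Longest suffix of 'word' that is a prefix of 'entry'.
            int max = Math.min(word.length(), entry.length());
            for (int len = max; len >= MIN_OVERLAP; len--) {
                if (entry.startsWith(word.substring(word.length() - len))) {
                    stored.set(i, word.substring(0, word.length() - len) + entry);
                    return;
                }
            }
        }
        stored.add(word); // no usable overlap: append as in the original step 3
    }

    public static void main(String[] args) {
        List<String> stored = new ArrayList<>();
        addWord(stored, "dollhouse");
        addWord(stored, "ragdoll");
        System.out.println(stored); // [ragdollhouse]
    }
}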

I would not reinvent this wheel yet again. An enormous amount of effort has already gone into compression algorithms, so why not use one of the readily available ones?

Here are a few good choices:

  • gzip for fast compression / decompression speed
  • bzip2 for a bit better compression but much slower decompression
  • LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
  • lzop for very fast compression / decompression

If you use Java, gzip is already integrated.
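
For example, a gzip round trip in Java needs nothing beyond java.util.zip, which ships with the JDK:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Compress and decompress a string with the JDK's built-in gzip streams.
public class GzipRoundTrip {
    static byte[] compress(String text) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    static String decompress(byte[] data) throws Exception {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] packed = compress("ragdoll dollhouse doll house doll doll house");
        System.out.println(packed.length + " bytes compressed");
        System.out.println(decompress(packed));
    }
}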

It's not clear what you want to do.

Do you want a data structure that lets you store the strings in a memory-conscious manner while still allowing operations like search in a reasonable amount of time?

Do you just want an array of words, compressed?

In the first case, you can go for a patricia trie or a String B-Tree.

For the second case, you can just adopt some index compression technique, like this:

If you have something like:

aaa 
aaab
aasd
abaco
abad

You can compress them like this:

0aaa
3b
2sd
1baco
3d

The number is the length of the longest common prefix with the preceding string. You can tweak the scheme, for example by "restarting" the common prefix every K words, to allow fast reconstruction.
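
Here is a small Java sketch of that front-coding idea, encoding each word as the shared-prefix length plus the distinct suffix. The single-digit length parsing in decode() is a simplification; a real encoder would use a varint or fixed-width length field.

import java.util.ArrayList;
import java.util.List;

// Front coding: each entry stores the length of the longest common prefix
// with the previous word followed by the remaining suffix.
public class FrontCoding {
    static List<String> encode(List<String> sortedWords) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String w : sortedWords) {
            int lcp = 0;
            int max = Math.min(prev.length(), w.length());
            while (lcp < max && prev.charAt(lcp) == w.charAt(lcp)) lcp++;
            out.add(lcp + w.substring(lcp));
            prev = w;
        }
        return out;
    }

    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            // Parse the leading prefix length (assumes the suffix itself does
            // not start with a digit).
            int split = 0;
            while (split < e.length() && Character.isDigit(e.charAt(split))) split++;
            int lcp = Integer.parseInt(e.substring(0, split));
            String w = prev.substring(0, lcp) + e.substring(split);
            out.add(w);
            prev = w;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = List.of("aaa", "aaab", "aasd", "abaco", "abad");
        System.out.println(encode(words)); // [0aaa, 3b, 2sd, 1baco, 3d]
        System.out.println(decode(encode(words)));
    }
}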
