Lossless hierarchical run length encoding

花落未央 2020-12-30 08:39

I want to summarize, rather than compress, a sequence in a manner similar to run-length encoding, but in a nested sense.

For instance, I want ABCBCABCBCDEEF to become: (2A(2BC

2 Answers
  • 2020-12-30 09:05

    I'm pretty sure this isn't the best approach, and depending on the length of the patterns, its running time and memory usage might be impractical, but here's some code.

    You can paste the following code into LINQPad and run it, and it should produce the following output:

    ABCBCABCBCDEEF = (2A(2BC))D(2E)F
    ABBABBABBABA = (3A(2B))ABA
    ABCDABCDCDCDCD = (2ABCD)(3CD)
    

    As you can see, the middle example encoded ABB as A(2B) instead of ABB. You would have to make that judgment yourself: whether single-symbol sequences like that should be encoded as a repeated symbol at all, or whether a specific threshold (like 3 or more) should be used.
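One way the threshold judgment could look is sketched here. This is only an illustration: `RunFormatting`, `Render` and `minSingleSymbolRun` are hypothetical names that do not appear in the code below, and the sketch assumes the body of a repeat has already been rendered to a string.

```csharp
public static class RunFormatting
{
    // Renders `count` repetitions of an already-encoded body.
    // A run of one symbol that is shorter than minSingleSymbolRun is
    // written out literally (e.g. "BB") instead of as "(2B)".
    public static string Render(string body, int count, int minSingleSymbolRun)
    {
        if (count == 1)
            return body;

        bool isSingleSymbol = body.Length == 1;
        if (isSingleSymbol && count < minSingleSymbolRun)
            return new string(body[0], count);

        return "(" + count + body + ")";
    }
}
```

With a threshold of 3, `Render("B", 2, 3)` keeps the run as `BB`, while `Render("B", 3, 3)` produces `(3B)`; multi-symbol bodies such as `BC` are always wrapped.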

    Basically, the code runs like this:

    1. For each position in the sequence, try to find the longest match (actually, it doesn't: it takes the first block that repeats at least twice, trying the shortest block length first; I left the rest as an exercise for you, since I have to leave my computer for a few hours now)
    2. It then tries to encode that sequence, the one that repeats, recursively, and spits out an X*seq type of object
    3. If it can't find a repeating sequence, it spits out the single symbol at that location
    4. It then skips what it encoded, and continues from #1

    Anyway, here's the code:

    void Main()
    {
        string[] examples = new[]
        {
            "ABCBCABCBCDEEF",
            "ABBABBABBABA",
            "ABCDABCDCDCDCD",
        };
    
        foreach (string example in examples)
        {
            StringBuilder sb = new StringBuilder();
            foreach (var r in Encode(example))
                sb.Append(r.ToString());
            Debug.WriteLine(example + " = " + sb.ToString());
        }
    }
    
    public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
    {
        return Encode<T>(values, EqualityComparer<T>.Default);
    }
    
    public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
    {
        List<T> sequence = new List<T>(values);
    
        int index = 0;
        while (index < sequence.Count)
        {
            var bestSequence = FindBestSequence<T>(sequence, index, comparer);
            if (bestSequence == null || bestSequence.Length < 1)
                throw new InvalidOperationException("Unable to find sequence at position " + index);
    
            yield return bestSequence;
            index += bestSequence.Length;
        }
    }
    
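    // Returns the first repeating block found at startIndex (shortest block
    // length first, with as many consecutive repeats as follow), or a single
    // symbol when nothing repeats at that position.
    private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)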
    {
        int sequenceLength = 1;
        while (startIndex + sequenceLength * 2 <= sequence.Count)
        {
            if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
            {
                bool atLeast2Repeats = true;
                for (int index = 0; index < sequenceLength; index++)
                {
                    if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
                    {
                        atLeast2Repeats = false;
                        break;
                    }
                }
                if (atLeast2Repeats)
                {
                    int count = 2;
                    while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
                    {
                        bool anotherRepeat = true;
                        for (int index = 0; index < sequenceLength; index++)
                        {
                            if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
                            {
                                anotherRepeat = false;
                                break;
                            }
                        }
                        if (anotherRepeat)
                            count++;
                        else
                            break;
                    }
    
                    List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
                    var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
                    return new SequenceRepeat<T>(count, repeatedSequence);
                }
            }
    
            sequenceLength++;
        }
    
        // fall back, we could not find anything that repeated at all
        return new SingleSymbol<T>(sequence[startIndex]);
    }
    
    public abstract class Repeat<T>
    {
        public int Count { get; private set; }
    
        protected Repeat(int count)
        {
            Count = count;
        }
    
        public abstract int Length
        {
            get;
        }
    }
    
    public class SingleSymbol<T> : Repeat<T>
    {
        public T Value { get; private set; }
    
        public SingleSymbol(T value)
            : base(1)
        {
            Value = value;
        }
    
        public override string ToString()
        {
            return string.Format("{0}", Value);
        }
    
        public override int Length
        {
            get
            {
                return Count;
            }
        }
    }
    
    public class SequenceRepeat<T> : Repeat<T>
    {
        public Repeat<T>[] Values { get; private set; }
    
        public SequenceRepeat(int count, Repeat<T>[] values)
            : base(count)
        {
            Values = values;
        }
    
        public override string ToString()
        {
            return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
        }
    
        public override int Length
        {
            get
            {
                int oneLength = 0;
                foreach (var value in Values)
                    oneLength += value.Length;
                return Count * oneLength;
            }
        }
    }
    
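    // Note: nothing above creates a GroupRepeat; Encode only produces
    // SingleSymbol and SequenceRepeat instances.
    public class GroupRepeat<T> : Repeat<T>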
    {
        public Repeat<T> Group { get; private set; }
    
        public GroupRepeat(int count, Repeat<T> group)
            : base(count)
        {
            Group = group;
        }
    
        public override string ToString()
        {
            return string.Format("({0}{1})", Count, Group);
        }
    
        public override int Length
        {
            get
            {
                return Count * Group.Length;
            }
        }
    }
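Since the goal is a lossless summary, the printed notation can be checked by expanding it again. The following is a minimal sketch of such a decoder, not part of the answer's code: `NestedRunLength`, `Decode` and `DecodeGroup` are hypothetical names, and it assumes every symbol is a single character other than a digit or a parenthesis (otherwise the notation is ambiguous).

```csharp
using System;
using System.Text;

public static class NestedRunLength
{
    // Expands a string such as "(2A(2BC))D(2E)F" back into the original symbols.
    public static string Decode(string encoded)
    {
        int position = 0;
        string result = DecodeGroup(encoded, ref position);
        if (position != encoded.Length)
            throw new FormatException("Unexpected ')' at position " + position);
        return result;
    }

    // Decodes symbols and nested groups until the end of input or a ')'.
    private static string DecodeGroup(string encoded, ref int position)
    {
        var output = new StringBuilder();
        while (position < encoded.Length && encoded[position] != ')')
        {
            if (encoded[position] != '(')
            {
                output.Append(encoded[position++]); // plain symbol
                continue;
            }

            position++; // skip '('
            int count = 0;
            while (position < encoded.Length && encoded[position] >= '0' && encoded[position] <= '9')
                count = count * 10 + (encoded[position++] - '0');

            string body = DecodeGroup(encoded, ref position);
            if (position >= encoded.Length)
                throw new FormatException("Missing ')'");
            position++; // skip ')'

            for (int i = 0; i < count; i++)
                output.Append(body);
        }
        return output.ToString();
    }
}
```

Calling `NestedRunLength.Decode` on each of the three encoded outputs listed earlier should give back the corresponding input string.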
    
  • 2020-12-30 09:08

    Looking at the problem theoretically, it seems similar to the problem of finding the smallest context-free grammar that generates (only) the string, except that in this case the non-terminals can only be used in direct sequence after each other, e.g.:

    
    ABCBCABCBCDEEF
    s->ttDuuF
    t->Avv
    v->BC
    u->E
    
    ABABCDABABCD
    s->ABtt
    t->ABCD
    
    

    Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
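To make that size measure concrete, here is a small sketch that counts both quantities for the first example. The names `SizeMeasure`, `SymbolCount` and `TerminalCount` are hypothetical, and the sketch assumes the conventions used in the example: uppercase letters are terminals, lowercase letters are non-terminals, and rules are written as `x->rhs`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SizeMeasure
{
    // Counts the original symbols in a nested encoding,
    // ignoring parentheses and repeat counts.
    public static int SymbolCount(string encoded)
    {
        return encoded.Count(c => c != '(' && c != ')' && !(c >= '0' && c <= '9'));
    }

    // Counts the terminals (uppercase letters) on the right-hand side
    // of every rule written as "x->rhs".
    public static int TerminalCount(IEnumerable<string> rules)
    {
        return rules.Sum(rule =>
            rule.Substring(rule.IndexOf("->", StringComparison.Ordinal) + 2)
                .Count(c => char.IsUpper(c)));
    }
}
```

For the first example, `SymbolCount("(2A(2BC))D(2E)F")` and `TerminalCount` over the rules `s->ttDuuF`, `t->Avv`, `v->BC`, `u->E` both come to 6.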

    The smallest grammar problem is well studied and is known to be NP-hard. I don't know how much the "direct sequence" restriction adds to or subtracts from the complexity.
