Twitter text compression challenge

一生所求 2021-02-06 08:06

Rules

  1. Your program must have two modes: encoding and decoding.
  2. When encoding:

    1. Your p
4 answers
  •  你的背包
    2021-02-06 08:09

    Not sure if I'll have the time/energy to follow this up with actual code, but here's my idea:

    • Any arbitrary Latin-1 string under a certain length could simply be encoded (not even compressed), with no loss, into 140 characters. The naive estimate is 280 characters (two 8-bit characters packed into each 16-bit code point), although with the code point restrictions in the contest rules, it's probably a little shorter than that.
    • Strings slightly longer than the above length (let's guesstimate between 280 and 500 characters) can most likely be shrunk, using standard compression techniques, into a string short enough to allow the above encoding.
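
    A minimal sketch of these two lossless stages, under assumptions that are mine rather than the contest's: the only forbidden code points are the UTF-16 surrogates (the real restrictions may be stricter), a one-byte flag marks whether zlib was used, and the names pack_bytes, unpack_bytes, encode_lossless and decode_lossless are hypothetical.

```python
import zlib

# Assumption: only surrogates (U+D800..U+DFFF) are off limits. Pair values at or
# above 0xD800 are shifted up by 0x800 to step over them, and a trailing lone
# byte is stored above the shifted range so it cannot be confused with a pair.
LONE_BYTE_BASE = 0x10800


def pack_bytes(data: bytes) -> str:
    """Pack two bytes into each code point (plus one extra for an odd byte)."""
    out = []
    for i in range(0, len(data) - 1, 2):
        value = (data[i] << 8) | data[i + 1]
        if value >= 0xD800:
            value += 0x800  # skip the surrogate block
        out.append(chr(value))
    if len(data) % 2:
        out.append(chr(LONE_BYTE_BASE + data[-1]))
    return "".join(out)


def unpack_bytes(packed: str) -> bytes:
    """Inverse of pack_bytes."""
    out = bytearray()
    for ch in packed:
        cp = ord(ch)
        if cp >= LONE_BYTE_BASE:
            out.append(cp - LONE_BYTE_BASE)
        else:
            if cp >= 0xE000:
                cp -= 0x800
            out += bytes((cp >> 8, cp & 0xFF))
    return bytes(out)


def encode_lossless(text: str, limit: int = 140):
    """Try raw packing, then zlib + packing; return None if neither fits."""
    raw = text.encode("latin-1")  # raises if text is not Latin-1
    for flag, payload in ((b"\x00", raw), (b"\x01", zlib.compress(raw, 9))):
        packed = pack_bytes(flag + payload)
        if len(packed) <= limit:
            return packed
    return None


def decode_lossless(packed: str) -> str:
    data = unpack_bytes(packed)
    payload = data[1:]
    if data[:1] == b"\x01":
        payload = zlib.decompress(payload)
    return payload.decode("latin-1")
```

    With the flag byte, raw packing holds 279 Latin-1 characters instead of 280. How far zlib stretches that depends entirely on how repetitive the input is.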

    Anything longer than that, and we start to lose information in the text. So execute the minimum number of the following steps to reduce the string to a length that can then be compressed/encoded using the above methods. Also, don't perform these replacements on the entire string if performing them on just a substring will make it short enough (I would probably walk through the string backwards).

    1. Replace all Latin-1 characters above 127 (primarily accented letters and funky symbols) with their closest equivalent in non-accented alphabetic characters, or possibly with a generic symbol replacement like "#"
    2. Replace all uppercase letters with their equivalent lowercase form
    3. Replace all non-alphanumerics (any remaining symbols or punctuation marks) with a space
    4. Replace all numbers with 0
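
    The four steps above, applied from the end of the string and only as far as needed, could be sketched as follows. Several things here are my assumptions: the "closest equivalent" is taken from Unicode NFKD decomposition, "replace all numbers with 0" is read as replacing each digit, and fits is a hypothetical callback that reports whether a candidate is short enough for the lossless stage.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Step 1: map each character above 127 to one ASCII character, or '#'."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        ascii_form = (
            unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")
        )
        # Keep one character so the step never lengthens the string.
        out.append(ascii_form[:1] or "#")
    return "".join(out)


def lowercase(text: str) -> str:
    """Step 2."""
    return text.lower()


def symbols_to_space(text: str) -> str:
    """Step 3: anything that is not an ASCII letter, digit or space."""
    return re.sub(r"[^A-Za-z0-9 ]", " ", text)


def digits_to_zero(text: str) -> str:
    """Step 4, read as 'each digit becomes 0' (an assumption)."""
    return re.sub(r"[0-9]", "0", text)


STEPS = [strip_accents, lowercase, symbols_to_space, digits_to_zero]


def reduce_until(text: str, fits) -> str:
    """Apply each step to a growing suffix and stop as soon as fits() is true."""
    if fits(text):
        return text
    for step in STEPS:
        for k in range(1, len(text) + 1):
            cut = len(text) - k
            candidate = text[:cut] + step(text[cut:])
            if fits(candidate):
                return candidate
        text = step(text)  # the whole string needed this step; try the next one
    return text
```

    Every step works one character at a time, so applying it to a suffix gives the same result for that suffix as applying it to the whole string.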

    Ok, so now we've eliminated as many excess characters as we can reasonably get rid of. Now we're going to do some more dramatic reductions:

    1. Replace all double-letters (balloon) with a single letter (balon). Will look weird, but still hopefully decipherable by the reader.
    2. Replace other common letter combinations with shorter equivalents (CK with K, WR with R, etc)
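
    A rough sketch of these two reductions, assuming the text has already been lowercased. Only the CK and WR pairs come from the list above; any other entry someone adds to the table would be their own choice, and the function names are hypothetical.

```python
import re

# Only the two combinations named above; extend at your own risk.
COMBINATIONS = [("ck", "k"), ("wr", "r")]


def squeeze_doubles(text: str) -> str:
    """Collapse any run of the same letter into a single letter."""
    return re.sub(r"([A-Za-z])\1+", r"\1", text)


def shorten(text: str) -> str:
    """Squeeze doubled letters, then apply the combination table in order."""
    text = squeeze_doubles(text)
    for long_form, short_form in COMBINATIONS:
        text = text.replace(long_form, short_form)
    return text


print(shorten("wreck the balloon"))  # rek the balon
```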

    Ok, that's about as far as we can go and have the text be readable. Beyond this, let's see if we can come up with a method so that the text will resemble the original, even if it isn't ultimately decipherable (again, perform this one character at a time from the end of the string, and stop when it is short enough):

    1. Replace all vowels (aeiouy) with a
    2. Replace all "tall" letters (bdfhklt) with l
    3. Replace all "short" letters (cmnrsvwxz) with n
    4. Replace all "hanging" letters (gjpq) with p

    This should leave us with a string consisting of exactly 5 possible values (a, l, n, p, and space), which should allow us to encode pretty lengthy strings.
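
    To get a feel for how far five symbols stretch, here is a sketch that maps text to the five shapes and packs six of them into each code point (5^6 = 15625 values). The choices are mine, not part of the scheme above: anything that is not a letter (including the "0" digits) becomes a space, the output is offset into the CJK block starting at U+4E00 so that every character is printable, and the last group is padded with spaces.

```python
SYMBOLS = "alnp "  # index in this string is the base-5 digit
GROUP = 6          # 5**6 = 15625 values per code point
BASE_CODE_POINT = 0x4E00  # assumption: CJK ideographs are allowed in the output

SHAPE_TABLE = str.maketrans(
    "aeiouy" + "bdfhklt" + "cmnrsvwxz" + "gjpq",
    "a" * 6 + "l" * 7 + "n" * 9 + "p" * 4,
)


def to_shape(text: str) -> str:
    """Map letters to a/l/n/p and everything else to a space."""
    mapped = text.lower().translate(SHAPE_TABLE)
    return "".join(ch if ch in SYMBOLS else " " for ch in mapped)


def pack5(shape: str) -> str:
    """Pack six shape symbols per code point, padding the tail with spaces."""
    shape += " " * (-len(shape) % GROUP)
    out = []
    for i in range(0, len(shape), GROUP):
        value = 0
        for ch in shape[i:i + GROUP]:
            value = value * 5 + SYMBOLS.index(ch)
        out.append(chr(BASE_CODE_POINT + value))
    return "".join(out)


def unpack5(packed: str) -> str:
    """Inverse of pack5 (returns the padded shape string)."""
    out = []
    for ch in packed:
        value = ord(ch) - BASE_CODE_POINT
        group = []
        for _ in range(GROUP):
            value, digit = divmod(value, 5)
            group.append(SYMBOLS[digit])
        out.append("".join(reversed(group)))
    return "".join(out)


print(to_shape("hello world"))  # lalla nanll
```

    Under these assumptions, 140 code points hold 840 shape symbols. Packing the digits as one large base-5 number across all the allowed code points would fit somewhat more, at the cost of simplicity.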

    Beyond that, we'd simply have to truncate.

    Only other technique I can think of would be to do dictionary-based encoding, for common words or groups of letters. This might give us some benefit for proper sentences, but probably not for arbitrary strings.
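
    One way the dictionary idea could look, as a sketch only: once the characters above 127 have been replaced, those byte values are free to stand for whole words. The word list below is a made-up placeholder (a real one would be chosen from frequency data), the function names are hypothetical, and the contest rules may not allow the code points U+0080 and up that it uses.

```python
# Hypothetical word list; each word gets one otherwise unused code above 127.
WORDS = ["the", "and", "that", "have", "for", "not", "with", "you", "this", "but"]
ENCODE = {word: chr(0x80 + i) for i, word in enumerate(WORDS)}
DECODE = {code: word for word, code in ENCODE.items()}


def dict_encode(text: str) -> str:
    """Replace whole words found in WORDS with a single character."""
    if not text.isascii():
        raise ValueError("expects text already reduced to ASCII")
    return " ".join(ENCODE.get(word, word) for word in text.split(" "))


def dict_decode(text: str) -> str:
    """Inverse of dict_encode."""
    return " ".join(DECODE.get(word, word) for word in text.split(" "))
```

    On "you have the cat and the dog" this saves 11 of 28 characters. On arbitrary strings that contain none of the listed words it saves nothing, which matches the caveat above.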
