Twitter text compression challenge

一生所求 2021-02-06 08:06

Rules

  1. Your program must have two modes: encoding and decoding.
  2. When encoding:

    1. Your p
4 Answers
  •  臣服心动
    2021-02-06 08:16

    Here is my variant for actual English.

    Each code point has something like 1,100,000 possible states. Well, that's a lot of space.

    So, we stem all of the original text and look up WordNet synsets for it. Numbers are converted to their English names ("forty two"). 1.1M states will allow us to hold a synset ID (which can be between 0 and 82114), a position inside the synset (~10 variants, I suppose) and a synset type (one of four: noun, verb, adjective, adverb). We may even have enough space to store the original form of the word (such as a verb tense ID).
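
A minimal sketch of how such fields could be folded into one integer with mixed-radix packing. The function names (`pack`, `unpack`, `capacity`) and the sample values are hypothetical, and the field sizes are simply the estimates quoted above. One caveat the sketch makes visible: taken literally, 82,115 × 10 × 4 = 3,284,600 states, which is more than the 1,114,112 Unicode code points, so the fields would need trimming (for example, fewer positions per synset) to fit in a single code point.

```python
UNICODE_CODE_POINTS = 0x110000  # 1,114,112 values, U+0000..U+10FFFF


def capacity(radices):
    """Number of distinct states a tuple of fields with these sizes can take."""
    total = 1
    for radix in radices:
        total *= radix
    return total


def pack(fields, radices):
    """Fold field values into one integer; each value must be < its radix."""
    if len(fields) != len(radices):
        raise ValueError("fields and radices must have the same length")
    n = 0
    for value, radix in zip(fields, radices):
        if not 0 <= value < radix:
            raise ValueError(f"value {value} out of range for radix {radix}")
        n = n * radix + value
    return n


def unpack(n, radices):
    """Inverse of pack: recover the field values from the packed integer."""
    fields = []
    for radix in reversed(radices):
        n, value = divmod(n, radix)
        fields.append(value)
    return tuple(reversed(fields))


# Assumed field sizes: synset ID, position inside synset, synset type.
RADICES = (82115, 10, 4)

state = pack((4242, 3, 1), RADICES)  # hypothetical synset 4242, word 3, type 1
assert unpack(state, RADICES) == (4242, 3, 1)

# Budget check against one code point.
print(capacity(RADICES), capacity(RADICES) <= UNICODE_CODE_POINTS)
```

Turning the packed integer into an actual character would additionally have to skip the surrogate range U+D800..U+DFFF, which `chr()` values in that range cannot be encoded as valid UTF-8.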

    The decoder just feeds the synsets to WordNet and retrieves the corresponding words.

    Source text:

    A white dwarf is a small star composed mostly of electron-degenerate matter. Because a
    white dwarf's mass is comparable to that of the Sun and its volume is comparable to that 
    of the Earth, it is very dense.
    

    Becomes:

    A white dwarf be small star composed mostly electron degenerate matter because white
    dwarf mass be comparable sun IT volume be comparable earth IT be very dense
    

    (Tested with the online WordNet.) This "code" should take 27 code points. Of course, all "gibberish" like 'lol' and 'L33T' will be lost forever.
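
The 27 figure can be checked by counting the words of the reduced text, assuming one code point per remaining word. This is only a counting sketch; the variable names are hypothetical and no WordNet lookup is performed.

```python
# Reduced text from the example above, one word per code point (assumption).
reduced = (
    "A white dwarf be small star composed mostly electron degenerate "
    "matter because white dwarf mass be comparable sun IT volume be "
    "comparable earth IT be very dense"
)

tokens = reduced.split()
code_points_needed = len(tokens)
print(code_points_needed)  # 27
```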
