When encoding:
Not sure if I'll have the time/energy to follow this up with actual code, but here's my idea:
Anything longer than that, and we start to lose information in the text. So execute the minimum number of the following steps needed to reduce the string to a length that can then be compressed/encoded using the above methods. Also, don't perform these replacements on the entire string if performing them on just a substring will make it short enough (I would probably walk through the string backwards).
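To make the "minimum number of steps, starting from the end" idea concrete, here is a rough Python sketch. The two reductions in `STEPS` are placeholders made up for illustration, not the actual steps of this answer, and `shrink_from_end` and `limit` are hypothetical names.

```python
import re

# Placeholder reductions (assumptions for illustration only),
# ordered from least to most destructive.
STEPS = [
    lambda s: re.sub(r"\s+", " ", s),     # collapse runs of whitespace
    lambda s: re.sub(r"[^\w\s]", "", s),  # drop punctuation
]

def shrink_from_end(text, limit, steps=STEPS):
    """Apply as few steps as possible, each to the shortest suffix that works."""
    for step in steps:
        if len(text) <= limit:
            break
        # Walk backwards, growing the suffix the step is applied to.
        for start in range(len(text) - 1, -1, -1):
            candidate = text[:start] + step(text[start:])
            if len(candidate) <= limit:
                return candidate
        # Even the whole string was not enough: keep the result, try the next step.
        text = step(text)
    return text
```

The front of the string is left alone whenever reducing the tail is enough, which is the point of walking backwards. If no step gets the text under `limit`, the function returns the most reduced version it has.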
Ok, so now we've eliminated as many excess characters as we can reasonably get rid of. Now we're going to do some more dramatic reductions:
Ok, that's about as far as we can go and still have the text be readable. Beyond this, let's see if we can come up with a method so that the text will resemble the original, even if it isn't ultimately decipherable (again, perform this one character at a time from the end of the string, and stop when it is short enough):
This should leave us with a string consisting of exactly 5 possible values (a, l, n, p, and space), which should allow us to encode pretty lengthy strings.
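As a rough illustration of why a 5-symbol alphabet packs so well, the string can be read as a base-5 number. This is only a sketch: `pack` and `unpack` are hypothetical names, and the length is stored next to the number so that leading `a` symbols (digit 0) survive the round trip.

```python
ALPHABET = "alnp "  # the five symbols named above

def pack(text):
    """Read the text as base-5 digits; return (length, integer)."""
    value = 0
    for ch in text:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return len(text), value

def unpack(length, value):
    """Inverse of pack: peel off base-5 digits, last symbol first."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))
```

Each symbol costs log2(5), or about 2.32 bits, so the packed integer grows much more slowly than the text itself.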
Beyond that, we'd simply have to truncate.
Only other technique I can think of would be to do dictionary-based encoding, for common words or groups of letters. This might give us some benefit for proper sentences, but probably not for arbitrary strings.
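A toy version of the dictionary idea might look like the sketch below. The word list is invented for the example, and mapping each word to a code point in the Unicode Private Use Area (starting at U+E000) is just one assumed way of getting single-character tokens.

```python
# Hypothetical dictionary of common words (an assumption, not a tuned list).
COMMON = ["the", "and", "that", "this", "with"]
BASE = 0xE000  # start of the Unicode Private Use Area

ENCODE = {word: chr(BASE + i) for i, word in enumerate(COMMON)}
DECODE = {token: word for word, token in ENCODE.items()}

def encode(text):
    """Replace each dictionary word with its one-character token."""
    return " ".join(ENCODE.get(w, w) for w in text.split(" "))

def decode(text):
    """Turn tokens back into words; everything else passes through."""
    return " ".join(DECODE.get(w, w) for w in text.split(" "))
```

A proper sentence with a few dictionary words gets shorter, while an arbitrary string such as `xq zv` comes out unchanged, which matches the caveat above.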
#! perl
use strict;
use warnings;
use 5.010;

use Getopt::Long;
use Pod::Usage;
use autodie;

my %opts = (
    infile  => '-',
    outfile => '-',
);

GetOptions(
    'encode|e'    => \$opts{encode},
    'decode|d'    => \$opts{decode},
    'infile|i=s'  => \$opts{infile},
    'outfile|o=s' => \$opts{outfile},
    'help|h'      => \&help,
    'man|m'       => \&man,
);

# exactly one of --encode and --decode should be set
unless( $opts{encode} xor $opts{decode} ){
    help();
}

{
    # plain string comparisons instead of the experimental smartmatch (~~)
    my $infile;
    if( $opts{infile} eq '-' or $opts{infile} eq '&0' ){
        $infile = *STDIN{IO};
    }else{
        open $infile, '<', $opts{infile};
    }

    my $outfile;
    if( $opts{outfile} eq '-' or $opts{outfile} eq '&1' ){
        $outfile = *STDOUT{IO};
    }elsif( $opts{outfile} eq '&2' ){
        $outfile = *STDERR{IO};
    }else{
        open $outfile, '>', $opts{outfile};
    }

    if( $opts{decode} ){
        # decoding just copies the input to the output
        while( my $line = <$infile> ){
            chomp $line;
            say {$outfile} $line;
        }
    }elsif( $opts{encode} ){
        while( my $line = <$infile> ){
            chomp $line;
            # collapse each run of non-word characters (and underscores) into one space
            $line =~ s/[\W_]+/ /g;
            say {$outfile} $line;
        }
    }else{
        die 'How did I get here?';
    }
}

sub help{
    pod2usage();
}

sub man{
    # -verbose => 2 prints the full manual page
    pod2usage( -verbose => 2, -exitval => 0 );
}
__END__

=head1 NAME

sample.pl - Using Getopt::Long and Pod::Usage

=head1 SYNOPSIS

sample.pl [options] [file ...]

 Options:
   --help    -h   brief help message
   --man     -m   full documentation
   --encode  -e   encode text
   --decode  -d   decode text
   --infile  -i   input filename
   --outfile -o   output filename

=head1 OPTIONS

=over 8

=item B<--help>

Prints a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=item B<--encode>

Replaces each run of characters other than letters and digits with a single space.

=item B<--decode>

Just reads from one file, and writes to the other.

=item B<--infile>

Input filename. If this is '-' or '&0', then read from STDIN instead.
If you use '&0', you must pass it in with quotes.

=item B<--outfile>

Output filename. If this is '-' or '&1', then write to STDOUT instead.
If this is '&2', then write to STDERR instead.
If you use '&1' or '&2', you must pass it in with quotes.

=back

=head1 DESCRIPTION

B<This program> will read the given input file(s) and do something
useful with the contents thereof.

=cut
echo Hello, this is, some text | perl sample.pl -e
Hello this is some text
Here is my variant for actual English.
Each code point has something like 1,100,000 possible states. Well, that's a lot of space.
So, we stem all of the original text and get WordNet synsets from it. Numbers are converted into their English names ("forty two"). 1.1M states will allow us to hold the synset id (which can be between 0 and 82114), the position inside the synset (~10 variants, I suppose) and the synset type (which is one of four: noun, verb, adjective, adverb). We may even have enough space to store the original form of the word (like a verb tense id).
The decoder just feeds the synsets to WordNet and retrieves the corresponding words.
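For what it's worth, here is a small Python sketch of how the three fields could be packed into one number, using only the figures quoted above. The names are hypothetical and no real WordNet API is involved.

```python
STATES = 1_100_000   # rough states per code point, as quoted above
SYNSET_IDS = 82_115  # ids 0..82114
POSITIONS = 10       # the "~10 variants" guess
TYPES = 4            # noun, verb, adjective, adverb

def pack(synset_id, position, synset_type):
    """Mixed-radix packing of the three fields into one integer."""
    return (synset_id * POSITIONS + position) * TYPES + synset_type

def unpack(value):
    """Inverse of pack."""
    value, synset_type = divmod(value, TYPES)
    synset_id, position = divmod(value, POSITIONS)
    return synset_id, position, synset_type

# States needed if all three fields are stored independently.
needed = SYNSET_IDS * POSITIONS * TYPES
fits_in_one_code_point = needed <= STATES
```

With these numbers the naive product is 82,115 × 10 × 4 = 3,284,600, which is more than the roughly 1.1M states of a single code point. So this straightforward layout would not fit one word per code point. It would need a tighter numbering, or more than one code point per word. If the id alone already determined the type (an assumption), 82,115 × 10 = 821,150 would fit.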
Source text:
A white dwarf is a small star composed mostly of electron-degenerate matter. Because a
white dwarf's mass is comparable to that of the Sun and its volume is comparable to that
of the Earth, it is very dense.
Becomes:
A white dwarf be small star composed mostly electron degenerate matter because white
dwarf mass be comparable sun IT volume be comparable earth IT be very dense
(tested with the online WordNet). This "code" should take 27 code points. Of course, all "gibberish" like 'lol' and 'L33T' will be lost forever.
PAQ8O10T << FTW