Question
I know similar questions have been asked, but this is kind of a trivial case.
Given a text file encoded with a substitution cipher, I need to decode it using Python. I am not given any examples of correctly deciphered words. The relationship is 1-to-1 and case doesn't make a difference. Also, punctuation isn't changed and spaces are left where they are. I don't need help with the code as much as I need help with a general idea of how this could be done in code. My main approaches involve:
- Narrowing down the choices by first solving 1, 2 or 3 character words.
- I could use a list of English words of different sizes to compare.
- I could use frequency distributions of the letters.
Does anyone have an idea of a general approach I could take to do this?
Answer 1:
I would first get a list of English words for reference. Next construct a list of possible 2 and 3 letter words. Then just start testing those small words in your cipher. Once you guess at a small word, check the larger words against your word list. If some of the words no longer have possible completions in the list, you're on the wrong track. If a word only has one possible completion, accept it as correct and continue. Eventually, you'll either reach a solution where all words are in your English word list, or you'll reach a point where there is no solution for a word.
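A rough sketch of that idea, in case it helps. It assumes the ciphertext has already been lowercased and split into words, and that `words` is whatever English word list you loaded. The names `word_pattern`, `candidates`, `solve` and `decode` are made up for this illustration. One addition to the description above: words are also filtered by their repeated-letter pattern, which prunes the list quickly.

```python
import re


def word_pattern(word):
    """Describe which letters repeat, e.g. 'hello' -> (0, 1, 2, 2, 3)."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)


def candidates(cipher_word, words, mapping):
    """Dictionary words consistent with the cipher word and the partial mapping."""
    pattern = word_pattern(cipher_word)
    used = set(mapping.values())
    result = []
    for w in words:
        if len(w) != len(cipher_word) or word_pattern(w) != pattern:
            continue
        ok = True
        for c, p in zip(cipher_word, w):
            if c in mapping:
                if mapping[c] != p:      # contradicts an earlier guess
                    ok = False
                    break
            elif p in used:              # plain letter already taken (1-to-1)
                ok = False
                break
        if ok:
            result.append(w)
    return result


def solve(cipher_words, words, mapping=None):
    """Backtracking search; returns a cipher->plain dict or None."""
    mapping = mapping or {}
    options = {cw: candidates(cw, words, mapping) for cw in set(cipher_words)}
    if any(not opts for opts in options.values()):
        return None                      # some word has no completion: wrong track
    open_words = [cw for cw in options if any(c not in mapping for c in cw)]
    if not open_words:
        return mapping                   # every word decodes to a dictionary word
    # Work on the most constrained word first (often a short one).
    target = min(open_words, key=lambda cw: len(options[cw]))
    for guess in options[target]:
        trial = dict(mapping)
        trial.update(zip(target, guess))
        found = solve(cipher_words, words, trial)
        if found is not None:
            return found
    return None


def decode(text, mapping):
    """Apply the mapping; unmapped characters (spaces, punctuation) pass through."""
    return "".join(mapping.get(c, c) for c in text)


if __name__ == "__main__":
    words = {"the", "cat", "sat", "on", "a", "mat", "hat"}   # toy word list
    cipher = "uif dbu tbu po b nbu"
    mapping = solve(re.findall(r"[a-z]+", cipher), words)
    print(decode(cipher, mapping))
```

With a toy word list like this one, several mappings can fit equally well, so the first solution found is not necessarily the intended plaintext. A real word list and a longer text narrow that down considerably.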
Answer 2:
I wrote something like this for when Haley's speech was all garbled. It wasn't automagic though; it made guesses based on etaoinshrdlu (the most frequently used letters in English, sorted most to least) and let the user interactively change the meaning of a given ciphertext letter.
So it would show you something like:
t0is is a 12eat 34556e!
and you'd manually guess what letter each number represented until you had something legible.
The advantage of this approach is that it can tolerate typos. If your encryptor makes any errors (or uses any words not in your dictionary in the plaintext), a purely dictionary-based solver may leave you with an unsolvable puzzle.
That said, spell checkers have great lists of English words. I used the one in Debian's dictionaries-common package for my hangman solver.
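For illustration, here is a minimal sketch of the frequency-guess-plus-manual-override idea. It is not the original tool: the function names are invented here, only the twelve letters of etaoinshrdlu are guessed, and unguessed letters are shown as `_` rather than as numbers.

```python
import string
from collections import Counter

ENGLISH_ORDER = "etaoinshrdlu"  # most frequent English letters, most to least


def frequency_guess(ciphertext):
    """Map the most common cipher letters onto ENGLISH_ORDER, rank by rank."""
    counts = Counter(c for c in ciphertext.lower() if c in string.ascii_lowercase)
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))


def render(ciphertext, mapping):
    """Show the current guess; letters without a guess appear as '_'."""
    return "".join(
        mapping.get(c, "_") if c in string.ascii_lowercase else c
        for c in ciphertext.lower()
    )


if __name__ == "__main__":
    text = "xxxx yyy zz w!"
    guess = frequency_guess(text)
    print(render(text, guess))
    guess["w"] = "i"          # a manual correction by the user
    print(render(text, guess))
```

In an interactive version you would loop: print the rendered text, read a "cipher letter = plain letter" correction from the user, update the dict, and repeat until the text is legible.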
Answer 3:
You could try this approach:
Store a list of valid words (in a dictionary) and a "normal" letter distribution for your language (in a list).
Calculate the distribution of the letters in the garbled text.
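Compare your garbled distribution with the normal one and re-substitute the letters of your text accordingly (most frequent cipher letter becomes the most frequent English letter, and so on).
Repeat: Initialize an array (rank) that maps each of the 26 letters to a float, starting at zero (rank('A')=rank('B')=...=rank('Z')=0.0).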
Check the words in the produced text against words in the dictionary. If a word is in the dictionary, raise the rank of that word's letters (something like: add a standard value, say 1.0). In other words calculate Score (a function of total rank and number of words in dictionary).
Save text into High score table (if score high enough).
If all words are in the dictionary or if the total rank is high enough or if the loop was done more than 10000 times, End.
If not, choose two letters at random and swap them, but with a skewed distribution: letters with a high rank should be less likely to be swapped.
Repeat.
End: Print High score texts.
The procedure resembles simulated annealing.
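A simplified sketch of the loop above, under a few assumptions of mine: the score is just the fraction of decoded words found in the word list, swaps are chosen uniformly (no per-letter rank weighting), only the single best key is kept rather than a high score table, and the full frequency order in `ENGLISH_ORDER` is one commonly quoted approximation. Strictly speaking this is plain hill climbing; simulated annealing would also accept some worse swaps with a probability that shrinks over time.

```python
import random
import re
import string
from collections import Counter

# Approximate English letter order, most to least frequent (an assumption).
ENGLISH_ORDER = "etaoinshrdlucmfwypvbgkjqxz"


def initial_key(ciphertext):
    """Start from the key suggested by letter frequencies alone."""
    counts = Counter(c for c in ciphertext if c in string.ascii_lowercase)
    ranked = [c for c, _ in counts.most_common()]
    ranked += [c for c in string.ascii_lowercase if c not in counts]
    return dict(zip(ranked, ENGLISH_ORDER))


def decode(ciphertext, key):
    """Apply the key; spaces and punctuation pass through unchanged."""
    return "".join(key.get(c, c) for c in ciphertext)


def score(plaintext, words):
    """Fraction of tokens that are dictionary words (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z]+", plaintext)
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)


def hill_climb(ciphertext, words, max_steps=10000, seed=None):
    """Randomly swap two plaintext assignments; keep swaps that do not hurt."""
    rng = random.Random(seed)
    ciphertext = ciphertext.lower()
    key = initial_key(ciphertext)
    best = score(decode(ciphertext, key), words)
    for _ in range(max_steps):
        if best == 1.0:                  # every word is in the dictionary
            break
        a, b = rng.sample(string.ascii_lowercase, 2)
        key[a], key[b] = key[b], key[a]
        trial = score(decode(ciphertext, key), words)
        if trial >= best:
            best = trial
        else:
            key[a], key[b] = key[b], key[a]   # undo a swap that made things worse
    return key, best
```

A word-fraction score is coarse, so on short texts this can stall. Scoring by letter-pair or letter-triple frequencies is a common refinement, as is restarting from several random keys and keeping the best result.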
Source: https://stackoverflow.com/questions/5587280/solving-a-substitution-cipher-with-python