decoding algorithm wanted

前端 未结 5 1275
礼貌的吻别
礼貌的吻别 2021-01-21 06:55

I receive encoded PDF files regularly. The encoding works like this:

  • the PDFs can be displayed correctly in Acrobat Reader
  • select all and copy the test vi
5条回答
  •  走了就别回头了
    2021-01-21 07:32

    Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?

    If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.

    Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.

    The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude them in your calculation, or better find a chart with them in it. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.

    Edit - saw comment about character swapping

    In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.

    Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.

提交回复
热议问题