|
There hasn't been at lot of published information on the analysis of the d'Agapeyeff cipher. The most systematic approach is possibly the 1978 four page article that surfaced in the Cryptologia[1] Journal. This small and engaging article was reprinted, a few years later, in book form[2]. On the other hand, there are a few scarse sources of information scattered on the internet. The wikipedia article on the cipher, for example, doesn't have much information of use on the analysis of the cipher. However it has some references to a few pages that comment on the cipher. It has also been reported [3, 4] that 2 articles were published on The Cryptogram Magazine[9, 10], on 1952 and 1959 respectively, though I've been unable to track them (Please note this is no longer true. A visitor has kindly sent me a copy of the two cryptogram articles. Please see the last section of this page for a comment on those).
A note to the readers: Most of the the facts and assumptions listed on this page have already been noted elsewhere from the references mentioned. I make no claim to have discovered them unless otherwise stated. In order to differentiate what I believe are original observations based on my own analysis, I represent them in a different color so that someone familiar with the cipher can skip the other parts and jump straight into my angle of view. Also note that when I say original I mean I haven't seen it published before, not that nobody has already thought of that. Also, all calculations were independently done by me so please leave me a note if you notice any mistakes. Cipher Analysis
The cipher itself is presented in 5 digit groups, for a total of 395 digits:
The patristocrat form of 5 digit grouping is commonly used to facilitate the transmission of telegraphic data and usualy bears no special meaning on the cipher construction. As noted on d'Agapeyeff work:
It is however, immediately visible to the alert reader that the numbers, when read by rows, alternate in pairs of digits where the first is either a 6, 7, 8, 9 or 0 and the second 1, 2, 3, 4 or 5. The only exception to this rule are the last three zeros at the end of the cryptogram which are probably nulls used to fill the last 5 digit group. d'Agapeyeff himself recommends on his book that this filling is completed:
Based on the assumptions made, we can now reformat the cipher in 2 letter groups and removing the trailing 0's at the end. This yields a total of 392 digits (196 pairs).
The 6-0 and 1-5 grouping for first and second digits in the pairs is highly suggestive of the usage of a 5x5 Polybius Square for substitution. This yields a 25 letter set available for the plaintext, where I/J or U/V are usually coalesced to fit the whole English alphabet (d'Agapeyeff uses the I/J merge in all examples of his book). The filling of the square may be done either randomly or keyed at the beginning with a word or phrase, omitting eventually repeated letters and filling the rest of the square with unused letters in aphabetical order.
Doing a basic frequency count on the 196 pairs we have the following:
Frequency distribution of pairs
And representing the frequency of each pair in a Polybius square we have
Computing the Index of Coincidence (IC) for the pairs with different periods we have (for the first 19)
Independently of the period, the IC is suggestive of a simple monoalphabetic substitution (~0,067 for English text when compared to the expected ~0,0385 of a random distribution).
Even though it is conceivable the substitution could have been manipulated so to elude a cryptanalyst (more on this later), there is no way the substitution is polyalphabetic with such a distribution.
Substituting the pairs with a random alphabet and running it with through a stochastic heuristic algorithm gives no coherent plaintext. It must be noticed the used algorithm has been tested successfuly taking about 3-5 seconds to find the correct key with ciphertexts a quarter the size of the d'Agappeyef cipher. Needless to say, if it was indeed a simple monoalphabetic substitution the cipher would have been successfully solved long ago.
Assuming the substitution isn't the only factor involved, and as it was noted on [Barker](1), the 196 pairs of digits can be arranged in a 14x14 matrix. This may suggest some sort of a transposition has been applied.
Before we move on though. Although the 14x14 matrix provides a tempting square matrix, it should also be noted that other possible rectangular matrices are viable and should not be disconsidered - 28x7 and 7x28. Even though there is nothing prohibiting the use of an incomplete columnar transposition, such a method combined with substitution would be nearly impossible to find which would be odd as a challenge to the reader. As we'll note, the combination of a substitution with a regular transposition can in itself be very challenging when not enough ciphertext is available. Lets rearrange the ciphertext as is commonly suggested, in a 14x14 matrix.
75 62 82 85 91 62 91 64 81 64 91 74 85 84
64 74 74 82 84 83 81 63 81 81 74 74 82 62
64 75 83 82 84 91 75 74 65 83 75 75 75 93
63 65 65 81 63 81 75 85 75 75 64 62 82 92
85 74 63 82 75 74 83 81 65 81 84 85 64 85
64 85 85 63 82 72 62 83 62 81 81 72 81 64
63 75 82 81 64 83 63 82 85 81 63 63 63 04
74 81 91 91 84 63 85 84 65 64 85 65 62 94
62 62 85 91 85 91 74 91 72 75 64 65 75 71
65 83 62 64 74 81 82 84 62 82 64 91 81 93
65 62 64 84 84 91 83 85 74 91 81 65 72 74
83 83 85 82 83 64 62 72 62 65 62 83 75 92
72 63 82 82 72 72 83 82 85 84 75 82 81 83
72 84 62 82 83 75 81 64 75 74 85 81 62 92
Now, what I believe is one of the most interesting facts mentioned by [Barker](1), is that all of the pairs that surface only 3 times or less in the ciphertext (that is 71, 92, 93, 94 and 04), all appear only on the last column. This fact is statisticaly too relevant to be ignored and should be given careful thought. Also, the only pair that has a 0 digit (04) stands exactly halfway minus 1 through the ciphertext when read by rows.
75 62 82 85 91 62 91 64 81 64 91 74 85 84
64 74 74 82 84 83 81 63 81 81 74 74 82 62
64 75 83 82 84 91 75 74 65 83 75 75 75 93
63 65 65 81 63 81 75 85 75 75 64 62 82 92
85 74 63 82 75 74 83 81 65 81 84 85 64 85
64 85 85 63 82 72 62 83 62 81 81 72 81 64
63 75 82 81 64 83 63 82 85 81 63 63 63 04
74 81 91 91 84 63 85 84 65 64 85 65 62 94
62 62 85 91 85 91 74 91 72 75 64 65 75 71
65 83 62 64 74 81 82 84 62 82 64 91 81 93
65 62 64 84 84 91 83 85 74 91 81 65 72 74
83 83 85 82 83 64 62 72 62 65 62 83 75 92
72 63 82 82 72 72 83 82 85 84 75 82 81 83
72 84 62 82 83 75 81 64 75 74 85 81 62 92 Regarding this fact, it seems appropriate to say that either a 14x14 or 7x28 (columns x rows) is used, in order to preserve the statistical oddity of leaving the low frequency characters on the last column only. A 28x7 grid would interleave the values between the 14th and 28th columns respectively.
As far as I've been able to research, the observations on the cipher stop here, with some [3,4] suggesting there is a columnar transposition and substitution that must be solved simultaneously. [Matthews][3] also offers some estimation of the search space for a brute force attack assuming we're dealing with a straight out-of-the-box substitution followed by a transposition. We'll take a break to recalculate the possible set of solutions so as to clarify why I don't agree with Matthews calculations.
Solution Space
For a permutation of the 14 columns you face a 14! (14 factorial) solution space, which is precisely 87.178.291.200 possible transpositions. This alone, at 10 transpositions per second would surmount 276 years of computation for the permutation alone.
On the other side, the substitution search space is far greater than the permutation. Matthews used the binomial coefficient which gives us all unordered combinations of the possible 25 letter alphabets given by the 5x5 Polybius square. But this still does not equate to the full solution space of the substitution phase because you'd still have to try all possible 18! permutations for each of the solutions found. Starting from [Matthews][3] approach we have:
where n is the number of characters of our alphabet and k the number of different digit pairs present in the cipher (18 distinct pairs). Thus giving us a set of 480.700 (nearly 5 x 10 ^ 5) unordered possible alphabets. Multiplying this by the 18! possible permutations of each alphabet, we end up having a search space exceeding 3 x 10 ^ 21 for the substitution phase. A way to calculate the k-permutations (a.k.a. sequences without repetition) directly is using the falling factorial, represented by:
Multiplying both possible solution sets (substitution and permutation solution spaces) gives a grand total of more than 286 x 10 ^ 30 (that's 286 followed by 30 zeros) trials to sweep the whole set of possibilities. Naturaly this is unfeasible even with parallel and distributed computing, so other approaches need to be taken in account.
Since it is impossible to follow a brute-force approach we should therefore try to look at the problem from other angles. It should however be noticed that although the solution space of either the substitution or the transposition are unfeasible, both types of encryption are trivial to solve (for the given length of 196 characters) when used alone. From my experience with stochastic combinatorial optimization methods like random restart hill-climbing and simulated annealing, both methods (when not combined) can be solved in less than a second to a few seconds in a vanilla desktop. I reserve the details on this methods to a future section of the site as to avoid encumbering this page with the technicalities, but suffice to say other methods in this class e.g. genetic algorithms have a good potential to solve this problems. The problem with optimization is usualy not the algorithm itself but the fitness function. While it is relatively easy to model a scoring mechanism when optimizing for a single goal (solving a substitution or permutation), e.g. by scoring the frequency of trigraphs or tetragraphs, this approach, however, breaks down when both types are combined.
Some further remarks
Geting back to the cipher, it is useful to keep track of the assumptions so far:
Assumption #1: The last three '0's of the cipher are 'nulls' to complete the last 5 digit group
Assumption #2: We're dealing with a digraphic cipher where each pair of digits is a coordinate in a 5x5 substitution matrix (I/J sharing the same coordinate)
Assumption #3: The 196 coordinates may be distributed on a 14x14 matrix where a transposition is applied
Let's follow up with a few more assumptions I believe may be worth pursuing. The low frequency pairs that only appeared on the last column (3 x 92, 2 x 93 and 1 x 94) are not only curious because of the fact that are no used in the other columns but also for their adjency in the Polybius square. This may be indicative that in fact this are also nulls used to fill the 14x14 matrix. This observation is coherent with d'Agapeyeff description of the type of characters that should be used as nulls - unusual letters so as to not cause perturbation on the meaning of the cipher.
A second observation. If we're dealing with a columnar transposition of the matrix, it is usual to write the cipher text by column before applying the transposition.
In other words, our matrix should be transposed:
75 64 64 63 85 64 63 74 62 65 65 83 72 72 62 74 75 65 74 85 75 81 62 83 62 83 63 84 82 74 83 65 63 85 82 91 85 62 64 85 82 62 85 82 82 81 82 63 81 91 91 64 84 82 82 82 91 84 84 63 75 82 64 84 85 74 84 83 72 83 62 83 91 81 74 72 83 63 91 81 91 64 72 75 91 81 75 75 83 62 63 85 74 82 83 62 83 81 64 63 74 85 81 83 82 84 91 84 85 72 82 64 81 81 65 75 65 62 85 65 72 62 74 62 85 75 64 81 83 75 81 81 81 64 75 82 91 65 84 74 91 74 75 64 84 81 63 85 64 64 81 62 75 85 74 74 75 62 85 72 63 65 65 91 65 83 82 81 85 82 75 82 64 81 63 62 75 81 72 75 81 62 84 62 93 92 85 64 04 94 71 93 74 92 83 92 Now, if the assumptions are right, then these columns (with the green highlighted characters) should be adjacent to each other and stand on the last columns of the matrix. Note that no comments are made as of the order they should be placed.
The '04' pair, being the only one having a '0' in the first coordinate may also be indicative of a final termination mark (e.g. an 'X' character after the substitution). It is also strange that the only other character which appear only once '71' also appears on the last row though i'd rather not include it in the probable nulls, for now.
So, we complete the assumption #3 and keep track of a new one:
Assumption #3: The 196 coordinates may be distributed on a 14x14 matrix filled by columns where a transposition is applied
Assumption #4: the low frequency pairs on the last 2 rows of the polybius square are respectively nulls to allow the complete filling of the 14x14 matrix (92, 93 and 94) and a possible termination character for the message (04).
In this case, then d'Agapeyeff would be similar to the ADFG(V)X cipher but with diferent indices for the Polybius square and the transposition being done to the full pairs and not to individual coordinates.
With this assumptions it is possible to break the cipher into two manageable parts (7x14). A 7 column columnar permutation is brute forcible and allows a stochastic optimization method to be used on all resulting ciphertext under a few minutes. I've tried this approach without much success though. However, it should also be noted that if we consider the '92', '93' and '94' coordinate pairs as nulls and '04' as a termination character, then the unique number of characters, assuming a monoalphabetic substituion, is only 14. Trying to build a 196 character message in English with only 14 characters would however deviate from the normal distribution of letters and thus easily fool the scoring mechanism used (log tetragraphs). It should also be noticed that breaking the cipher in two parts also affects the plaintext scoring algorithm in two other ways: You need to score 7 characters at a time (instead of a continuous 196 tetragraph run) and shorten the available ciphertext in half.
A run of the above method with a sample ciphertext created from known plaintext extracted from the first chapter of "Alice in the wonderland" (Carroll, Lewis) has been able to find the correct permutation+substitution on the top 5 of the highest scored solutions (for more information please refer to my Attack strategies page). I find that this method deserve a few more tests bearing possible improvements, before puting it aside.
It is also worthwile to consider that if we include the '71' in the possible nulls, then we are left with 13 different unique characters in the message (half-alphabet). If each of this 13 characters stands for 2 digits then the observable distribuition for monoalphabetic substitution would be mantained while completely destroying any attacks based on simple substitution.
The Cryptogram articles (updated April 11)
On April 4th, a visitor sent me an email with a copy of the two cryptogram articles I had trouble to find. I must say I was pleasantly surprised to read them because even though they were quite short, they did have a few comments worth of interest and some assumptions very much like some I had came up by myself. I wil not reproduce the articles because of copyright issues but allow me to summarize a few aspects (namely the details which have not been commonly reproduced elsewhere):
Both articles were written by the same person who signed them with the pen-name AB STRUSE.
1952 Article "THE d'AGAPEYEFF CRYPTOGRAM: A CHALLENGE" [9]
- the author offered a reward for the solution of the cipher
- it is suggested the 04 duplet might have been a typographical error or a null
- the reader is called to notice that right before the atypical 04 the pair 63 is repeated three times consecutively
- another suggestion points out that the cipher might be homophonic (a single pair representing more than one plaintext letter)
1959 Article "D'AGAPEYEFF CIPHER POSTSCRIPT" [10] - the cipher was still unsolved in spite of the reward, and only two individuals corresponded with the editor
- first correspondent suggested a 4-square was used
- second correspondent, an unnamed army capitain, made the observation that eight of the numbers might be nulls and only 13 pairs might actualy correspond to the 26 alphabet letters
- the editor also makes some notes on the trigraph frequencies and repetitions
Of all the above what surprised me was the army captain's observations which are very much in line with my own suspicions - a transposition substitution cipher on a 14x14 grid and several nulls at the end. Also, the possibility of an homophonic cipher where each pair represents 2 letters of the alphabet.
Page created: March 20, 2010 by Tiago Rodrigues Update History: April 11, 2010 References:
(1) Wayne G. Barker, The Unsolved d'Agapeyeff Cipher, Cryptologia, Vol. 2, No. 2, pp. 144-147
(2) C. A. Deavours, D. Kahn, L. Kruh, G. Mellen, and B. Winkel, Eds. 1989 Cryptology: Machines, History and Methods. Artech House, Inc., pp. 476-479
(3) Matthews, Robert, Notes on the D'Agapeyeff cipher, http://robertmatthews.org/Dagapeyeff.html
(4) Pelling, Nick, The d'Agapeyeff Cipher..., http://www.ciphermysteries.com/2008/05/11/the-dagapeyeff-cipher
(5) Pelling, Nick, The d'Agapeyeff Cipher, revisited..., http://www.ciphermysteries.com/2008/05/14/the-dagapeyeff-cipher-revisited
(7) d'Agapeyeff, Alexander, Codes and Ciphers - A History of Cryptography, (London: Oxford University Press, 1939)
(8) Pelling, Nick, The d'Agapeyeff Cipher, continued..., http://www.ciphermysteries.com/2008/05/26/the-dagapayeff-cipher-continued |




