How to Break a Code (Not a Cipher)

Students of cryptography are often impressed by the way a cryptogram can be deciphered without a key, as demonstrated in Edgar Allan Poe's The Gold Bug (see also another article, which takes Cornwallis' cipher as an example). But this is actually breaking a cipher (as opposed to a code). With a cipher, which turns letters in a message into different letters, the frequency of occurrence and other characteristics of the 26 letters of the alphabet allow one to methodically identify the original letters represented by the letters in a ciphertext.

On the other hand, code groups represent whole words or names. Although there are some known characteristics of words (e.g., "the" is the most frequently used word in English), characteristics among tens of thousands of words do not give as clear a clue as those among the 26 letters of the alphabet. The present article describes some examples of breaking such a code as opposed to a cipher.


Partial Encoding

Cipher in Code

German Code (1918)

Mansfield Dictionary Code (1930s)

Commercial Code

Statistical Analysis

Codebreaking by Espionage

Partial Encoding

When only some particular words are encoded, guessing the words referred to is a matter of knowing the background. Historians such as Coxe (cf. another article), Bergenroth (cf. another article), and others have done so. It would have been only slightly more difficult for contemporaries to do the same.

Examples during the Peninsular War

Three examples of careless encoding are given in A History of the Peninsular War, Volume V (Internet Archive), p. 611, Appendix XV.

When Dorsenne wrote to Jourdan on 16 April 1812 "Vous voulez de renseignement sur la situation militaire et administrative de 1238", it was easy to guess the code number stood for "the Army of the North."

When Joseph, Napoleon's brother made King of Spain, wrote to Marmont "J'ai donné l'ordre au général Treillard de 117.8.7 la vallée de 1383, afin de marcher à 498.", the situation indicated that 117.8.7 was almost certainly "evacuate", 1383 "Tagus", and 498 some large town. The last still baffled Wellington, who wrongly guessed "Plasencia" when it was "Aranjuez".

When Suchet wrote to Soult on 17 September 1812 "Le Général Maitland commande l'expédition anglaise venue de 747: O'Donnell peut réunir 786 692 1102 en y comprenant le corps de l'Anglais Roche. Le 19 août je n'avais que 135 692 1102 à lui opposer." it was clear that "747" was "Sicily" and that the two instances of "692 1102" were "thousand men." The further identifications that 135 stood for 7 and 786 for 12 would be ascertained through further instances. (p.613-614)

Even if longer passages or sentences are encoded, as long as some part of the whole message is in plaintext, the latter provides a clue for codebreakers, who would eventually reveal the code, given sufficient materials.

Mark Urban, The Man Who Broke Napoleon's Codes mentions such examples passim.

Examples of Bazeries

One French letter in cipher from 1813 (see another article) had a passage "Tant par l'effet de la 107 138 170 122 53 171 122 149 et de la 148 54 53 138 169 du 6 95 107 176 que par le 177 169 161 20 très 69 145 51 115 176 qu'elle a à faire". For Bazeries, it needed no imagination to guess the ending reads "que par le se-r-vi-ce très pé-ni-b-l-e (ou fa-ti-ga-n-t) qu'elle a à faire." (Bazeries, Les chiffres secrets dévoilés, p.182)

Other examples are given in Bazeries, Les chiffres de Napoléon Ier. p.21-22. For example, after having identified "71" as "de" and "637" as "ser", it was easy for Bazeries to see that "On s'est 386.996.110. tous les jours pour 593 po 294.637 - 117 - 595 mais il n'y a rien 477 - 71 - 637.887.874." represents "On s'est battu tous les jours pour repousser l'ennemi mais il n'y a rien eu de sérieux."

Cipher in Code

In modern parlance, a code represents words and phrases with figures (or other symbols), whereas a cipher represents letters with symbols (e.g., different letters, figures). Still, many codes had provisions to represent words or names not included in the vocabulary. That is, some part of a code may constitute a set of cipher symbols, which may give a clue to codebreakers.

Codebreaking by John Wallis

The English mathematician John Wallis (1616-1703) solved many letters in cipher (see another article), which included ones completely in code (see another article). Probably, Wallis detected that low numbers represented single letters and attacked sequences of such low numbers first, which would have been similar to breaking a simple cipher. When words represented in cipher were revealed, he must have used them as a clue to guess the meaning of other code groups.

Codebreaking by Etienne Bazeries

Commandant Bazeries decoded Louis XIV's great cipher with entries up to 587 (see another article). Upon first inspection of the letters in code, Bazeries was sure he could solve it. Since a code with such a small vocabulary has to encode syllable by syllable, once an initial breakthrough is made, similar reasoning as in cipher-breaking would reveal the code numbers one after another.

The great cipher before Bazeries did not have the weakness presumably exploited by John Wallis. So, Bazeries assumed that the words "les ennemis" must appear in the letters and found that frequently occurring patterns with slight variation like 124 22 125 46 574 would correspond to "les en-ne-mi-s." This provided the breakthrough needed to proceed.

Stripping Superencipherment of a WWI Code

At the time of World War I, it was recognized by code compilers that "Words spelled out, letter by letter, ... are one of the favorite points of attack by enemy code men" (see another article). But the first American field code was not prepared for such an attack.

A report of 17 May 1918 (Friedman, "American Army Field Codes in the American Expeditionary Forces during the First World War", p.117, Appendix 10) evaluated the first American field code. Given 44 short test messages encoded with monoalphabetically enciphered three-letter code groups, together with the underlying codebook, the evaluators successfully identified the superencipherment.

Code groups for single letters provided the first clue. In particular, TKG, which appeared in succession as follows, as well as those around them were considered to represent single letters.

    BCN TKG TKG BCN

    BCN TKG TKG GRO

TWS BCN TKG TKG GWY

TWS BCN TKG TKG GWY

After trying T and S in vain, L was assumed for TKG, which suggested ED as a likely candidate to follow L-L. To check this assumption, the codebook was consulted, which gave MIH for L and HEG for ED. The third letter of the code group for L is the same as the first letter of the code group for ED (both H), which is consistent with the enciphered groups "TKG GWY" (both G). Once this was confirmed, the preceding groups were considered to be "K-I", forming the word "killed". (By the way, the report pointed out the absence of the word "killed" and other frequent words as defects of the code.)

The report also pointed out that monoalphabetic enciphering provides no security, because it does not matter to codebreakers whether L is represented by MIH or superenciphered as TKG. (Of course, superencipherment can conceal the one-part nature of the code, but after some identifications are made, it only adds some complication and not the true difficulty of a two-part code.)

The report also pointed out that once the codebook is known, the monoalphabetically enciphered "AVA" could immediately be recognized as the code group KYK for H, because this was the only code group for a single letter having the same initial and final letters.
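The two checks described above (that MIH/HEG are consistent with TKG/GWY under a single substitution alphabet, and that AVA has the same letter pattern as KYK) can be sketched in Python. This is only an illustration: the function names are hypothetical, and only the three codebook entries quoted in this article are used.

```python
def extend_mapping(mapping, plain_group, cipher_group):
    """Add plain->cipher letter pairs to a monoalphabetic mapping.

    Returns the extended mapping, or None if the new pairs contradict
    the pairs already known (one-to-one substitution is assumed).
    """
    new = dict(mapping)
    used = {c: p for p, c in new.items()}  # cipher letter -> plain letter
    for p, c in zip(plain_group, cipher_group):
        if new.get(p, c) != c or used.get(c, p) != p:
            return None
        new[p] = c
        used[c] = p
    return new


def pattern(group):
    """Letter-repetition pattern of a group, e.g. 'KYK' -> (0, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in group)


# Codebook entries quoted in the report: L = MIH, ED = HEG, H = KYK.
mapping = extend_mapping({}, "MIH", "TKG")       # assume TKG stands for L
mapping = extend_mapping(mapping, "HEG", "GWY")  # assume GWY stands for ED
print(mapping)  # not None: both assumptions fit one alphabet

# AVA can only be a group whose first and last letters agree, such as KYK.
print(pattern("AVA") == pattern("KYK"))
```

A wrong assumption would make extend_mapping return None, which is the signal to try another candidate letter.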

The report claimed it took only five hours to reveal the enciphering alphabet and another five hours to complete the decoding.

German Code (1918)

Herbert O. Yardley's The American Black Chamber (Chapter VI) tells how he decoded two mysterious messages transmitted from a German station in 1918. The messages were in five-figure code, without address or signature, and were transmitted over and over again nearly every day. There was reason to believe that they were addressed to Mexico.

The message No. 1 had 141 code groups and No. 42 had 138 code groups. The code groups ranged from 00308 to 48001.

Of these, the most frequent group 42635 occurred 16 times and 28709, 28223, and 19707 occurred 8 times each. It may be tempting to identify 42635 with "the" but it must be remembered that words such as "a" and "the" were often omitted in telegraphy. Yardley concluded that the language was English rather than German or Spanish because the assumption of German or Spanish "leads to a blank wall".

The largest numbers were 55927, 55934, 55936, of which the first two occurred twice and thus may be considered to be common words. Since they were towards the end of the alphabet, they might begin with y. Assuming "you" for 55927, the identifications of 55934 as "your" and 55936 as "yourself" nicely fit the sequence in the only dictionary Yardley had at hand.

55927 you
55928 young
55929 younger
55930 youngish
55931 youngling
55932 youngster
55933 younker
55934 your
55935 yours
55936 yourself
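The fit between the code numbers and the dictionary order can be checked mechanically. The following is a minimal sketch in Python using only the ten entries listed above; the function name and the approach of comparing offsets are illustrative and not Yardley's own procedure.

```python
# Consecutive dictionary entries as listed above.
entries = ["you", "young", "younger", "youngish", "youngling",
           "youngster", "younker", "your", "yours", "yourself"]

# Guessed identifications of the three observed code groups.
guesses = {55927: "you", 55934: "your", 55936: "yourself"}


def spacing_matches(guesses, entries):
    """True if the differences between code numbers equal the differences
    between the positions of the guessed words in the alphabetical list."""
    base_code = min(guesses)
    base_pos = entries.index(guesses[base_code])
    return all(code - base_code == entries.index(word) - base_pos
               for code, word in guesses.items())


print(spacing_matches(guesses, entries))  # True
```

A guess such as "yours" for 55934 would fail the same test, since "yours" is eight entries after "you" while the code numbers differ by seven.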

Then, it was noticed that the last two figures of the code groups were all in the range from 01 to 62. This seemed to suggest that the five-figure numbers were not serial numbers but indicated a three-figure page number and a two-figure word or line number.
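The page-and-line reading can be illustrated with a short Python sketch. It uses only the code groups quoted in this article (the full messages are not reproduced here), and the function name is hypothetical.

```python
# Code groups quoted in this article.
groups = ["42635", "28709", "28223", "19707", "55927",
          "55934", "55936", "21206", "31511", "31259"]


def split_page_line(group):
    """Split a five-figure group into (page, line): first three figures
    as the page number, last two as the word or line number."""
    return int(group[:3]), int(group[3:])


lines = [split_page_line(g)[1] for g in groups]
print(min(lines), max(lines))  # 6 59
```

If the groups were plain serial numbers, the last two figures would be expected to spread over 00 to 99 rather than stay within a limited range.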

Then, Yardley examined the most frequent code group 42635 by listing what came before and after this code group. The results showed that it was often preceded by the same group but was always followed by a different group. Yardley thought it must be a termination or ending of some sort. But neither a period (.) nor any common words were likely because 16 occurrences in a total of no more than 279 words were too many. Considering that the place of 42635 in a sequence up to about 56000 suggested r or s as the initial letter, Yardley assumed it might be a plural ending "s".

Then, with these findings, the beginning of the telegram No. 42 was examined: "19707 21206 31511 31259". Of these, 19707, being one of the groups that occur 8 times, must be a common word. Searching pages around p.197 in an English dictionary of about 600 pages, "for" was found on p.203. Yardley's dictionary thus appeared to run about 6 pages ahead of the dictionary actually used by the Germans. In order to decode the next group 21206, pages around p.212+6=218 were searched and "German" was found on p.217. Then, pages around p.315+6=321 were searched to identify 31511. Though it had to be borne in mind that the same offset might not apply after some hundred pages, it must be a word beginning with m. Indeed, "minister" on p.312 seemed right.

The next group 31259 should be found 3 pages earlier. The telegram was supposed to be addressed to Mexico. So "For German Minister Mexico" seemed to be the plaintext for the first four code groups.
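The page arithmetic above can be sketched as follows. This is an illustration under an assumption of my own, namely that the offset of the nearest confirmed word is the best guess for a new group; Yardley does not describe a formal rule, and the function name is hypothetical.

```python
def predict_page(code_page, anchors):
    """Estimate the page in one's own dictionary for a page number taken
    from a code group, using the offset of the nearest confirmed anchor.

    anchors: list of (code_page, own_page) pairs already identified.
    """
    nearest_code, nearest_own = min(anchors,
                                    key=lambda a: abs(a[0] - code_page))
    return code_page + (nearest_own - nearest_code)


# "for": code page 197, found on p.203 of the dictionary at hand.
print(predict_page(212, [(197, 203)]))  # 218

# After "German" (212 -> 217) and "minister" (315 -> 312) are confirmed:
print(predict_page(312, [(197, 203), (212, 217), (315, 312)]))  # 309
```

The second call reproduces the expectation that 31259 lies three pages before "minister" in the dictionary at hand.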

One might continue such a process to gradually reveal code groups one by one. Alternatively, one may identify the dictionary that has "for" as the 7th line or word on p.197, etc. As it turned out, it was the English-French half of Clifton's Nouveau Dictionnaire Français.

The decoded message is as follows. Words and names not in the dictionary such as Bleichroeder (Wikipedia) seem to be represented by portions of words. (Association of code groups with plaintext portions is my conjecture. Code groups such as 28223, 28709 seem to have some predefined function rather than mere nulls but details are not known to me.)


Mansfield Dictionary Code (1930s)

As demonstrated by Yardley's example above, as long as assignment of plaintext words to numbers is in alphabetical order, the relative position of a number allows estimating an approximate place in a dictionary. Mansfield's progressive dictionary list is a reference tool to facilitate such estimation.

Lanaki (1996), Classical Cryptography Course, Lecture 20: Codes demonstrates the solution of a very short code message. (I made some corrections in the following.) It cites [DAGA] D'agapeyeff, Alexander, "Codes and Ciphers," Oxford University Press, London, 1974; [MANS] Mansfield, Louis C. S., "The Solution of Codes and Ciphers", Alexander Maclehose & Co., London, 1936; and [MAN1] Mansfield, L.C.S, "One Hundred Problems in Cipher", London, 1936. According to a posting "Mansfield Dictionary Code", this example appears to have been created by D'agapeyeff by picking code groups from Mansfield to suit his demonstration.

The message in code is as follows.

55381 42872 35284 55381 45174 56037 55381 46882 23171 44234 55366 55381 00723 12050 61571 36173 55381 56442

Arranging the above code groups in ascending order results in

00723 12050 23171 35284 36173 42872 44234 45174 46882 55366 55381 (5 times)
56037 56442 61571
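The sorting and counting step can be reproduced with a few lines of Python. This is just a sketch of the bookkeeping; the variable names are arbitrary.

```python
from collections import Counter

message = ("55381 42872 35284 55381 45174 56037 55381 46882 23171 "
           "44234 55366 55381 00723 12050 61571 36173 55381 56442").split()

counts = Counter(message)

# Groups are zero-padded to five figures, so string order is numeric order.
for group in sorted(counts):
    print(group, counts[group])

most_common_group, times = counts.most_common(1)[0]
print(most_common_group, times)  # 55381 5
```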

The group "55381", which occurs no less than five times in this short message, is assumed to be THE. (The possibility that THE is omitted in the message, or that five times in such a short message is too many, was not considered, at least to begin with.) The highest number "61571" may be assumed to be a word beginning with W. ("You" may also have been a possibility.)

Now, use is made of a tool called Mansfield's progressive dictionary lists, which provide serial numbers for words beginning with any two letters in dictionaries of 10,000-100,000 words. For example, given the total number of words in the dictionary would be around 65,000, the list tells that the last word beginning with DA has a serial number 11646 and the last word beginning with DE has a serial number 12850, which implies the group 12050 is a word beginning with DE. (Today, a calculator makes things easier. If you have a dictionary of say 800 pages, the page 120 out of a total of 650 would correspond to around page 148 (=120/650*800) of your dictionary.)
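Both the table lookup and the page scaling can be sketched in Python. The table below contains only the two section ends quoted above for a dictionary of about 65,000 words; a real progressive list has an entry for every two-letter section, so this is an illustration of the lookup and not a usable table. The function names are hypothetical.

```python
from bisect import bisect_left

# (serial number of the last word in the section, section bigram),
# in ascending order. Only the two values quoted in the text are included.
SECTION_ENDS = [(11646, "DA"), (12850, "DE")]


def section_of(serial, table=SECTION_ENDS):
    """Return the bigram section containing a serial number, or None if it
    lies beyond the table. With an incomplete table like this one, anything
    at or below the first listed end is attributed to the first section."""
    ends = [end for end, _ in table]
    i = bisect_left(ends, serial)
    return table[i][1] if i < len(table) else None


def scale_page(page, total_pages, own_total_pages):
    """Map a page of one dictionary onto another of different length,
    assuming the alphabet is spread proportionally in both."""
    return round(page / total_pages * own_total_pages)


print(section_of(12050))          # DE
print(scale_page(120, 650, 800))  # 148
```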

Working similarly for the other code numbers, we may have the following:

THE RE-- OF-- THE RO-- TO-- THE SE-- HA-- RE-- TH-- THE AE-- DE-- WA-- OV-- THE TO--

While, of course, the border between bigrams varies depending on the particular dictionary used, at least the position of THE (55381) is confirmed. Somewhat preceding this is 55366 (TH--). Of candidates such as THAN, THANK, THAT, etc., THAT would be the most probable. Two groups 56037 and 56442 are in the TO-- section, which starts at 56037 and ends at 56466. So the former would be TO. The latter may be TOWN. Similarly, the R- words may be guessed by considering, e.g., that 42872 is about 300 words from the end of the RA section at 42573.

THE RECONNAISSANCE OF-- THE ROUTE TO THE SE-- HA-- REVEALED THAT THE AE-- DE-- WA-- OV-- THE TOWN.

Similarly, AE-- may be AEROPLANE. And DE-- is one quarter of the way into the DE section, which suggests DEF--. Thus, "aeroplane defense" sounds right.

The rest may be fairly obvious.

THE RECONNAISSANCE OF THE ROUTE TO THE SEA HAS REVEALED THAT THE AEROPLANE DEFENSE WAS OVER THE TOWN.

According to [MANS], actually "AEROPLANE DEFENSE" should be "air defenses."

Commercial Code

Some telegraphic codebooks provided for some measure of secrecy (see another article). Continental codebooks like Sittler represented a word with a four-digit number consisting of a page number and a line number and correspondents could make their own arrangement to reassign page numbers to keep the secrecy of their message.

Etienne Bazeries demonstrated how easily such enciphering could be broken (Bazeries, Les chiffres secrets dévoilés, p.142-145).

He was given the following message encoded with Sittler.

2213 2379 2836 5034 6360 9051
1302 1086 7131 2394 7514 1933

Given a hint that the message was about finance, he started by assuming some words such as bourse, titres, millions were used in the message.

The word million appears on real page 57, line 04 in Sittler. Although the page numbers might be reassigned, the line number 04 is assumed to be unchanged. Then, it is noted that the fourth group 5034 contains both 0 and 4. The 1st and 3rd digits may indicate the page and the 2nd and the 4th digits may indicate the line. If this group indeed is million, the preceding group 2836 would be a figure. From its 2nd and 4th digits, one might look for a numeral appearing on line 86 on some page. This finds 17 on page 27 and 41 on page 74.

Considering that the real page 57 is assigned "53" (1st and 3rd digits), the difference being 4, it is noted that "23" formed by the 1st and 3rd digits of 2836 is also displaced from the real page 27 by 4.

With the assumptions thus far, the preceding group 2379 must refer to real page 31 (=27+4), line 39, which gives emprunter (borrow), which nicely fits the context.
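The digit manipulation in this reasoning can be sketched in Python. The interleaved reading (1st and 3rd digits for the page, 2nd and 4th for the line) and the shift of 4 are the hypotheses made above, confirmed in the text only for these three groups; whether the same shift holds for every page of the message is not stated, so it is left as a parameter. The function names are hypothetical.

```python
def split_group(group):
    """Read a four-digit group as (enciphered page, line):
    digits 1 and 3 form the page, digits 2 and 4 form the line."""
    page = int(group[0] + group[2])
    line = int(group[1] + group[3])
    return page, line


def real_page(enciphered_page, shift=4):
    """Undo the assumed page reassignment (real = enciphered + shift)."""
    return enciphered_page + shift


for group in ["2379", "2836", "5034"]:
    page, line = split_group(group)
    print(group, "-> real page", real_page(page), "line", f"{line:02d}")
```

For the three groups this gives page 31 line 39, page 27 line 86, and page 57 line 04, matching emprunter, 17, and million(s) as identified above.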

The plaintext turned out to be "Je désire emprunter 17 millions, pouvez-vous vous charger de les réaliser et à quelles conditions?"

Panizzardi Telegram (1894)

The Panizzardi telegram, intercepted at an early stage of the Dreyfus Affair, was also enciphered based on Baravelli's commercial code. As with the above, the unchanged line numbers gave a clue to the codebreakers. See another article for details.

Statistical Analysis

The above two examples are only the simplest cases. Actual codebreaking seldom proceeds as readily as these and usually requires much more material in code to gain an insight.

After finding out what code groups occur how frequently, codebreakers examine what other code groups tend to precede or follow each particular code group. Before computers or tabulators (Wikipedia) were introduced, such a task required many typists, as told by Yardley (Chapter XIV). By such a tedious process, code groups are identified one by one.
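What the typists tabulated by hand can be sketched in a few lines of Python. As sample data, the sketch uses the short message quoted in the Mansfield section of this article, which is far too small for real analysis; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def contexts(groups):
    """For every code group, count which groups precede and follow it."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for prev, cur in zip(groups, groups[1:]):
        after[prev][cur] += 1
        before[cur][prev] += 1
    return before, after


message = ("55381 42872 35284 55381 45174 56037 55381 46882 23171 "
           "44234 55366 55381 00723 12050 61571 36173 55381 56442").split()

before, after = contexts(message)
print(sorted(after["55381"]))   # groups that follow 55381
print(sorted(before["55381"]))  # groups that precede 55381
```

In this sample, 55381 is followed by a different group each time, which is what one would expect of an article such as THE.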

Codebreaking by Espionage

Codebreakers were occasionally given a head start by espionage.

Spanish Codes (1918)

Yardley's The American Black Chamber, Chapter VIII, describes how he used agents to obtain information to break Spanish codes.

In 1918, Yardley was urged to solve Spanish codes because Spain was suspected of assisting German espionage. Yardley sent an agent to a Spanish consulate in South America. The agent stole into the consulate at night, opened the steel safe, and found the diplomatic code, but photographing the entire codebook would take time because only a few pages could be photographed each night. Although the code did not match the messages between Spain, America, and Germany, Yardley had expected that. Analysis indicated there were about ten different codes for different channels, but Yardley thought there were only one or two basic codes and the others were merely secondary codes based on them.

Yardley hired a woman to draw information from a Spanish diplomatic secretary in the Spanish embassy in America. As it turned out, the Spaniards used 25 different codes for different stations, which were grouped into 9 different groups.

Then, a circular telegram was sent from the Spanish Foreign Office in Madrid in four different codes: to Washington and Costa Rica in code number ("indicator") 301, to Lima in 141, to Santo Domingo in 32, and to Panama in 74.

Soon, the agent provided the photograph of the Code No.74. It assigned four-figure numbers to alphabetically arranged words. The plaintext entry was written as, e.g., "abdic-ar-acion-es", which represented various inflected forms of the word, of which the decoder should identify the intended one according to the context. (This kind of entry is also seen in Napoleonic codes.)

This allowed decoding the circular telegram. With the plaintext revealed, it was a matter of time to identify code groups for the other codes.

Gray Code

Yokoi Toshiyuki, Teikoku Kaigun Kimitsu Shitsu (The Black Chamber of the Imperial Navy) (1953) describes decoding by the Japanese of the Gray Code of the US Department of State.

The Japanese Navy designated the code as NADED because the code group NADED occurred very frequently. First, code groups of the intercepted messages (which were in the form of either CVCVC or CVCCV, with C being consonant and V vowel) were alphabetically arranged and their occurrences were recorded. The most frequent group NADED was readily identified as a period.

But this was still far from being able to decode messages. The Navy asked the military police to obtain waste paper from an American consulate. Laborious searching of the waste was sometimes rewarded with a draft, which revealed some code groups and their plaintext counterparts. When the US ambassador submitted a memorandum of the US Department of State to the Japanese government, a telegram intercepted before that had counterparts for the memorandum, which revealed many code groups.

Before long, about 5000 code groups were identified. Although it was only a small fraction of what seemed to have 100,000 entries, it allowed the general meaning of telegrams to be known.

Japanese Attempts at British Codes

In late 1934, the Navy asked the military police to get hold of a British codebook. When the British consulate in Sapporo was moving, a Japanese typist helping the work dropped from a window of the second floor a codebook, which was picked up and carried away by a workman. In about an hour, the codebook was returned to the original casket. In March 1935, the stolen codebook, which was a British inter-departmental codebook, allowed decoding of a telegram of the commander of the British Eastern fleet in Hankou to the ministers at home.

It was, however, a rarely used one. The main code of the British Foreign Ministry was a two-part code. In the summer of 1935, when some 400 code groups had been identified, the code was replaced.

Again, they resorted to the military police, who apparently had already made attempts at the request of the army. But its first attempt at an embassy in Tokyo had been a fiasco. The agent could open the safe but was detected and arrested. (Yokoi deplores their naïveté by referring to Yardley, whose agent stole into a consulate in South America rather than in Washington (Yardley p.192).)

Success came from Osaka. A military police officer disguised as a tailor had obtained access to the British consulate in Osaka since late 1934 and soon succeeded in inducing a clerk to obtain an imprint of the key to the safe in wax. One night when the consul was absent, Japanese agents opened the safe with a copy key and photographed ten-plus codebooks.

The codebooks were delivered to the Navy's secret agency in Shanghai. However, this was not the end, for important telegrams tended to be missing in the intercepts. As it turned out, they were sent via wireline rather than radio. The Japanese agency had to bribe telegraph operators of the Commercial Pacific and Great Eastern cable companies to obtain copies of telegrams.

Other Examples

Similar activities abound in history. When Yardley was struggling with Japanese telegrams (see another article), he contemplated stealing into the Japanese consulate in New York (Yardley p.264) or handing the Japanese military attaché a memorandum for transmission to Japan in order to obtain an encoded message of a known plaintext (ibid. p.266-268).

In 1935, the US Navy stole into a Japanese naval attaché's apartment in Washington in an attempt to obtain information on the RED cipher machine.

Back in 1905, it was discovered that Japanese codebooks at the consulate at The Hague had been secretly photographed by a Russian agent.



©2014 S.Tomokiyo
First posted on 1 May 2014. Last modified on 1 January 2016.