The first step in codebreaking is to count the frequency of symbols/letters occurring in a cryptogram. The resulting frequency counts of the symbols/letters help identifying the plaintext letters by reference to known statistical characteristics of the letters. Such frequency analysis is well-known by Edgar Allan Poe's *Gold Bug* as well as many other examples (see another article for the example of Benjamin Church's cipher). Letter frequency data for various languages is available at, e.g., Wikipedia and Practical Cryptology.

Today, for a simple text, resources are available on the Web to do the frequency count:

Frequency Counter (CryptoPrograms)

Frequency Analysis (Rumkin.com)

Frequency Counter (Mount Holyoke College)

Frequency Analysis (Dcode)

Monogram, Bigram and Trigram frequency counts (Practical Cryptology)

(After uploading this article, I found the following, which serves for the purpose of this article below.)

However, the above tools are not applicable when symbols in ciphertext consists of more than one letter such as letter groups or two (or more) digit figures. Whatever symbols are used, as long as there are no more than 26 different symbols (as in the case of monoalphabetic cipher, polyalphabetic cipher, or transposition cipher), the cipher symbols may be replaced by the letters a-z, to which the above tools may be applied. However, frequency counting is also required for ciphertexts employing more than 26 different symbols (as in the case of homophonic substitution cipher, code, or a combination of substitution cipher and code, sometimes called a nomenclator).

For example, when there is a ciphertext "40 37 22 40 23 13 23 38 17 31 25 ...", a text editor can be used to separate the two-digit figure symbols into records such as:

40 37 22 40 23 13 23 38 17 31 25 |

How can one count the frequency of these records? For a long time since the days of MS-DOS (something like "Command Prompt" of today's Windows operating system), I used tools such as SORTF.EXE or JS.EXE. These were tools for sorting records but had an option to count the frequency of unique records. However, I found to my dismay that these do not appear to work on Windows 10.

What can I do? Today, many people use Microsoft Excel or other spreadsheet software for data analysis. Such an advanced application must be able to do the simple job of record counting.

Indeed, Microsoft Excel provides the "FREQUENCY" function but it does not seem to serve for the above purpose.

The best way to count the frequency of records of any symbols (that can be handled on PC) appears to be: to create a unique list from the input record list and then use the COUNTIF function.

The following illustrates this procedure with an example ciphertext: "40 37 22 40 23 13 23 38 17 31 25 36 28 31 22 23 38 36 25 16 24 15 30 36 16 39 21 22 96 12 13 31 31 32 40 35." (I use "Microsoft Excel 2016" in Japanese.)

A text editor can be used to separate the figures into records by replacing a space with a linebreak ("\n" in my editor). Then, the list of figures is pasted into the columns A and B of Excel.

The column A is to be used as the input record list. The column B is sorted ("Sort"; "並べ替え") and duplicate records are gathered in unique records ("Remove Duplicates"; "重複の削除"). The resulting column B is a unique list of symbols (figures/records) used in the input record list. (It seems "Remove Duplicates" without sorting first does not provide the desired result.)

Enter "=COUNTIF($A$1:$A$36,$B1)" in cell C1. $A$1 and $A$36 indicate the starting cell and the last cell in the input record list of column A. $B1 specifies the symbol in cell B1 that is to be counted from the input record list. (The dollar sign ($) *must not* be placed before "1" because this should be what is called "relative cell reference" rather than "absolute cell reference".) (NB. I find "A:A" can select the entire column. So, **"=COUNTIF(A:A,$B1)" would be simpler.**)

Copy the formula in cell C1 to the subsequent cells of column C (to the same length as column B). Thanks to the "relative cell reference", the symbol to be counted changes to an appropriate cell. The resultant column C provides frequency counts for the corresponding cells in column B.

©2017 S.Tomokiyo

First posted on 26 November 2017. Last modified on 31 December 2017.

Articles on Historical Cryptography