Bigram Count (with Microsoft Excel)

Bigram Count Tools for Simple Text

In codebreaking, bigram count is often useful. A bigram is a pattern of a letter (symbol) followed by another. Bigram count shows by what letters (symbols) a particular letter (symbol) is often followed or preceded. In English, for example, if a symbol is always followed by some particular symbol, the combination may be Q+U.

Bigram frequency data is available at, e.g., Wikipedia.


Today, for a simple text, resources are available on the Web to do the bigram count:

Bigram Counter (Mount Holyoke College)

Frequency Analysis (Dcode)

Monogram, Bigram and Trigram frequency counts (Practical Cryptography)

Source codes are also found by Googling and may be useful for tuning for specific purposes.

Handling Symbols Consisting of More Than One Letter

The tools above assume the input text (whether encrypted or not) is written in ordinary letters and do not serve for the purpose of counting bigrams in a ciphertext such as "40 37 22 40 23 13 23 38 17 31 25 ...", the kind of ciphertext very often encountered in history. Actually, there is a tool on the Web that can handle this type of input.

Online NGram Analyzer (developed by p00d13 1337)

This might have been the end of this article.

But before finding this tool, I relied on an approach with Microsoft Excel, which is presented below.

(Even before that, for a long time, I used a tedious way of creating a list of bigrams with a text editor and then following an ordinary manner to count the records. For example, a regular expression can be used to replace "40 37" with "40) (40 37) (37"; "22 40" with "22) (22 40) (40", etc. to obtain a list of bigram records (40 37) (37 22) (22 40) (40 23) (23 13) (13 23) (23 38) (38 17) (17 31) (31 25)". This reduces the problem into record counting, which can be done with Microsoft Excel as described in another article.)

Bigram Count with Microsoft Excel

The following presents a method of generating a bigram listing from text, which may consist of words or figure groups. Once a bigram listing is obtained, the record counting method of another article can be used to generate bigram statistics.

Googling finds several implementations for creating a bigram listing by using simple macros in Microsoft Excel.

Sub bigrams()
    Dim K As Long, i As Long
    K = 2
    tokens = Split(Range("A1").Value, " ")
' Splitting the text in A1 into elements in an array "tokens"
' (The text to be analyzed is put in A1.)

    For Each a In tokens
        Cells(K, 1).Value = a
        K = K + 1
    Next a
' First column receives words in the array

    K = 2
    For i = LBound(tokens) To UBound(tokens) - 1
        Cells(K, 2).Value = tokens(i) & " " & tokens(i + 1)
        K = K + 1
    Next i
' Second column receives a word concatenated with the following word (i.e., a bigram listing)

    K = 2
    For i = LBound(tokens) To UBound(tokens) - 2
        Cells(K, 3).Value = tokens(i) & " " & tokens(i + 1) & " " & tokens(i + 2)
        K = K + 1
    Next i
' Third column will be a trigram listing

End Sub

Those who are not accustomed with Visual Basic (VBA) (like me) will wonder "Where I should write this script?". (The following worked for me but I'm not sure it is correct. I use "Microsoft Excel 2016" in Japanese.)

First, paste the text to be analyzed in cell A1.

Click on "Developer" (開発), then "Visual Basic", which opens a VBE (Visual Basic Editor) window. Click on "Insert" (挿入), then "Module" (標準モジュール (M)), which opens a blank window and adds a default name "Module1" in the list in the left side bar.

Paste the above script in the blank window. Just closing the window seems to save the script.

Right-click on "Module1" in the side bar and choose "Export File" (ファイルのエクスポート (E)). (I used the name "ModuleBigram.bas" but any name with the extension ".bas" will do.)

Right-click on "Module1" in the side bar and choose "Import" (ファイルのインポート (I)). (I'm sure I'm doing some redundant operations here but I know no better way.)

Close the VBE window.

In the spreadsheet (of which the cell A1 contains the text to be analyzed), click on "Developer" (開発) again and now select "Macros" (マクロ). Select "bigrams" (which is the name of the above script) and click on "Run" (実行).

The column A will show the original text split into tokens such as words or figure groups. The column B will provide the bigram listing wanted. The column C will give a trigram listing.

It will be straightforward to modify the above script to list bigrams in reverse order (to serve as a sort key when you want to see what letter (group) precedes (rather than follows) a particular letter (group) most often.) etc.



©2017 S.Tomokiyo
First posted on 25 December 2017. Last modified on 25 December 2017.
Cryptiana: Articles on Historical Cryptography
inserted by FC2 system