Listing Recurring Polygrams from Ciphertext (with Perl)

Finding repetition is an important step in codebreaking. Many tools are available to count bigrams from ciphertext (or any text) (see, e.g., another article). Some can even find trigrams, tetragrams, pentagrams, and possibly more.

However, longer n-grams may occur repeatedly. With monoalphabetic substitution ciphers without homophones, repetition of a ten-letter word causes a recurring 10-gram. (With homophonic substitution ciphers, depending on the length of the ciphertext, repetition of a long sequence is relatively rare, and inspecting the context of the few repeating pentagrams is usually sufficient to find longer sequences, if any.) When one is faced with a ciphertext with many recurring patterns, it may be desired to find recurring n-grams without limit to n.

I wrote a simple Perl script to list recurring polygrams. (Here, I define a polygram to be an n-gram with n being large and indefinite.)

Polygram Script

#!/usr/local/bin/perl
# 複数出現するn-gram (n>=LEN)を検出

$cipher=("CIPHERTEXT");#暗号文を入力

$LEN=10; # $LEN-gram以上の繰り返しを取得
$i=0;
while ($i<length($cipher)-$LEN)
{

$j=$i+$LEN;

$isany=0;

while ($j<length($cipher)-$LEN)

{

if(substr($cipher,$i,$LEN) eq substr($cipher,$j,$LEN))

{ $isany++;

$k=1;

while( substr($cipher,$i,$LEN+$k) eq substr($cipher,$j,$LEN+$k) &&

$i+$LEN+$k < length($cipher)

) {$k++;} #比較文字列を伸ばせるだけ伸ばす; 後半の条件判断は不要のはずだが念のため

$recurring=substr($cipher,$i,$LEN+$k-1);

substr($cipher,$j+$LEN-1,1,"*"); #２回ヒットしないようにXで置き換えておく

if ( $isany==1 ) { print ("$recurring\n");}

print ("$recurring\n");

$j=$j+$LEN+$k;

}

else { $j++;}

}

if ($isany > 0) { $i=$i+$LEN+$k;}

else { $i++;}

}

Demonstration

As an example, when the following is put in place of CIPHERTEXT and LEN is set to 10,

12160176164864242114256.12160542264721648642384252422161764138.:671147122672422141762384176024472172242.70144120847250248625624181216050176226151764556^_86525425417240521782562701761365566714482241301712221071342522423801712160216224-245058156271507132176222612160122401345647216480556126214425252220242486742417^_6^_7^_27132864721722424136212262417458421417482172725^_4^_122210144120847272742507130713262122024212160.18.1765862167652524656240228254246472671547425680130713205562405254241227250725613017026220136256270.21471242413017621027617486472124854254.22242120130713255613623805252212132138116254827017148258242240252428455024218.25825256174172524614240136.2384725723801388556120242527274621471216024641301442450176685256256422024212160134872487417252422121326212222425484212172521222101648647222242842527017220241327270172384170727024413614.02407256470171467021470172121322202422202423805250.2121725041725240244156276727216224240841328055024213624013672782170240122212262402145072716486472220244171622446172121322402224050725455024241817017224701205562541216041717015847222265624021025655024241745848217272541216012240130821202425215088412025424011017417224512854542207132727562128042202422240162134172241227501586242405222713071322162384713216480550242134254174224242174.1841202545560247212227525240252561365425651442404171216072716486472212132121601224013450000

01441208472
01441208472
121601224013
121601224013
12160122401345
22024212160
22024212160
1648647222
1648647222
8055024213
8055024213
0727164864722
0727164864722
121601224*13
121601224*13

(before debugging substr($cipher,$j,$LEN-1,"*") into substr($cipher,$j+$LEN-1,1,"*"), the result was as follows)

01441208472
01441208472
121601224013
121601224013
12160122401345
22024212160
22024212160
1648647222
1648647222
8055024213
8055024213

Limitations

•The first "121601224013" should be "12160122401345", but it is printed only up to the length of the second matching "121601224013".

•More importantly, it doesn't seem to work properly when "121601224013" "12160122401345" "12160122401345" occur in this order (i.e., the second and third occurrences are longer than the first instance). I haven't debugged on this issue. (The cause may be related to the line "substr($cipher,$j,$LEN-1,"*")", but commenting out the line didn't fix it.)

•I'm not sure whether I treated boundary conditions properly. (I haven't checked if it works properly when two different recurring patterns follow one another without a break.)

•Redundant entries may be included in the output. The line "substr($cipher,$j,$LEN-1,"*")" is intended to prevent it by corrupting the last character, but it may not work when there are two matching patterns both with the corrupted character.

•It is slow for a longer ciphertext. The above result can be obtained in an instant (the input is about 1400 bytes). But when I tried with a ciphertext of about 14000 bytes long, it took about twenty seconds. I know this is an O(N²) algorithm, in which processing time increases with N² (N is the length of CIPHERTEXT), but I'm not sure there is not some bug causing the sluggishness.

Despite all these limitations, I hope this simple script helps codebreaking of ciphertext with many repeating patterns.