DBSCAN image-clustering based automated reading and decryption of fairy cipher glyphs from children's book code puzzle!
(see the original full ipynb file in my aganse/fowl_fairy_code repo in GitHub…)
This wonderful kids' series is fun not only for the stories themselves, but also because each of the first several books involves a cipher puzzle with "fairy hieroglyphics". I love code puzzles! In the electronic form of the books I discovered the hieroglyphic sequence was moved to the back of the book, all perfectly lined up in matrices over a few pages at the end. And I thought, hey that seems like it'd be easy to parse and decrypt on a computer, just like the main character did!
It has worked out moderately well so far. My original plan was to use frequency analysis and/or an iterative algorithm using dictionary comparison to auto-decrypt the glyphs (after they were converted to numbers) without the cipher key. (Some refs I was looking at are this paper and this Python package.) But before getting further with that I've simply manually entered the key from the book. The glyph recognition and parsing is still all automated, which is why the result isn't perfect at the end - in particular there remain some missing letters that were considered noise by the clustering; maybe some further tuning could improve that. But it's enough to read the message!
The python notebook itself (linked above) contains all the code, but for readability/length I just pasted the results and annotations for each step below.
There were four pages (the image files here) full of fairy glyphs, plus one page with the key (hopefully we can solve without it, but it's here just in case). We read the images of glyphs, and break each image first into individual rows and then each row into individual glyphs in the order that they appear in the ciphertext. Here we see how the mean of each image row shows us where we can split apart the rows. (Same approach is then used to split apart the glyphs.) Note the mean intensity=1 rows are all whitespace, and the low mean intensity rows are where the black glyphs are. The way the split works, the whitespace rows end up all being 1 pixel high (those get filtered out by the "len(row) > 1" conditionals), and the glyph rows end up being more than that (like about 21 high). But you can see how this method relies on the precisely laid-out format of the glyphs in these matrices.
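The row/glyph splitting described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `split_on_whitespace` is made up here, and it assumes a grayscale image scaled so that 1.0 is white. Splitting at every all-white line leaves the blank lines as 1-pixel pieces, which are then dropped by a length check.

```python
import numpy as np

def split_on_whitespace(img, axis=0):
    """Split a 2D grayscale image (1.0 = white) into bands along `axis`.

    axis=0 splits a page into rows of glyphs; axis=1 splits one row
    into individual glyphs. (Hypothetical helper for illustration.)
    """
    # Mean intensity of each image row (axis=0) or column (axis=1).
    means = img.mean(axis=1 - axis)
    # Indices of the lines that are entirely white.
    blank = np.where(np.isclose(means, 1.0))[0]
    # Cutting at every blank index makes each blank line its own
    # 1-pixel piece (or the leading line of a glyph band).
    pieces = np.split(img, blank, axis=axis)
    # Keep only the pieces that are more than 1 pixel thick.
    return [p for p in pieces if p.shape[axis] > 1]

# Toy "page": white background with two bands of dark blobs.
page = np.ones((9, 8))
page[2:5, 1:3] = 0.0
page[2:5, 4:7] = 0.0
page[7:9, 1:3] = 0.0

rows = split_on_whitespace(page, axis=0)
glyphs = [g for row in rows for g in split_on_whitespace(row, axis=1)]
```

Note that each kept band still carries one leading white line, which is harmless as long as every glyph gets the same treatment.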
Here are examples of what the rows look like...
And then the individual glyphs, split into a list of 2D numpy arrays the same way, are reshaped to vectors for clustering (but reshaped back to 2D when plotting).
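As a small sketch of that reshaping step (the helper names are hypothetical, and it assumes every glyph image has the same shape, which the regular layout of the pages should make plausible):

```python
import numpy as np

def glyphs_to_vectors(glyphs):
    """Flatten a list of equally-shaped 2D glyph arrays into a 2D matrix."""
    shape = glyphs[0].shape
    if any(g.shape != shape for g in glyphs):
        raise ValueError("all glyphs must share one shape before clustering")
    # One flattened glyph per matrix row.
    X = np.stack([g.reshape(-1) for g in glyphs])
    return X, shape

def vector_to_glyph(vec, shape):
    """Undo the flattening, e.g. for plotting a cluster member."""
    return vec.reshape(shape)

# Two dummy 21x15 "glyphs" (sizes chosen only for illustration).
demo_glyphs = [np.zeros((21, 15)), np.ones((21, 15))]
X_demo, glyph_shape = glyphs_to_vectors(demo_glyphs)
```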
So now let's cluster that list of over 2600 glyph vectors, aiming for the known number of possible characters (29, admittedly known from the key: the 26 alphabet characters plus space, period, and comma). DBSCAN does not take a target cluster count as a parameter, so 29 is what we hope the clustering yields rather than something we set directly. Then use the clusters to label each glyph, and that becomes our ciphertext we can work with further. DBSCAN definitely seemed to work better for this than the other clustering algorithms I tried, but all were pretty sensitive to parameter choice.
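For a feel of what the clustering call looks like, here is a toy sketch using scikit-learn's `DBSCAN` on made-up data. The `eps` and `min_samples` values, the synthetic "glyph" prototypes, and the helper name `cluster_glyphs` are all assumptions for illustration; the values that worked on the real glyphs are in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Three made-up "glyph" prototypes as 30-pixel vectors.
prototypes = np.zeros((3, 30))
for k in range(3):
    prototypes[k, 10 * k:10 * (k + 1)] = 1.0

# 20 slightly noisy copies of each prototype, mimicking small
# pixel-level variations between instances of the same glyph.
X = np.vstack([p + rng.normal(0.0, 0.02, size=(20, 30)) for p in prototypes])
# One mid-gray vector that is far from every prototype.
X = np.vstack([X, np.full((1, 30), 0.5)])

def cluster_glyphs(X, eps=1.0, min_samples=3):
    """Label each row of X; -1 marks points DBSCAN treats as noise."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

labels = cluster_glyphs(X)
n_clusters = len(set(labels) - {-1})
```

The integer labels then play the role of ciphertext symbols, and anything labeled -1 is a glyph the clustering could not place.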
Let's look at how well the clustering did on the glyphs here. If the clustering worked perfectly there would be exactly 29 labels, one for each letter plus comma, period, and space from the key, and all elements in each cluster would be the same symbol. But there are slight variations in the glyphs from the text (even from its electronic form - maybe due to image compression?) and the clustering is not perfect, which we'll see below. Still, DBSCAN with these parameters worked better than the other clustering algorithms I tried in the notebook. An additional handy thing about this particular algorithm is that it involves no random initialization, so rerunning it on the same data reproduces the same answer. All the algorithms had difficulty distinguishing the comma (few dots) from the period (several dots) glyphs and mixed them.
Label number used in the ciphertext is at far left. Only plotting first 20 glyphs in each cluster but note many clusters have hundreds of glyphs...
Now we group the glyphs by their cluster labels (the integer label numbers) to find the letter frequencies.
| freq | label |
|---|---|
| 341 | 0 |
| 183 | 1 |
| 121 | 2 |
| 93 | 3 |
| 130 | 4 |
| ... | ... |
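Counting how often each label occurs takes only a few lines. A minimal sketch with NumPy (the function name is hypothetical, and the label list here is toy data, not the real ciphertext):

```python
import numpy as np

def label_frequencies(labels):
    """Return {label: count} for a sequence of integer cluster labels."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

toy_freqs = label_frequencies([0, 0, 1, -1, 0, 1])
```

The -1 entry counts the glyphs DBSCAN marked as noise, which is worth watching since those become the gaps in the decoded text.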
The hope here is that we can use the standard English letter frequencies and see how readily we can come up with a method to converge onto the actual cipher. From the Wikipedia entry on frequency analysis in cryptanalysis: "Given a section of English language, E, T, A and O are the most common, while Z, Q and X are rare. Likewise, TH, ER, ON, and AN are the most common pairs of letters (termed bigrams or digraphs), and SS, EE, TT, and FF are the most common repeats. The nonsense phrase 'ETAOIN SHRDLU' represents the 12 most frequent letters in typical English language text."
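The crudest version of that idea is to rank the labels by frequency and pair them off against an English frequency ordering. The sketch below shows only that naive first guess, not a working solver and not what the notebook ended up doing. `ENGLISH_ORDER` is an approximate ordering (published orderings differ past the first dozen letters), and putting the space first is an assumption.

```python
from collections import Counter

# Approximate most-to-least-frequent ordering; space assumed most common.
ENGLISH_ORDER = " etaoinshrdlcumwfgypbvkjxqz"

def frequency_guess(labels):
    """Map each cluster label to a character purely by frequency rank."""
    # Ignore DBSCAN noise (-1) when counting.
    counts = Counter(label for label in labels if label != -1)
    # Most frequent first; ties broken by label so the result is stable.
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    return dict(zip(ranked, ENGLISH_ORDER))

toy_guess = frequency_guess([5, 5, 5, 2, 2, 9, -1])
```

On a real text this guess will usually get only some of the frequent letters right, which is why an iterative refinement (bigrams, dictionary checks) would be needed on top of it.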
...Ok, that ended up becoming more involved than planned. I still want to look into some of the automated methods mentioned at the top of the page, but meanwhile, as a first step, let's use the key to substitute characters based on the rows of glyphs above.
So now instead we apply the current cipher solution to the ciphertext to generate candidate plaintext. The "noise" labels from DBSCAN are initially -1, which end up as NaNs here, so I'll replace those with "_". The immediate output is generally readable, with our minds filling in the words that have underscores in them; the underscores are the glyphs that the DBSCAN clustering labeled as "noise". I'm still not sure why the clustering had so much difficulty labeling those several main stretches of glyphs…
_ongrat_lations, hu_an, if you ha_e deciphe_ed this co_e then yo_ are more intelligent thanmost of your spe_ies, other readers wi_l presume that _h_s code is merely a _eco_ation, _utyou ha_e rightly deduced that it is a message from the fa_ry _eople, i ha_e planted this comm_nicationin order to see_ o_t our allies among the mud men, though most humans are dull wit second tedcreatures, there are e_cept_on_, you, for e_am_e, the _eason for your intelligence is thatyou ha_e _____ _________, _o you feel different from those around you, ha_e you ha_e e_er thou_htthat you do not _elong among the mu_ men, these feelin_s occu_ _ecause yo_r fa_ry personal_tyis as_e_ting itself, if you a_e pale s_inne_ an_ _pend an _nor_ina_e amount of time on thetoilet, then your fore_a_he_ w_s pro___ly a sun _hy, gas _enting dwarf, if your ton_ue _s o__uf_icient length to touch your nose, then you are de_cended f_om _he go_lin race, _o_linsha_e no eyel___, and _herefore must lic_ their eye_alls to _eep them moist, if yo_ _ream off_ight, and cannot _esist admi_in_ yourself in any _eflecti_e surface, _hen you are part spri_e,that _ain race of air_____ _______, __ ___ ____ ____ots, are you a computer _enius, _ew_tche__y information _nd technology, then, oh luc_y human, you ha_e _en___r _lood in you, i _m your_roth__, _______ ___ ____ ___ _____ ___, __ ___ _____ ____ ___su_e hours wal_in_ leafy trailso_ clim__n_ the hi_he_t pea_, are your ears a _it pointier than nor_al, then you _elo_g tothe elf tr__e, are yo_ prone to great fits of anger, do you ro_r and _awl at the slightestp_o_ocation, a_e you slightly thic_, then you_ fo_efathe_s were trolls, and you _ro_a_ly wont_e a_le to tr_nslate this any_ay, if yo_ ha_e ___ognised yourself as part fairy, then i ha_ea m_ss_on fo_ yo_,_as one of the people _t is you_ duty to protect the earth from those whowo_ld dest_oy it, _ecome one of a n__ ____ __ ___ ___ ___ ____ this _lanet _s _uch _s the _airyfol_, the_e _s one s__p_e rule use only wh_t you 
need, and use it wisely, follow this _a__m,an_ mothe_ nature wil_ he_l herself, if you wish to meet your fairy ances_ors and f_nd outhow you may _urther aid our cause, you must f_rst _omplete the ancient r_tual from the earththine power flows, _i_en thro_gh courte_y, __ ______ ___ ____, pluc_ thou the m_gi__ see_,where full moon, ancient oa_ and twiste_ w_ter meet, and _ury it far from where it was found,so ret__n your gift into _he ground, once you ha_e _one _his, we will come to you, _o now,_nd _egin your _uest, i shall re_eat this mes_age fo_ those humans whose fai_y intelligen_eis __rie_ a _it deep__ ____ _____,
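The substitution that produces text like the above amounts to a dictionary lookup per label. A minimal sketch, where `toy_key` is a made-up stand-in for the real label-to-letter key entered from the book:

```python
def decode(labels, key):
    """Translate cluster labels to characters; unknown/noise labels become '_'."""
    return "".join(key.get(label, "_") for label in labels)

# Hypothetical key for illustration only.
toy_key = {0: " ", 1: "h", 2: "i"}
toy_plaintext = decode([1, 2, 0, -1, 2], toy_key)
```

Because -1 is never a key in the mapping, noise glyphs fall through to the underscore without any special casing.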
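Next we can try some heuristic methods (just replacing each damaged word with a similar word from a dictionary; see code) to clean that up a bit - at least fixing most of the single-underscore dropout words, though a few replacements come out wrong (e.g. "_eco_ation" becomes "evacuation"):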
congratulations, human, if you have deciphered this code then you are more intelligent thanmost of your species, other readers will presume that _h_s code is merely a evacuation, outyou have rightly deduced that it is a message from the fairy people, i have planted this communicationin order to seer oat our allies among the mud men, though most humans are dull wit second tedcreatures, there are receipting, you, for reamer, the reason for your intelligence is thatyou have _____ _________, co you feel different from those around you, have you have ever thoughtthat you do not belong among the mud men, these feelings occur because your fairy personalityis asserting itself, if you ace pale s_inne_ any spend an _nor_ina_e amount of time on thetoilet, then your fore_a_he_ was pro___ly a sun shy, gas denting dwarf, if your tongue as ohsufficient length to touch your nose, then you are descended from she goblin race, _o_linshave no eyel___, and therefore must lick their eyeballs to beep them moist, if you rearm offought, and cannot desist admiring yourself in any _eflecti_e surface, then you are part sprier,that fain race of air_____ _______, __ ___ ____ ____ots, are you a computer genius, _ew_tche_by information and technology, then, oh lucky human, you have _en___r blood in you, i mm your_roth__, _______ ___ ____ ___ _____ ___, __ ___ _____ ____ ___su_e hours walling leafy trailsoh claiming the hairnet pear, are your ears a bit pointier than normal, then you eulogy tothe elf trier, are you prone to great fits of anger, do you roar and bawl at the slightestp_o_ocation, ace you slightly thick, then your forefathers were trolls, and you _ro_a_ly wontbe able to translate this anyway, if you have ___ognised yourself as part fairy, then i havea mission foe yo__as, one of the people wt is your duty to protect the earth from those whoworld destroy it, become one of a n__ ____ __ ___ ___ ___ ____ this planet as ouch as the dairyfoll, their as one s__p_e rule use only what you 
need, and use it wisely, follow this _a__m,any mother nature wile heal herself, if you wish to meet your fairy ancestors and fend outhow you may further aid our cause, you must frost complete the ancient ritual from the earththine power flows, _i_en through courtesy, __ ______ ___ ____, pluck thou the m_gi__ seer,where full moon, ancient oar and twister water meet, and bury it far from where it was found,so retain your gift into she ground, once you have bone chis, we will come to you, co now,and begin your guest, i shall reheat this message foe those humans whose fairy intelligenceis __rie_ a bit deeper ____ _____,
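One way such a similar-word replacement could work is with the standard library's `difflib.get_close_matches`. This is an assumption about the approach, not the notebook's actual code, and the tiny `toy_vocab` list stands in for a real dictionary:

```python
import re
from difflib import get_close_matches

def fix_word(word, vocabulary, cutoff=0.6):
    """Replace a word containing '_' with its closest dictionary match."""
    if "_" not in word:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    # Leave the word alone when nothing in the dictionary is close enough.
    return matches[0] if matches else word

def fix_text(text, vocabulary):
    """Apply fix_word to every run of letters/underscores in the text."""
    return re.sub(r"[a-z_]+", lambda m: fix_word(m.group(0), vocabulary), text)

toy_vocab = ["human", "have", "code", "species", "fairy"]
toy_fixed = fix_text("hu_an, if you ha_e deciphered this co_e", toy_vocab)
```

Similarity matching of this kind happily picks a wrong but similar-looking word when several candidates are close, which fits the kind of mistakes visible in the cleaned-up output above, and it leaves fully blanked-out words untouched.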
Still not perfect to be sure, but pretty neat to get that far with a mostly-automated method. I sure as heck wasn't going to sit down and substitute every one of those glyphs with pencil and paper. (...well ok I might have; I've been known to do it before!) Definitely curious to try some of those automated cryptanalysis routines mentioned at the top of the page sometime...