This started as a personal challenge to figure out what modern NLP could tell us about the Voynich Manuscript — without falling into translation speculation or pattern hallucination. I’m not a linguist or cryptographer. I just wanted to see if something as strange as Voynichese would hold up under real language modeling: clustering, POS inference, Markov transitions, and section-specific patterns.
Spoiler: it kinda did.
This repo walks through everything — from suffix stripping to SBERT embeddings to building a lexicon hypothesis. No magic, no GPT guessing. Just a skeptical test of whether the manuscript has structure that behaves like language, even if we don’t know what it’s saying.
The Voynich Manuscript remains undeciphered, with no agreed linguistic or cryptographic solution. Traditional analyses often fall into two camps: statistical entropy checks or wild guesswork. This project offers a middle path — using computational linguistics to assess whether the manuscript encodes real, structured language-like behavior.
/data/
  AB.docx                          # Full transliteration with folio/line tags
  voynichese/                      # Root word .txt files
  stripped_cluster_lookup.json     # Cluster ID per stripped root
  unique_stripped_words.json       # All stripped root forms
  voynich_line_clusters.csv        # Cluster sequences per line
/scripts/
  cluster_roots.py                 # SBERT clustering + suffix stripping
  map_lines_to_clusters.py         # Maps manuscript lines to cluster IDs
  pos_model.py                     # Infers grammatical roles from cluster behavior
  transition_matrix.py             # Builds and visualizes cluster transitions
  lexicon_builder.py               # Creates a candidate lexicon by section and role
  cluster_language_similarity.py   # (Optional) Compares clusters to real-world languages
/results/
  Figure_1.png                     # SBERT clusters (PCA reduced)
  transition_matrix_heatmap.png    # Markov transition matrix
  cluster_role_summary.csv
  cluster_transition_matrix.csv
  lexicon_candidates.csv
- Clustering of stripped root words using multilingual SBERT
- Identification of function-word-like vs. content-word-like clusters
- Markov-style transition modeling of cluster sequences
- Folio-based syntactic structure mapping (Botanical, Biological, etc.)
- Generation of a data-driven lexicon hypothesis table
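The actual clustering stage uses multilingual SBERT embeddings (see cluster_roots.py). As a dependency-free stand-in for the grouping idea only, not the real model, the sketch below clusters hypothetical stripped roots by character-bigram overlap. The words, the threshold, and the greedy single-link strategy are all illustrative assumptions.

```python
def bigrams(word: str) -> set[str]:
    """Character bigrams of a word, used here as a crude embedding stand-in."""
    return {word[i : i + 2] for i in range(len(word) - 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two bigram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(words, threshold=0.5):
    """Greedy single-link clustering: a word joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""
    clusters: list[list[str]] = []
    for w in words:
        for c in clusters:
            if jaccard(bigrams(w), bigrams(c[0])) >= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

# Hypothetical stripped roots; the real ones come from unique_stripped_words.json
roots = ["qok", "qoke", "chol", "chor", "dain"]
print(cluster(roots))  # -> [['qok', 'qoke'], ['chol', 'chor'], ['dain']]
```

Swapping in real SBERT vectors and k-means changes the similarity measure, not the overall shape of the step: embed, compare, group.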
One of the most important assumptions I made was about how to handle the Voynich words before clustering. Specifically: I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants. The goal was to isolate what looked like root forms that repeated with variation, under the assumption that these suffixes might be:
- Phonetic padding
- Grammatical particles
- Chant-like or mnemonic repetition
- Or… just noise
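The stripping step itself can be sketched as longest-match removal. The ending list below is an assumption built from the examples above (aiin, dy, chy), not the repo's full list:

```python
# Assumed suffix-like endings; the real set lives in scripts/cluster_roots.py
SUFFIXES = ("aiin", "dy", "chy")

def strip_suffix(word: str) -> str:
    """Strip the longest matching ending, but never empty the root entirely."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

print(strip_suffix("qokaiin"))  # -> qok
print(strip_suffix("chedy"))    # -> che
```

Note the `len(word) > len(suffix)` guard: a word that consists only of an ending (e.g. "aiin" itself) is left untouched rather than stripped to nothing.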
This definitely improved the clustering behavior — similar stems grouped more tightly, and the transition matrix showed cleaner structural patterns. But it’s also a strong preprocessing decision that may have:
- Removed actual morphological information
- Disguised meaningful inflectional variants
- Introduced a bias toward function over content
So it’s not neutral — it helped, but it also shaped the results.
If someone wants to fork this repo and re-run the pipeline without suffix stripping — or treat suffixes as their own token class — I’d be genuinely interested in the comparison.
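The "suffixes as their own token class" variant could look like the sketch below: instead of discarding an ending, emit it as a separate marked token so the downstream clustering and transition models see it. The suffix list and the `+` marker convention are hypothetical:

```python
SUFFIXES = ("aiin", "dy", "chy")  # assumed endings, as above

def split_tokens(word: str) -> list[str]:
    """Split a word into a root token plus a '+'-marked suffix token,
    instead of dropping the ending entirely."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], "+" + suffix]
    return [word]

line = ["qokaiin", "chedy", "dain"]
tokens = [t for w in line for t in split_tokens(w)]
print(tokens)  # -> ['qok', '+aiin', 'che', '+dy', 'dain']
```

Running the same pipeline over these token sequences would show whether the endings carry positional structure of their own, which is exactly the comparison invited above.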
- Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group
- Cluster 3 has high diversity and flexible positioning — likely a root content class
- Transition matrix shows strong internal structure, far from random
- Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)
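The Markov-style transition modeling behind these findings can be sketched as bigram counts over per-line cluster sequences, row-normalized into probabilities. The cluster IDs below are made up for illustration; the real sequences come from voynich_line_clusters.csv:

```python
from collections import Counter, defaultdict

def transition_matrix(lines):
    """Count cluster -> cluster transitions within each line,
    then normalize each row into a probability distribution."""
    counts = defaultdict(Counter)
    for seq in lines:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in counts.items()
    }

# Hypothetical cluster-ID sequences for three manuscript lines
lines = [[8, 3, 3, 5], [8, 3, 5], [3, 5, 8]]
matrix = transition_matrix(lines)
print(matrix[8])  # -> {3: 1.0}
```

A matrix far from uniform, like the one reported above, is what distinguishes structured sequences from random token shuffling.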
The working hypothesis: the manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.
# 1. Install dependencies
pip install -r requirements.txt
# 2. Run each stage of the pipeline
python scripts/cluster_roots.py
python scripts/map_lines_to_clusters.py
python scripts/pos_model.py
python scripts/transition_matrix.py
python scripts/lexicon_builder.py
- Cluster-to-word mappings are indirect — frequency estimates may overlap
- Suffix stripping is heuristic and may remove meaningful endings
- No semantic translation attempted — only structural modeling
This project was built as a way to learn — about AI, NLP, and how far structured analysis can get you without assuming what you’re looking at. I’m not here to crack the Voynich. But I do believe that modeling its structure with modern tools is a better path than either wishful translation or academic dismissal.
So if you’re here for a Rosetta Stone, you’re out of luck.
If you’re here to model a language that may not want to be modeled — welcome.
This project is open to extensions, critiques, and collaboration — especially from linguists, cryptographers, conlang enthusiasts, and computational language researchers.