Voynich Manuscript - Latin Texts
Sarah Goslee
2006-10-22
1 Introduction
Statistical analysis of an unknown text is useless without a known text to compare it to. Latin was the language of scientific writing throughout the medieval period. While vernacular manuscripts are known from the likely time period of the Voynich MS, Latin seems like a reasonable starting point. So as not to bias the comparison toward a particular authorial style or topic, I chose three Latin texts on different topics, all available online as full texts.
- l1: Apuleius - de Mundo
(http://www.gmu.edu/departments/fld/CLASSICS/apuleius.mundo.html)
- l2: TITI LVCRETI CARI DE RERVM NATVRA LIBER PRIMVS
(http://www.thelatinlibrary.com/lucretius1.html)
- l3: Isidorus Hispalensis - De natura rerum
(http://www.forumromanum.org/literature/isidorus_hispalensis/natura.html)
Before analysis I removed all punctuation and converted all capital letters to lowercase (EXCEPT those that were parts of Roman numerals). None of the texts were paginated, so I treated the paragraph as the basic unit. Spaces were replaced with "." and paragraph beginnings and endings were marked with "=". The texts did not contain linebreaks, so those were not marked. I also removed the lines of poetry from l1 and the chapter headings from l3, since I've only been using paragraph text from the VMS.
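The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `clean_paragraph`, the Roman-numeral test, and the use of ASCII punctuation only are all assumptions.

```python
import re
import string

# Assumption: a token is treated as a Roman numeral if it consists solely of
# the capitals I, V, X, L, C, D, M. The source does not say how numerals
# were actually identified.
ROMAN = re.compile(r"^[IVXLCDM]+$")

# Assumption: only ASCII punctuation needs to be stripped.
PUNCT = str.maketrans("", "", string.punctuation)


def clean_paragraph(paragraph: str) -> str:
    """Strip punctuation, lowercase non-numerals, join words with '.',
    and mark the paragraph start and end with '='."""
    tokens = []
    for raw in paragraph.split():
        word = raw.translate(PUNCT)
        if not word:
            continue  # token was punctuation only
        tokens.append(word if ROMAN.match(word) else word.lower())
    return "=" + ".".join(tokens) + "="


print(clean_paragraph("Et in Arcadia, ego. Caput XII!"))
```

A word written entirely in capitals that happens to use only those seven letters would also be kept in uppercase, so a real run would probably need a manual check of such cases.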
2 Ordination
Figure 1: Ordination of character frequencies for three Latin texts, by paragraph.
None of the three Latin texts show any clear internal groupings based on character frequencies (Fig. 1).
Figure 2: Ordination of word frequencies for Voynich and three Latin texts, by paragraph.
Ordination on word frequencies picks out some topical outliers in l2, but no other consistent groupings (Fig. 2).
Figure 3: Ordination of character frequencies for Voynich text, by paragraph.
Figure 4: Ordination of character frequencies for three pooled Latin texts, by paragraph.
Figure 5: Ordination of word frequencies for Voynich and three pooled Latin texts, by paragraph.
Figure 6: Ordination of word frequencies for Voynich and three pooled Latin texts, by paragraph.
Question: are the VMS sections more or less different than can be explained by topic?
The A/B language distinction in the VMS shows up reasonably well in an ordination on character frequencies (Fig. 3). No similar division is apparent in the Latin texts. An ordination on the character frequencies of all three texts together demonstrates that they all have very similar character frequencies (Fig. 4).
The A/B language distinction in the VMS shows up even more clearly in an ordination on word frequencies (Fig. 5). The Latin texts show some separation, but there is considerable overlap (Fig. 6).
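The ordination method and distance measure are not stated here, so the following is only a sketch of the input an ordination on character frequencies would need: one frequency profile per paragraph and a pairwise dissimilarity matrix. The choice of Bray-Curtis dissimilarity and all function names are assumptions.

```python
from collections import Counter


def char_frequencies(paragraph: str) -> dict:
    """Relative frequency of each character in one cleaned paragraph."""
    counts = Counter(paragraph)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def bray_curtis(p: dict, q: dict) -> float:
    """Bray-Curtis dissimilarity between two profiles (0 means identical).
    Assumption: the source does not name the distance it used."""
    keys = set(p) | set(q)
    total = sum(p.get(k, 0.0) + q.get(k, 0.0) for k in keys)
    if total == 0:
        return 0.0
    diff = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    return diff / total


def dissimilarity_matrix(paragraphs: list) -> list:
    """Square matrix of pairwise dissimilarities, one row per paragraph."""
    profiles = [char_frequencies(p) for p in paragraphs]
    return [[bray_curtis(a, b) for b in profiles] for a in profiles]


matrix = dissimilarity_matrix(["=ab=", "=ab=", "=cd="])
```

The resulting matrix would then be passed to an ordination routine of one's choice, such as principal coordinates analysis or non-metric multidimensional scaling. The same approach works for word frequencies if the profiles are built from word counts instead of character counts.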
3 Character Frequencies
The ordination analyses suggested little difference in character frequency patterns among the three Latin texts. The overall character frequencies confirm this: unlike the VMS subsets, the Latin texts differ very little from one another (Fig. 7).
Figure 7: Character distribution in the three Latin texts.
Figure 8: Word frequencies in the three Latin texts.
The word frequency distributions are not as extreme as that of set H of the VMS (Fig. 8). In l1 and l2, et has more than three times as many occurrences as the next most common word.
Ten most common words in l1 (percentage occurrence):
et  est  cum  in  ut  ad  quae  sunt  atque  ex
5   1    1    1   1   1   1     1     1      1
Ten most common words in l2 (percentage occurrence):
et  in  quod  quae  nec  res  rebus  esse  rerum  atque
2   2   1     1     1    1    1      1     1      1
Ten most common words in l3 (percentage occurrence):
et  in  est  autem  ex  quod  ad  a  cum  per
4   2   2    1      1   1     1   1  1    1
The same word is most frequent in all three Latin texts, and another appears in the top four in all texts, but the same is not true of the Voynich sections. Set H shows extreme skew, and no word is found in the top four in any two sets.
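A ranking like the ones above can be produced with a simple word counter. The sketch below assumes the cleaned paragraph format described in the Introduction ("." between words, "=" at paragraph boundaries); the function name `top_words` and the rounding to whole percentages are assumptions.

```python
from collections import Counter


def top_words(cleaned_paragraphs, n=10):
    """Return the n most common words with their share of all word
    occurrences, rounded to whole percentages."""
    counts = Counter()
    for para in cleaned_paragraphs:
        # Drop the '=' paragraph markers, then split on the '.' separators.
        counts.update(w for w in para.strip("=").split(".") if w)
    total = sum(counts.values())
    return [(word, round(100 * c / total)) for word, c in counts.most_common(n)]


ranking = top_words(["=et.in.et.est=", "=et.cum="], n=2)
```

Because the percentages are rounded, two words shown with the same value can still differ noticeably in their raw counts, which matters when comparing the most common word to the runner-up.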
Word length is longer in the Latin texts than in the VMS, and the mean number of occurrences of each word is lower (Table 1). The percentage of words that occur only once is similar in the Latin texts and VMS, although l1 is lower than the others.
| Text | Paras | Chars | Word occ. | Word length | Words | N occ. | Pct. Unique Words |
| l1 | 39 | 39250 | 6479 | 6.1 | 3404 | 1.9 | 22 |
| l2 | 69 | 40598 | 7266 | 5.6 | 2699 | 2.7 | 32 |
| l3 | 172 | 62546 | 10785 | 5.8 | 4010 | 2.7 | 32 |
Table 1: Text Characteristics
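The columns of Table 1 could be computed along the following lines. This is a sketch under assumptions rather than the original code: the N occ. column is consistent with word occurrences divided by distinct words (6479 / 3404 is about 1.9 for l1), but whether Chars includes the "." and "=" markers, exactly how word length was measured, and which denominator the percentage of once-only words uses are not stated. The function name `text_stats` and the dictionary keys are hypothetical.

```python
from collections import Counter


def text_stats(cleaned_paragraphs):
    """Summary statistics for a list of cleaned paragraphs."""
    words = []
    for para in cleaned_paragraphs:
        words.extend(w for w in para.strip("=").split(".") if w)
    counts = Counter(words)
    n_occ = len(words)        # total word occurrences
    n_words = len(counts)     # distinct words
    once = sum(1 for c in counts.values() if c == 1)
    return {
        "paras": len(cleaned_paragraphs),
        # Assumption: character count includes the '.' and '=' markers.
        "chars": sum(len(p) for p in cleaned_paragraphs),
        "word_occ": n_occ,
        "word_length": sum(len(w) for w in words) / n_occ if n_occ else 0.0,
        "words": n_words,
        "n_occ": n_occ / n_words if n_words else 0.0,
        # Assumption: once-only words as a percentage of distinct words.
        "pct_once": 100 * once / n_words if n_words else 0.0,
    }


stats = text_stats(["=et.in.et=", "=est.et="])
```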