Voynich Manuscript - Latin Texts

Sarah Goslee

2006-10-22

1  Introduction

Statistical analysis of an unknown text is useless without some other known text to compare it to. Latin was the language of scientific writing throughout the medieval period. While vernacular manuscripts are known from the likely time period of the Voynich MS, Latin seems like a reasonable starting point. So as not to bias the comparison toward a particular authorial style or topic, I chose three Latin texts on different topics, all available online as full texts. They are referred to below as l1, l2, and l3.
Before analysis I removed all punctuation and converted all capital letters to lowercase, except those that were part of Roman numerals. None of the texts was paginated, so I treated the paragraph as the basic unit. Spaces were replaced with "." and paragraph beginnings and endings were marked with "=". The texts did not contain line breaks, so those were not marked. I also removed the lines of poetry from l1 and the chapter headings from l3, since I've only been using paragraph text from the VMS.
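The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, the Roman numeral test (a token made up entirely of the capitals I, V, X, L, C, D, M) is an assumption, and so is the use of blank lines as the paragraph separator in the raw text.

```python
import re
import string

# Assumption: a token is treated as a Roman numeral if it consists only of
# the capital letters I, V, X, L, C, D, M. The rule actually used is not stated.
ROMAN = re.compile(r"^[IVXLCDM]+$")


def encode_paragraph(paragraph: str) -> str:
    """Strip punctuation, lowercase non-numeral tokens, join with '.', wrap in '='."""
    no_punct = paragraph.translate(str.maketrans("", "", string.punctuation))
    tokens = []
    for tok in no_punct.split():
        # Keep Roman numerals in capitals, lowercase everything else.
        tokens.append(tok if ROMAN.match(tok) else tok.lower())
    return "=" + ".".join(tokens) + "="


def encode_text(raw: str) -> list[str]:
    """Split on blank lines (assumed paragraph separator) and encode each paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", raw) if p.strip()]
    return [encode_paragraph(p) for p in paragraphs]
```

Under this heuristic a word written entirely in capitals that happen to be numeral letters (for example "DIXI") would be kept as a numeral, so a real text would need a manual check.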

2  Ordination

latin-charPCOcode.png
Figure 1: Ordination of character frequencies for three Latin texts, by paragraph.
None of the three Latin texts show any clear internal groupings based on character frequencies (Fig. 1).
latin-wordPCOcode.png
Figure 2: Ordination of word frequencies for three Latin texts, by paragraph.
Ordination on word frequencies picks out some topical outliers in l2, but no other consistent groupings (Fig. 2).
latin-sectionTypeCharA.png
Figure 3: Ordination of character frequencies for Voynich text, by paragraph.
latin-sectionTypeCharB.png
Figure 4: Ordination of character frequencies for three pooled Latin texts, by paragraph.
latin-sectionTypeWordA.png
Figure 5: Ordination of word frequencies for Voynich text, by paragraph.
latin-sectionTypeWordB.png
Figure 6: Ordination of word frequencies for three pooled Latin texts, by paragraph.
Question: are the VMS sections more or less different than can be explained by topic?
The A/B language distinction in the VMS shows up reasonably well in an ordination on character frequencies (Fig. 3). No similar division is apparent in the Latin texts. An ordination on the character frequencies of all three texts together demonstrates that they all have very similar character frequencies (Fig. 4).
The A/B language distinction in the VMS shows up even more clearly in an ordination on word frequencies (Fig. 5). The Latin texts show some separation, but there is considerable overlap (Fig. 6).
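As a rough illustration of the input to such an ordination, the sketch below builds a character frequency profile for each paragraph and a pairwise dissimilarity matrix between paragraphs. It covers only this first step, not the ordination itself. The choice of Bray-Curtis dissimilarity and the decision to ignore the "." and "=" markers are assumptions, since the measure actually used is not stated here, and the function names are hypothetical.

```python
from collections import Counter


def char_profile(paragraph: str) -> dict[str, float]:
    """Relative frequency of each character, ignoring the '.' and '=' markers (assumption)."""
    counts = Counter(ch for ch in paragraph if ch not in ".=")
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def bray_curtis(p: dict[str, float], q: dict[str, float]) -> float:
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared characters."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    den = sum(p.get(k, 0.0) + q.get(k, 0.0) for k in keys)
    return num / den if den else 0.0


def distance_matrix(paragraphs: list[str]) -> list[list[float]]:
    """Square matrix of pairwise dissimilarities between paragraph profiles."""
    profiles = [char_profile(p) for p in paragraphs]
    return [[bray_curtis(a, b) for b in profiles] for a in profiles]
```

A principal coordinates routine would then take this matrix as input. Replacing the character profile with a word profile gives the corresponding input for the word frequency ordinations.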

3  Character Frequencies

The ordination analyses suggested that there was little difference in character frequency patterns among the three Latin texts. The overall character frequencies confirm this: unlike the VMS subsets, the Latin texts differ very little from one another (Fig. 7).
latin-chardistcode.png
Figure 7: Character distribution in the three Latin texts.
latin-wordlinecode.png
Figure 8: Word frequencies in the three Latin texts.
The word frequency distributions are not as extreme as that of set H of the VMS (Fig. 8). In l1 and l2, "et" has more than three times as many occurrences as the next most common word.
Ten most common words in l1 (percentage occurrence):
   et   est   cum    in    ut    ad  quae  sunt atque    ex 
    5     1     1     1     1     1     1     1     1     1 
Ten most common words in l2:
   et    in  quod  quae   nec   res rebus  esse rerum atque 
    2     2     1     1     1     1     1     1     1     1 
Ten most common words in l3:
   et    in   est autem    ex  quod    ad     a   cum   per 
    4     2     2     1     1     1     1     1     1     1 
The same word ("et") is most frequent in all three Latin texts, and another ("in") appears in the top four in all texts, but the same is not true of the Voynich sections. Set H shows extreme skew, and no word is found in the top four in any two sets.
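Lists like the ones above can be produced with a simple word count. The following is a minimal sketch, assuming each paragraph is stored in the "=" and "." encoding described in the Introduction; the function name is hypothetical, and percentages are rounded to whole numbers to match the lists.

```python
from collections import Counter


def top_words(paragraphs: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most common words with their share of all word occurrences (rounded percent)."""
    # Drop the '=' paragraph markers and split on the '.' word separator.
    words = [w for p in paragraphs for w in p.strip("=").split(".") if w]
    counts = Counter(words)
    total = len(words)
    return [(w, round(100 * c / total)) for w, c in counts.most_common(n)]
```

Because of the rounding, words shown with the same percentage may still differ noticeably in raw count.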
Mean word length is greater in the Latin texts than in the VMS, and the mean number of occurrences of each word is lower (Table 1). The percentage of words that occur only once is similar in the Latin texts and VMS, although l1 is lower than the others.
    Paras  Chars  Word occ.  Word length  Words  N occ.  Pct. Unique Words
l1     39  39250       6479          6.1   3404     1.9                 22
l2     69  40598       7266          5.6   2699     2.7                 32
l3    172  62546      10785          5.8   4010     2.7                 32
Table 1: Text Characteristics
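The columns of Table 1 can be computed along the following lines. This is a sketch under stated assumptions rather than the original procedure: the function name and dictionary keys are hypothetical, characters are counted without the "." and "=" markers, and the percentage of unique words is taken as the share of distinct words that occur exactly once (the denominator is not stated in the text).

```python
from collections import Counter


def text_stats(paragraphs: list[str]) -> dict[str, float]:
    """Summary statistics for a text given as '='-delimited, '.'-separated paragraphs."""
    words = [w for p in paragraphs for w in p.strip("=").split(".") if w]
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    chars = sum(len(w) for w in words)  # assumption: markers are not counted
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "paras": len(paragraphs),
        "chars": chars,
        "word_occ": len(words),
        "word_length": chars / len(words),
        "words": len(counts),
        "n_occ": len(words) / len(counts),
        "pct_unique": 100 * hapax / len(counts),  # assumption: share of distinct words
    }
```

The tabulated values are consistent with these ratios: for l1, 39250 / 6479 is about 6.1 characters per word and 6479 / 3404 is about 1.9 occurrences per distinct word.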