Voynich Manuscript - Basic Analyses

Sarah Goslee

2006-10-22

1  Introduction

I am working from a modified EVA transcription of the VMS (Reeds/Landini's interlinear file in EVA, version 1.6e6 - http://www.ic.unicamp.br/~stolfi/voynich/98-12-28-interln16e6/). Starting with that file, I constructed a consensus version and fixed some apparent transcription errors by comparing the transcription with the high-resolution SIDs available courtesy of the Beinecke Rare Book and Manuscript Library of Yale University, the current owner of the manuscript. (I downloaded them from the links available at www.voynich.nu.)
I've also made some other changes to match my conception of the symbol set (in no particular order). A complete list of changes, suitable for running as a source file in vim, is contained in Appendix 1.

2  Ordination

First, I looked at the overall pattern of character frequencies (Fig. 1) and word frequencies (Fig. 2) across Currier languages and section types, using principal coordinates ordination on Euclidean distances of row-standardized frequencies. Labeling the same ordinations with section type as well as Currier language (Figs. 3, 4) showed that the language groups separate for both character and word frequencies (although not cleanly), but that the thematic (topical?) section groups separate more clearly for word frequencies. This suggests to me that character frequency is determined by A/B "encoding", while word frequencies are more closely related to theme.
voynich1-charABcode.png
Figure 1: Ordination of character frequencies in Currier A and Currier B pages (paragraph text only).
voynich1-wordABcode.png
Figure 2: Ordination of word frequencies in Currier A and Currier B pages (paragraph text only).
voynich1-charSeccode.png
Figure 3: Ordination of character frequencies labeled by section and Currier language (paragraph text only).
voynich1-wordSeccode.png
Figure 4: Ordination of word frequencies labeled by section and Currier language (paragraph text only).
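The ordination described in this section can be sketched as follows. This is a minimal illustration of classical principal coordinates analysis on Euclidean distances of row-standardized frequencies, not the code used to produce the figures. It assumes NumPy is available, the function names are my own, and the toy counts are invented purely for demonstration.

```python
import numpy as np

def row_standardize(counts):
    """Convert raw counts (pages x items) to within-page proportions."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def euclidean_distances(x):
    """Pairwise Euclidean distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def pcoa(dist, n_axes=2):
    """Classical principal coordinates analysis of a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances (Gower's transformation).
    gower = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_axes]        # keep the largest axes
    eigvals = np.clip(eigvals[order], 0.0, None)      # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(eigvals)

# Hypothetical character counts for three pages (invented, not real VMS data).
toy_counts = [[10, 5, 5], [8, 8, 4], [2, 9, 9]]
toy_dist = euclidean_distances(row_standardize(toy_counts))
toy_coords = pcoa(toy_dist)
```

Each row of `toy_coords` gives one page's position on the first two ordination axes; plotting those rows and labeling them by Currier language or section would give diagrams in the style of Figs. 1-4. Pages with no counts at all would need to be dropped first, since row standardization divides by the row total.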

3  Analysis of Distinct Subsets

I've decided to concentrate my initial analyses on three subsets of data chosen to be distinct based on supposed content and on separation in the ordination diagrams (especially word frequencies). Set R is the recipe section, set B is the balneological section, and set H is the herbal section in the right tail of the word-frequency PCO. This gives two thematically different sections in language B, and only one in language A. These groups are shown overlaid on the character ordination (Fig. 5) and the word ordination (Fig. 6).
voynich1-charSubcode.png
Figure 5: Ordination of character frequencies with group membership superimposed.
voynich1-wordSubcode.png
Figure 6: Ordination of word frequencies with group membership superimposed.
As suggested by the ordination diagram, the character distribution is very similar in sets B and R (Fig. 7). Set H differs in its frequencies of c and e, and also o, t and V.
voynich1-chardistcode.png
Figure 7: Character distribution in the three chosen subsets.
voynich1-wordlinecode.png
Figure 8: Word frequencies in the three chosen subsets.
The word frequency distribution is highly skewed in set H, with daVn being extremely common (Fig. 8). Neither set R nor set B shows such an extreme pattern.
The ten most common words in set H (by percent occurrence):
 daVn   col   cor  dain    dy   tcy qotcy    Co     s  tcor 
   16     3     3     2     2     2     2     2     2     2 
The ten most common words in set R:
   cedy     aVn  qokeey      ar      al qokeedy    daVn     cey    Cedy  qokaVn 
      2       2       2       2       2       2       2       1       1       1 
The ten most common words in set B:
     ol    Cedy    cedy  qokedy  qokain qokeedy     qol   qokal     Cey     cey 
      4       4       4       3       3       3       2       2       2       2 
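Lists like the three above can be produced with a simple tally. The sketch below assumes the words of a subset are already available as a Python list of tokens and that "percent occurrence" means the share of all word occurrences in the subset, rounded to a whole number. The function name and the sample tokens are illustrative only.

```python
from collections import Counter

def top_words(words, n=10):
    """Return the n most frequent words with their rounded percent occurrence."""
    counts = Counter(words)
    total = len(words)
    return [(word, round(100 * count / total))
            for word, count in counts.most_common(n)]

# Invented token list for demonstration (not taken from the transcription).
sample_tokens = ["daVn"] * 4 + ["col"] * 3 + ["cor"] * 2 + ["dy"]
sample_top = top_words(sample_tokens, n=3)
```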
The mean word length is slightly greater in set R than in set H (4.7 vs 4.0; Table 1). The mean number of times a word occurs also differs among the groups: it is lowest in H and highest in B, with R intermediate. The percentage of words appearing only once is the same in sets H and R, and a bit higher in set B.
     Pages   Chars   Word occ.   Word length   Words   N occ.   Pct. Unique Words
H       24    6817        1686           4.0     685      2.5                  30
R       23   50203       10746           4.7    3193      3.4                  30
B       19   27738        6236           4.4    1413      4.4                  35
Table 1: Subset Characteristics
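The per-subset quantities in Table 1 can be sketched as below. This is my reading of the columns rather than a record of how the table was computed: mean word length is taken over word occurrences (consistent with Chars divided by Word occ. in the table), N occ. is word occurrences divided by distinct words, and the percentage of unique words is assumed to be relative to the number of distinct words. The function name, dictionary keys, and sample tokens are hypothetical.

```python
from collections import Counter

def subset_summary(words):
    """Summarise a list of word tokens in the manner of Table 1."""
    counts = Counter(words)
    n_tokens = len(words)                       # word occurrences
    n_types = len(counts)                       # distinct words
    n_once = sum(1 for c in counts.values() if c == 1)
    return {
        "chars": sum(len(w) for w in words),
        "word_occ": n_tokens,
        "word_length": sum(len(w) for w in words) / n_tokens,
        "words": n_types,
        "n_occ": n_tokens / n_types,
        # Assumption: percentage is relative to distinct words, not occurrences.
        "pct_unique": 100.0 * n_once / n_types,
    }

# Invented token list for demonstration (not taken from the transcription).
sample_summary = subset_summary(["daVn", "daVn", "col", "dy"])
```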

4  Appendix 1: Changes to EVA 2006-09-13


%s/!\+//g
%s/,/\./g
%s/ch/c/g
%s/sh/C/g

%s/iiii/VV/g
%s/iii/W/g
%s/ii/V/g

%s/cfh/fc/g
%s/ckh/kc/g
%s/cph/pc/g
%s/cth/tc/g
%s/\.\+/\./g
%s/ \+/	-/
1
s/-/=/

" don't have a good way to do the next step
" currently, alternate
" /=$
" map <F2> j^f^Ilr=

%s/=f\(.*\)-/=F\1-/
%s/=k\(.*\)-/=K\1-/
%s/=p\(.*\)-/=P\1-/
%s/=t\(.*\)-/=T\1-/
g/%/d
%s/{.\{-}}/X/g

%s/h//g

%s/-/\./g
%s/ \./ -/
%s/\.$/-/

%s/f\([0-9]\)\([rv]\)/f00\1\2/
%s/f\([0-9][0-9]\)\([rv]\)/f0\1\2/
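For readers who do not use vim, the glyph-level substitutions at the start of the list above can be approximated in Python. This is only a partial sketch: it covers the replacements from the removal of `!` through the collapsing of repeated dots, applied in the same order, and leaves out the later line-structure and locator steps. The names `GLYPH_RULES` and `recode` are my own.

```python
import re

# (pattern, replacement) pairs mirroring the first block of substitutions, in order.
GLYPH_RULES = [
    (r"!+", ""),       # drop filler marks
    (r",", "."),       # treat uncertain spaces as word breaks
    (r"ch", "c"),
    (r"sh", "C"),
    (r"iiii", "VV"),
    (r"iii", "W"),
    (r"ii", "V"),
    (r"cfh", "fc"),
    (r"ckh", "kc"),
    (r"cph", "pc"),
    (r"cth", "tc"),
    (r"\.+", "."),     # collapse runs of dots
]

def recode(text):
    """Apply the glyph substitutions to a string of EVA text."""
    for pattern, replacement in GLYPH_RULES:
        text = re.sub(pattern, replacement, text)
    return text

example = recode("qokeedy.daiin,chedy")
```

Applied to common EVA words, this gives the forms used in the body of the text, for example `daiin` becomes `daVn` and `shedy` becomes `Cedy`.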