Voynich Manuscript - Basic Analyses
Sarah Goslee
2006-10-22
1 Introduction
I am working from a modified EVA transcription of the VMS (Reeds/Landini's interlinear file in EVA, version 1.6e6 - http://www.ic.unicamp.br/~stolfi/voynich/98-12-28-interln16e6/). Starting with that file, I constructed a consensus version and fixed some apparent transcription errors by comparing the transcription with the high-resolution SIDs available courtesy of the Beinecke Rare Book and Manuscript Library of Yale University, current owner of the manuscript. (I downloaded them from the links available at www.voynich.nu.)
I've also made some other changes to match my conception of the symbol set (in no
particular order):
- ch -> c because I think the c-ligature-c combination is one character.
- sh -> C I'm not sure of the meaning of the c-ligature-swirl-c - it seems to behave
very much like the regular c - but I am keeping it separate.
- paragraph initial gallows (fkpt) -> capitals FKPT - I think these have
some other meaning and are not part of the following word.
- All possible word breaks (,) have been marked as definite word breaks (.). I'm
continuing to check these as I proofread the transcription.
- = used to mark paragraph / label beginnings and endings, and - used to mark
other line beginning and endings.
- Internal - markers for intruding images have been replaced with word breaks (.)
- I've "untangled" the ch-gallows combinations, putting the gallows letter first.
I think they are scribal conceits, and this idea is supported by the appearance of
gallows-ch combinations at the beginning of paragraphs.
- "weirdos" are all denoted by X
A complete list of changes, suitable for running as a source file in vim, is contained in Appendix 1.
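As a rough illustration of how the simple character substitutions could be applied outside vim, here is a minimal Python sketch. It covers only the plain string replacements listed in Appendix 1 (not the paragraph-initial gallows, the line markers, or the weirdos), and it applies them in the same order as the appendix. The names SUBSTITUTIONS and normalize_eva are hypothetical and are not part of my actual workflow.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the simple substitutions
# in Appendix 1. Order matters: "iiii" must be handled before "iii" and "ii".
SUBSTITUTIONS = [
    (r"!+", ""),      # drop filler marks
    (r",", "."),      # possible word breaks become definite word breaks
    (r"ch", "c"),     # c-ligature-c treated as one character
    (r"sh", "C"),     # c-ligature-swirl-c kept separate
    (r"iiii", "VV"),
    (r"iii", "W"),
    (r"ii", "V"),
    (r"cfh", "fc"),   # untangle ch-gallows combinations, gallows first
    (r"ckh", "kc"),
    (r"cph", "pc"),
    (r"cth", "tc"),
    (r"\.+", "."),    # collapse repeated word breaks
]


def normalize_eva(text):
    """Apply the substitutions in order and return the modified text."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text


print(normalize_eva("daiin.chol,cthor"))
```

Keeping the rules in an ordered list makes it easy to compare them line by line against the vim commands.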
2 Ordination
First, I looked at the overall pattern of character frequencies (Fig. 1) and word frequencies
(Fig. 2) between Currier languages and section types, using principal coordinates ordination on
Euclidean distances of row-standardized frequencies. Labeling the same ordinations with
section types as well as Currier language (Figs. 3, 4) showed that the language groups separate
for both character and word frequencies (although not cleanly), but that thematic (topical(?)
section) groups separate more clearly for word frequencies. This suggests to me that character
frequency is determined by A/B "encoding", while word frequencies are more closely related to theme.
Figure 1: Ordination of character frequencies in Currier A and Currier B pages (paragraph text only).
Figure 2: Ordination of word frequencies in Currier A and Currier B pages (paragraph text only).
Figure 3: Ordination of character frequencies labeled by section and Currier language (paragraph text only).
Figure 4: Ordination of word frequencies labeled by section and Currier language (paragraph text only).
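For readers who want to try a similar ordination, the following is a minimal sketch of principal coordinates ordination (classical scaling) on Euclidean distances of row-standardized frequencies. It assumes NumPy is available and that the input is a pages-by-characters (or pages-by-words) count matrix. The function names and the toy counts are hypothetical, and this is not the software used to produce the figures above.

```python
import numpy as np


def row_standardize(counts):
    """Divide each row (page) by its total so rows sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def pcoa(freqs, n_axes=2):
    """Classical principal coordinates analysis on Euclidean distances."""
    # Squared Euclidean distances between all pairs of rows.
    diff = freqs[:, None, :] - freqs[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    # Double-center the squared distance matrix (Gower centering).
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    # Eigen-decompose and keep the axes with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale


# Toy example: four "pages" with counts of three symbols (made-up numbers).
toy_counts = [[10, 5, 1], [8, 6, 2], [1, 2, 12], [2, 1, 10]]
coords = pcoa(row_standardize(toy_counts))
print(coords.shape)
```

With Euclidean distances this ordination is equivalent to a principal components analysis of the centered frequencies, which gives one way to sanity-check the output.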
3 Analysis of Distinct Subsets
I've decided to concentrate my initial analyses on three subsets of data chosen to be distinct based
on supposed content and on separation in the ordination diagrams (especially word frequencies).
Set R is the recipe section, set B is the balneological section, and
set H is the herbal section in the right tail of the word-frequency PCO. This
gives two thematically different sections in language B, and only one in
language A. These groups are shown overlaid on the character ordination (Fig. 5)
and the word ordination (Fig. 6).
Figure 5: Ordination of character frequencies with group membership superimposed.
Figure 6: Ordination of word frequencies with group membership superimposed.
As suggested by the ordination diagram, the character distribution is very similar in sets
B and R (Fig. 7). Set H differs in its frequencies of c and e, and also o, t and V.
Figure 7: Character distribution in the three chosen subsets.
Figure 8: Word frequencies in the three chosen subsets.
Word frequency distribution is extremely skewed in set H, with
daVn being extremely common (Fig. 8). Neither set R nor set B shows such an extreme pattern.
The ten most common words in set H (by percent occurrence):
Word | daVn | col | cor | dain | dy | tcy | qotcy | Co | s | tcor |
Pct. | 16 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
The ten most common words in set R:
Word | cedy | aVn | qokeey | ar | al | qokeedy | daVn | cey | Cedy | qokaVn |
Pct. | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 |
The ten most common words in set B:
Word | ol | Cedy | cedy | qokedy | qokain | qokeedy | qol | qokal | Cey | cey |
Pct. | 4 | 4 | 4 | 3 | 3 | 3 | 2 | 2 | 2 | 2 |
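A top-ten list of this kind can be produced with a few lines of Python. The sketch below assumes the text uses . for word breaks and - or = for line and paragraph markers, as in my modified transcription, and rounds percentages to whole numbers. The function names and the sample string are hypothetical.

```python
import re
from collections import Counter


def split_words(text):
    """Split on word breaks (.), line/paragraph markers (- and =), and whitespace."""
    return [w for w in re.split(r"[.\-=\s]+", text) if w]


def top_words(text, n=10):
    """Return the n most common words with their percent occurrence."""
    words = split_words(text)
    counts = Counter(words)
    total = len(words)
    return [(word, round(100 * count / total))
            for word, count in counts.most_common(n)]


# Made-up sample, not a real line from the manuscript.
sample = "=daVn.col.daVn-\n-cor.daVn="
print(top_words(sample, 2))
```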
The mean word length is slightly greater in set R than in set H (4.7 vs 4.0; Table 1).
The mean number of times a word occurs differs among the groups as well: lowest in H (2.5),
highest in B (4.4), with R intermediate (3.4). The percentage of words appearing only once is
the same in sets H and R (30%), and a bit higher in set B (35%).
Set | Pages | Chars | Word occ. | Mean word length | Distinct words | Mean occ. per word | Pct. words occurring once |
H | 24 | 6817 | 1686 | 4.0 | 685 | 2.5 | 30 |
R | 23 | 50203 | 10746 | 4.7 | 3193 | 3.4 | 30 |
B | 19 | 27738 | 6236 | 4.4 | 1413 | 4.4 | 35 |
Table 1: Subset Characteristics
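The columns of Table 1 are simple summaries of a word list, and a minimal Python sketch of the arithmetic is given below. One point is an assumption on my part here: the percentage of words occurring once is computed relative to the number of distinct words, which may or may not match how the table value was derived. The function name, the dictionary keys, and the sample words are hypothetical.

```python
from collections import Counter


def subset_stats(words):
    """Summarize a list of word occurrences for one subset of pages."""
    counts = Counter(words)
    occurrences = len(words)   # total word occurrences
    distinct = len(counts)     # number of distinct words
    once = sum(1 for c in counts.values() if c == 1)
    return {
        "word_occ": occurrences,
        "mean_length": sum(len(w) for w in words) / occurrences,
        "distinct": distinct,
        "mean_occ": occurrences / distinct,
        # Assumption: denominator is the number of distinct words.
        "pct_once": 100 * once / distinct,
    }


# Made-up sample, not real manuscript data.
print(subset_stats(["daVn", "col", "daVn", "cor"]))
```

As a check against the table, set H has 1686 word occurrences and 685 distinct words, and 1686 / 685 is about 2.5, matching the mean occurrences per word.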
4 Appendix 1: Changes to EVA 2006-09-13
%s/!\+//g
%s/,/\./g
%s/ch/c/g
%s/sh/C/g
%s/iiii/VV/g
%s/iii/W/g
%s/ii/V/g
%s/cfh/fc/g
%s/ckh/kc/g
%s/cph/pc/g
%s/cth/tc/g
%s/\.\+/\./g
%s/ \+/ -/
1
s/-/=/
" don't have a good way to do the next step
" currently, alternate
" /=$
" map <F2> j^f^Ilr=
%s/=f\(.*\)-/=F\1-/
%s/=k\(.*\)-/=K\1-/
%s/=p\(.*\)-/=P\1-/
%s/=t\(.*\)-/=T\1-/
g/%/d
%s/{.\{-}}/X/g
%s/h//g
%s/-/\./g
%s/ \./ -/
%s/\.$/-/
%s/f\([0-9]\)\([rv]\)/f00\1\2/
%s/f\([0-9][0-9]\)\([rv]\)/f0\1\2/
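The last two commands zero-pad folio numbers to three digits (for example f1r becomes f001r). For comparison, a minimal Python sketch of the same idea follows. It assumes folio labels have the form f, one to three digits, then r or v, as the vim patterns suggest. Unlike the vim commands, which change only the first match on each line, this version pads every match in the string. The function name pad_folio is hypothetical.

```python
import re


def pad_folio(text):
    """Zero-pad folio numbers such as f1r or f12v to three digits."""
    def _pad(match):
        return "f" + match.group(1).zfill(3) + match.group(2)
    return re.sub(r"f(\d{1,3})([rv])", _pad, text)


print(pad_folio("f1r"), pad_folio("f12v"), pad_folio("f103r"))
```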