I’m often asked how my fonts compare to Times New Roman or other common system fonts in terms of how many words they fit per page.
Triplicate, for instance, fits as many words per page as Courier because they’re both monospaced fonts, and thus both conform to the typewriter convention that each character is 6/10 of an em wide. (Recall that the em is the height of the notional bounding box of the font. It’s always scaled to the current point size, so at 12 point, 6/10 of an em = 7.2 points.)
But that’s not true of proportional fonts like Equity and Times New Roman, because their character widths vary. I know from experience that Equity and Times New Roman fit roughly the same number of words per page. But since point size alone doesn’t capture their comparative copyfitting characteristics, could we quantify this another way? Let’s call it the Comparative Copyfitting Factor (CCF).
A simple answer might be to type out every character of each font and measure the length of these two character sets. Then the CCF would be the ratio of these two measurements. The problem is that most of these characters are seldom used in body text. We’d end up measuring a lot of characters that don’t have any bearing on copyfitting.
We might then observe that most body text is made of lowercase letters. What if we just measured the lowercase alphabets? A better idea. But still flawed, because every letter would appear once in our measure, and thus be weighted equally. In the real world, letter frequency varies. Ideally, we would use a sample string that was correctly weighted for statistical letter frequency.
The issue of letter frequency arises repeatedly in the history of typography. Letters that are more common need special treatment. For instance, when type was cast in metal, it wouldn’t have made sense for type founders to furnish an equal number of every letter. Instead, fonts were shipped with more copies of the common letters and fewer of the uncommon ones. This is also why the left two columns of a Linotype keyboard were ETAOIN and SHRDLU: grouping the common letters made it possible to operate the machine faster.
Measuring letter frequency with a computer, of course, is much easier than by hand. The largest such effort is probably the one conducted by Peter Norvig, who searched through a massive trove of digitized books, counting 3,563,505,777,820 letter occurrences. For instance, he found 445.2 billion occurrences of the most common letter (“e”, naturally) and at the other end, 3.2 billion occurrences of “z”.
E 445.2bn 12.49%
T 330.5bn 9.28%
A 286.5bn 8.04%
O 272.3bn 7.64%
I 269.7bn 7.57%
N 257.8bn 7.23%
S 232.1bn 6.51%
R 223.8bn 6.28%
H 180.1bn 5.05%
L 145.0bn 4.07%
D 136.0bn 3.82%
C 119.2bn 3.34%
U 97.3bn 2.73%
M 89.5bn 2.51%
F 85.6bn 2.40%
P 76.1bn 2.14%
G 66.6bn 1.87%
W 59.7bn 1.68%
Y 59.3bn 1.66%
B 52.9bn 1.48%
V 37.5bn 1.05%
K 19.3bn 0.54%
X 8.4bn 0.23%
J 5.7bn 0.16%
Q 4.3bn 0.12%
Z 3.2bn 0.09%
How can we make a single string that captures this frequency information? We don’t need to type out trillions of letters. Instead, let’s normalize the occurrence counts by dividing all of them by 3.2 billion. That means “z” will appear once in our sample string, and the other letters will appear proportionately more often, in line with their greater frequency. If I’m doing this right, we get this string of 1187 characters:
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeettttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiinnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnsssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhlllllllllllllllllllllllllllllllllllllllllllllllldddddddddddddddddddddddddddddddddddddddddddddccccccccccccccccccccccccccccccccccccccccuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuummmmmmmmmmmmmmmmmmmmmmmmmmmmmmfffffffffffffffffffffffffffffpppppppppppppppppppppppppggggggggggggggggggggggwwwwwwwwwwwwwwwwwwwwyyyyyyyyyyyyyyyyyyyybbbbbbbbbbbbbbbbbbvvvvvvvvvvvvvkkkkkkxxxjjqz
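If you want to reproduce this, here’s a minimal Python sketch. Because it starts from the rounded billion-counts in the table above, its total won’t exactly match my 1187 characters, but the shape of the string is the same, and for copyfitting purposes that’s what matters:

    # Rounded letter counts from Norvig's data, in billions of occurrences.
    counts = {
        "e": 445.2, "t": 330.5, "a": 286.5, "o": 272.3, "i": 269.7, "n": 257.8,
        "s": 232.1, "r": 223.8, "h": 180.1, "l": 145.0, "d": 136.0, "c": 119.2,
        "u": 97.3, "m": 89.5, "f": 85.6, "p": 76.1, "g": 66.6, "w": 59.7,
        "y": 59.3, "b": 52.9, "v": 37.5, "k": 19.3, "x": 8.4, "j": 5.7,
        "q": 4.3, "z": 3.2,
    }

    # Scale so the least common letter ("z") appears exactly once,
    # then repeat every other letter in proportion to its frequency.
    least = min(counts.values())
    letters = "".join(ch * round(n / least) for ch, n in counts.items())
    print(len(letters))   # won't exactly match 1187 with these rounded inputs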
In body text, however, we also need to consider word spaces, which occur frequently and take up non-negligible space. According to Norvig, the average word length is 4.79 letters. So in our sample of 1187 letters, we can expect to see 1187 / 4.79 ≈ 248 word spaces. We add those to the sample string above, for a total of 1435 characters. Measuring this relatively short string should give us a very good estimate of the copyfitting characteristics of any font.
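Continuing the sketch above, the word spaces are one more step:

    # Add one word space per 4.79 letters (Norvig's average word length).
    spaces = round(len(letters) / 4.79)   # ≈ 248 for a 1187-letter string
    sample = letters + " " * spaces       # ≈ 1435 characters in total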
(By the way, we’re deliberately ignoring capital letters, punctuation, etc. on the idea that these characters occur so infrequently in body text that they won’t meaningfully affect CCF.)
Here’s a selection of MB fonts and common system fonts, along with the length of our sample string (denominated in ems) and then sorted in decreasing order:
Bookman 701.640
Century Schoolbook 661.401
Book Antiqua 632.806
Arial 629.010
Valkyrie 628.063
Charter 625.379
Century Supra 613.393
Bell 581.680
Calibri 579.634
Times New Roman 570.381
Equity 565.521
Goudy Oldstyle 562.483
Garamond 555.012
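These measurements aren’t hard to make yourself. Continuing the Python sketch, here’s one way to do it with the fontTools library: add up the advance width of every character in the sample string and divide by the font’s units per em. (The font file names below are placeholders, and I’m not claiming this is exactly how the table above was produced.)

    from fontTools.ttLib import TTFont

    def string_width_in_ems(font_path, text):
        """Add up the advance width of every character in `text`, in ems."""
        font = TTFont(font_path)
        upm = font["head"].unitsPerEm   # design units per em (commonly 1000 or 2048)
        cmap = font.getBestCmap()       # maps Unicode code points to glyph names
        hmtx = font["hmtx"]             # horizontal metrics: (advance width, left side bearing)
        return sum(hmtx[cmap[ord(ch)]][0] for ch in text) / upm

    # Placeholder file names; substitute the fonts you want to compare.
    for path in ["TimesNewRoman.ttf", "Equity-Regular.ttf"]:
        print(path, round(string_width_in_ems(path, sample), 3))

This ignores kerning, which is fine for a rough comparison like this one.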
This reveals something that every college student and lawyer has probably figured out on their own: Bookman is a hog; Garamond is the most svelte. But we wanted a Comparative Copyfitting Factor. So let’s deem Times New Roman to be 1.0 and express the others relative to this benchmark:
Bookman 1.2301
Century Schoolbook 1.1595
Book Antiqua 1.1094
Arial 1.1027
Valkyrie 1.1011
Charter 1.096
Century Supra 1.075
Bell 1.019
Calibri 1.016
Times New Roman 1.0
Equity 0.9914
Goudy Oldstyle 0.9861
Garamond 0.9730
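The arithmetic here is nothing more than division: each em-length from the first table divided by Times New Roman’s. In Python:

    # Em-lengths from the table above; CCF = length relative to Times New Roman.
    lengths = {
        "Bookman": 701.640, "Century Schoolbook": 661.401, "Book Antiqua": 632.806,
        "Arial": 629.010, "Valkyrie": 628.063, "Charter": 625.379,
        "Century Supra": 613.393, "Bell": 581.680, "Calibri": 579.634,
        "Times New Roman": 570.381, "Equity": 565.521, "Goudy Oldstyle": 562.483,
        "Garamond": 555.012,
    }
    benchmark = lengths["Times New Roman"]
    for name, length in lengths.items():
        print(f"{name:<20} {length / benchmark:.4f}")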
Truth in advertising: Equity’s CCF of 0.9914 means it fits essentially the same number of words per page as Times New Roman. Meanwhile, many lawyers remain enamored of Century Schoolbook, even though it has a rather bad CCF of 1.1595.
Another way of reading this chart is to think of the CCF as the point size multiplier needed for two fonts to match. For instance, a page of 12-point Bookman will contain roughly the same number of words as Times New Roman at 14.76 point (because Bookman’s CCF is 1.23, and 12 × 1.23 = 14.76).
I resisted putting a chart like this in Typography for Lawyers because of the potential for abuse. My advice has always been to interpret court rules in good faith, which means noticing that these rules are meant to promote legibility and fairness. As for lawyers who undermine these goals by deliberately picking the lowest-CCF font time after time: are you new here?
Still, point-size shenanigans could be eliminated entirely by denominating document length in terms of word count rather than page count; page limits are a technique best left behind with the rest of the typewriter era.
PS to supernerds: you might wonder whether it’s possible to construct a shorter string that produces statistically equivalent CCF results. (Yes—I think it can be at least 90% shorter, based on my experiments.) Also, is there a reasonable computational technique for converting one of these sample strings into a readable list of words (that is, an anagram)?