[I wrote this originally as a thread on Twitter, but since I’ve stopped using that platform, I’m retroactively publishing it here, slightly reformatted.]
A thread about Wordle, letter frequencies, and how the choice of corpus matters for textual analysis.
I’ve only been playing Wordle for about two weeks, so I can’t claim any expertise, but it tickled some old interests of mine and then I got a little obsessed. /1
Wordle’s rules are similar to those of one of my favorite games as a kid, Mastermind, except Wordle requires that guesses and answers be words rather than colored pegs. This is a big difference, which I’ll say more about below! /2
It’s clear that the starting guess in Wordle is important and there are articles talking about picking good ones. (E.g., The Best Starting Words to Win at Wordle.) /3
The general idea is you want to use common letters for two reasons: one, they’re more likely to be in the answer, and two, if you learn they aren’t in the answer, you can eliminate a large swath of words. /4
That letter frequency analysis works in Wordle is one reason it’s different from Mastermind: there’s no a priori distribution of color pegs which can lead to a useful strategy. /5
Back in third or fourth grade, I got a copy of Herbert Zim’s Codes & Secret Writing from the Scholastic Book Club and its list of most common letters – ETOANRISH, in order – is cemented in my brain. (I have a weaker sense that DLU come next.) /6
Based on that, when I started playing, my first pair of words would be ORATE and CHINS. They get all of Zim’s first nine, with three vowels in the first word and a “C” thrown in. Not too bad. /7
I switched for a few games to HATER/BISON for reasons that don’t make much sense (“B” is rarer than “C”), but missed having the three vowels in the first word. /8
Then I got a little empirical and asked myself “Is Herbert Zim’s 1966 frequency table still right?” Language doesn’t change very quickly, but this is 2022 and it’s easy to check for yourself. /9
I found a free sample corpus of news articles on the web that could be downloaded and did a little scripting. For that corpus, the first half of the alphabet by frequency is ETAOINSRHLDCU. /10
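For anyone who wants to try this kind of count, here is a minimal Python sketch. It is not the original script: the function name `letter_order` and the sample sentence are made up for illustration, and it assumes only the letters A to Z matter, with ties broken alphabetically.

```python
from collections import Counter


def letter_order(text: str) -> str:
    """Return the letters A-Z found in `text`, most frequent first."""
    # Upper-case everything and ignore digits, punctuation, and whitespace.
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    # Sort by descending count; break ties alphabetically for stable output.
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


# Hypothetical sample text; a real run would read a whole corpus from disk.
sample = "The editor read the article and then rewrote the headline."
print(letter_order(sample)[:13])
```

Taking the first 13 characters of the result gives "the first half of the alphabet by frequency" for whatever text is passed in.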
(By the way, Peter Norvig, a friend, colleague, and mentor of mine across two jobs and my boss in my early years at Google, did a similar analysis based on Google Books with similar results.) /11
ETAOINSRHLDCU is pretty similar to Zim’s order – with I and S moving up a little and R moving down. Nothing too interesting yet. /12
But then I started thinking about Wordle specifically. The frequency of letters in general text is not that interesting for the game, since Wordle only uses five-letter words. /13
If we restrict to five-letter words in the same corpus, we get a very different distribution: EARTSOIHLNDUC. Vowels (except U) are relatively more popular and T and N drop a lot. /14
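A rough sketch of that restriction in Python, under a few assumptions of mine: words are maximal runs of the letters a to z (so "don't" splits into "don" and "t"), every occurrence of a word in the text counts, and the name `letter_order_for_length` is hypothetical.

```python
import re
from collections import Counter


def letter_order_for_length(text: str, length: int = 5) -> str:
    """Rank letters using only the words of `text` that have `length` letters."""
    counts = Counter()
    # Assumption: a "word" is a maximal run of ASCII letters.
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) == length:
            # Each occurrence contributes, so frequent words weigh more.
            counts.update(word)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ranked).upper()


# Hypothetical sample; only "there" (twice) and "about" have five letters.
print(letter_order_for_length("There, there: it is about time to go."))
```

Because repeated words are counted every time they appear, this is still a frequency over running text, not over a list of distinct words.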
Why T and N? Well, here are the most common ten words in this article corpus, with counts:
96786 the
44855 to
44855 of
41891 and
35282 a
32805 in
17630 that
16638 is
16080 for
13759 on
/15
None of those words has five letters, but because they’re so common – they’re the “head” of a “long-tailed” (or Zipfian) distribution – the letters in them strongly affect the frequencies of letters in English text. /16
To give one example, looking at the full article corpus but dropping just the word “the,” the most common letters become EATOINSRLDHCU: without “the,” T drops from 2nd to 3rd and H from 9th to 11th. /17
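The "drop one word and recount" experiment can be sketched like this. The function name `letter_order_without` and its `excluded` parameter are hypothetical, and the tokenization (runs of a to z) is my assumption, not necessarily what was used on the article corpus.

```python
import re
from collections import Counter


def letter_order_without(text: str, excluded=frozenset({"the"})) -> str:
    """Rank letters by frequency after discarding every word in `excluded`."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in excluded:
            counts.update(word)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ranked).upper()


# Hypothetical sample: compare the ranking with and without "the".
sample = "The cat sat on the mat and the hat"
print(letter_order_without(sample, frozenset()))
print(letter_order_without(sample))
```

Running both calls on the same corpus shows directly how much a single very common word moves its letters up the table.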
Definite articles like “the” (or “le”/“la” in French) are very common words. The effect in English is probably larger than other languages because it’s non-gendered, so there’s only one such word. (“a”/“an” are split, but not due to gender.) /18
One anecdote about the outsized effect of “the” on English… /19
In its early days, Google didn’t index “the”: it was omitted from both docs and queries. Circa 2000 – before my time – they started indexing it, but giving up that reduction in index size cost adding ~25 machines to each thousand-machine serving cluster for the extra data. /20
Getting back to Wordle, all these issues of word frequency actually make the letter frequency tables much less relevant, because Wordle is drawing from some list of five-letter words where the frequency of those words (probably) doesn’t matter. /21
Why not? My assumption is that Josh Wardle is choosing words that are common enough that people don’t say “Is that really a word?” but doesn’t care that COAST is twice as frequent as METAL in a corpus of news articles. Either COAST or METAL is a fine Wordle word. /22
This means computing letter frequencies on general text – even if restricted to just the five-letter words – will probably mislead you if you’re trying to guess words in Wordle. /23
These are the ten most frequent five-letter words in the corpus I’m using:
4994 their
3679 which
3609 there
3430 about
3218 would
2698 after
2292 other
2155 first
2065 years
1874 could
Lots of THs! /24
So, instead of using a corpus of articles, a simple wordlist with no frequency information should give more useful insights. I’m using the /usr/share/dict/words that comes with macOS and dropping words that contain capital letters (proper nouns) or other symbols. /25
There are 8497 five-letter words on that list meeting my criteria. But, many of the words wouldn’t be fun if they were the answer in Wordle: AALII, TOYON, CHORT, LENIS, SERUT, etc. If I had to guess, at least 50% would be too rare for the game. /26
But, which 50%? I thought about intersecting the list with words that passed some frequency threshold in the news corpus, but this is just an approximate exercise, so I went with the list as is, TOYONs and all. /27
The most common half of the alphabet in that list is AEROISTLNUYCD. That’s quite different from Zim’s list. A is more frequent than E! T and H are way down in popularity! (H drops to #14.) /28
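Here is a minimal sketch of the word-list version: filter a dictionary file down to five-letter, all-lowercase entries, then rank letters over the distinct words. The function names are hypothetical, and counting every occurrence of a letter within a word (so both O's in TOYON count) is my assumption.

```python
from collections import Counter


def five_letter_words(lines):
    """Keep entries that are exactly five lowercase ASCII letters."""
    kept = set()
    for line in lines:
        word = line.strip()
        # Rejects capitals (proper nouns), apostrophes, hyphens, and accents.
        if len(word) == 5 and word.isascii() and word.isalpha() and word.islower():
            kept.add(word)
    return sorted(kept)


def letter_order(words):
    """Rank letters by how often they appear across a list of distinct words."""
    counts = Counter()
    for word in words:
        counts.update(word)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ranked).upper()


def words_from_file(path="/usr/share/dict/words"):
    """Load and filter a dictionary file that has one entry per line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return five_letter_words(handle)
```

With a dictionary file available, `letter_order(words_from_file())[:13]` gives the top half of the alphabet for that list. Different systems ship different word lists, so the count and the ordering will vary.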
Based on this list, I revised my opening Wordle guess to AROSE. Unless I have some insights based on the results of my first guess, I follow that with UNTIL (or UNLIT). /29
Can one do better? Well, machines can, especially if they know the word list (or a reasonable superset). How? By exhaustively keeping track of which words are impossible once the results are known from previous guesses. /30
(Spoiler alert: what follows are examples from a few days ago. By the time you’re reading this, if you’re playing Wordle, you’ll have safely seen them. But maybe you’re somehow saving Wordle games for later?) /31
For example, if today’s word were POLAR and you guessed AROSE, the YELLOW/YELLOW/YELLOW/GREY/GREY result only matches 117 words from my dictionary. /32
Similarly, if today’s word were SHIRE and you guessed AROSE, you’d get a GREY/YELLOW/GREY/YELLOW/GREEN that would match only 25 words on that list. /33
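The bookkeeping behind those two examples can be sketched as a scoring function plus a filter. This is an illustration, not Wordle's actual code: the names `score` and `consistent` are hypothetical, feedback is written as G (green), Y (yellow), and - (grey), and repeated letters are handled with the usual convention that greens are assigned first and yellows are limited by the answer's remaining letters.

```python
from collections import Counter


def score(guess: str, answer: str) -> str:
    """Return Wordle-style feedback for `guess` against `answer`."""
    result = ["-"] * len(guess)
    # Answer letters not matched in place; yellows are drawn from this pool.
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] != "G" and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)


def consistent(words, guess: str, feedback: str):
    """Keep only the words that would have produced `feedback` for `guess`."""
    return [w for w in words if score(guess, w) == feedback]


print(score("arose", "polar"))  # YYY--
print(score("arose", "shire"))  # -Y-YG
```

Calling `consistent` on a full word list with AROSE and one of those feedback strings yields the set of words still possible after that guess.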
For a human, picking what word would distinguish best among those 117 or 25 words is a hard task, but a computer can just try all the words and see what reduces the number of possibilities most for the next round. It turns out, the space of words is very sparse. /33 [Oops! Two 33s!]
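One way a program might "try all the words" is to group the remaining candidates by the feedback each guess would produce and prefer the guess whose largest group is smallest. That worst-case criterion is my assumption about what "reduces the number of possibilities most" means; averaging over groups would be another reasonable choice. All names here are hypothetical.

```python
from collections import Counter


def score(guess: str, answer: str) -> str:
    """Feedback as G (green), Y (yellow), - (grey); greens assigned first."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] != "G" and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)


def worst_case(guess: str, candidates) -> int:
    """Size of the largest group of candidates sharing one feedback pattern."""
    buckets = Counter(score(guess, answer) for answer in candidates)
    return max(buckets.values())


def best_guess(guesses, candidates) -> str:
    """Pick the guess with the smallest worst case; ties go alphabetically."""
    return min(guesses, key=lambda g: (worst_case(g, candidates), g))


# Toy three-letter example: "cab" separates all three candidates, "cat" does not.
print(best_guess(["cat", "cab"], ["bat", "cat", "hat"]))  # cab
```

The toy example also shows that a good guess need not be a possible answer: CAB is not among the candidates but tells them apart completely.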
This gets at another difference between Wordle and Mastermind. In Mastermind, any combination of colored pegs is acceptable. In Wordle, it’s only (roughly, at most) 8497 words out of 26^5 = 11,881,376 possible combinations of five letters, or roughly 0.07%. /34
Based on some automated exploration, I think RAISE may be the “optimal” initial guess if the word list I’m using is close enough to the actual one used in Wordle. But, there’s not as “nice” a generic followup guess for it as AROSE/UNTIL; COUNT or MOUNT come close. /35
At worst, after guessing RAISE/COUNT, there would only be 98 possible word choices, for my word list. With MOUNT as the second word, it’s 90. HOTLY would reduce the possible set to 83 words and POUTY or PYLON, 89. /36
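Worst-case figures like these can be computed by grouping every word on the list by the combined feedback from a fixed sequence of opening guesses. The sketch below assumes that approach; `worst_case_after` is a hypothetical name, and the three-letter words are toy data, not the 8497-word list.

```python
from collections import Counter


def score(guess: str, answer: str) -> str:
    """Feedback as G (green), Y (yellow), - (grey); greens assigned first."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] != "G" and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)


def worst_case_after(openers, candidates) -> int:
    """Largest number of candidates left indistinguishable after all openers."""
    # Two answers remain confusable only if every opener scores them the same.
    buckets = Counter(
        tuple(score(guess, answer) for guess in openers) for answer in candidates
    )
    return max(buckets.values())


toy = ["bat", "cat", "hat", "mat"]
print(worst_case_after(("cab",), toy))        # 2
print(worst_case_after(("cab", "ham"), toy))  # 1
```

Comparing candidate second words then amounts to calling `worst_case_after` with each pair and keeping the smallest result.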
Since I switched to RAISE/COUNT as my default guesses, I’ve been solving the daily Wordles in three or four guesses, which feels a little better than before. But it may just be luck of the draw, in terms of the word of the day. /37
Summary:
- While Wordle has similar rules to Mastermind, using words makes it very different.
- Letter frequencies based on a list of unique words versus real text differ a lot.
- Only ~0.07% of five-letter combinations are words.
- AROSE/UNTIL or RAISE/COUNT.
/38 (fin)