Ruining Wordle
If you have spent any time on the internet the last few weeks, you are probably aware of Wordle. For the uninitiated, Wordle is an online code breaking/puzzle game. The game is beautifully simple, and that is probably why it has exploded in popularity. Despite the game’s simplicity, it definitely has some interesting strategies to unwrap.
So I decided to build some Python scripts to try different algorithms. I even added some Selenium to run the algorithms in the browser on its own. It still has not gotten old watching the program open a window and solve the daily puzzle.
Here you can clone the github repo and try it out for yourself.
Before we go any further let’s go over the rules of the game.
Wordle Rules
If you have ever played the classic code breaking board game Mastermind, then Wordle should feel familiar.
The goal of the game is to guess an unknown five-letter target word. You have six total guesses. After each guess you get some feedback. Letters that are in the target word and in the correct location will turn green, letters that are in the target word but are not in the correct location will turn yellow, and letters that are not in the target word will turn dark grey.
In the first guess of this example “R” is in the right location while “T” is in the target word, but is in the wrong location. “A”, “E”, and “S” are all not in the target word. After three guesses we correctly honed in on the target word: “ROBOT”.
Simple Enough!
The Process
So after playing this game for a few days my family was hooked. It was a great way for us to stay connected. The problem was, my dad was consistently getting to the solution in less guesses than I was. I couldn’t beat him on my own, so naturally I turned to Python.
To set the stage I found the list of words used in the game’s source code. There are two lists that this game uses. The first one is a list of the daily solutions (it is in order so do not look closely or you will ruin the game for yourself), the second is a list of all the possible words you can guess. There are 2,315 solution words and an additional 12,972 accepted words. This gives us a total set of 15,287 words to guess from.
After each guess the feedback trims down this set of words until we hone in on the target. My first attempt was as simple as possible:
- Select a random word from the word list as the guess
- Gather the feedback
- Use the feedback to trim down the word list
- Return to step one
Here is an example of that filter function. It eliminates any word that contains a grey letter. Then it eliminates any word that either doesn’t contain a yellow letter or that has the yellow letter in the same index. Lastly it eliminates any word that doesn’t have a green letter in the right spot.
Surprisingly, this simple “dumb” solution actually worked pretty well. Over 100 simulations looping through each target word, on average it was able to correctly guess 2,268.11 out of the 2,315 target words (missing about 47). Additionally, it took on average 4.06 guesses. What does this mean? If you just simply follow the rules and don’t guess words that disobey any of the previous feedback, you should almost always arrive at the solution. I was surprised by this, but we can do better.
The general strategy should be to limit the resulting search space (word list) as much as possible with each guess. One simple way to do this (which you have likely figured out yourself if you have played for a few days) is to avoid guessing words that have duplicate letters.
Let’s look at the example guess “EERIE”. If the target word does not contain an “E” (~54% of the target words do not contain an “E”) then the second two “E”s provide us no additional information. We are in essence wasting potential feedback.
So my first addition to the program was to do just this. As long as there are more than 10 potential words left, I filter any words with duplicate letters before randomly choosing a word. Thanks to Python’s list comprehension, this is an easy addition.
def remove_duplicate_letter_words(word_list): return([x for x in word_list if len(set(x)) == len(x)])
So does this improve anything? In the same 100 simulations, on average this new method correctly solved 2,274.93 out of the 2,315 target words (almost 7 more than before) and on average it took 3.99 guesses to solve the word. Both of these metrics are slightly better than before which means we are moving in the right direction. But we can do better!
Thinking about possible strategies for limiting the search space efficiently, I kept coming back to another classic game, “Guess Who”. If you have ever played you have probably implemented a similar strategy.
It’s much more effective to ask whether or not the target person is male than it is to ask if the person is wearing glasses. In the first scenario the answer will eliminate half the pool, whereas the second scenario will only eliminate three people (the three with glasses) unless you get very lucky. In a similar vein, guessing “Q” or “X” in Wordle could end up being a super lucky guess, but is more likely to be wrong and end up eliminating very few words.
With this in mind I created an incredibly simple scoring mechanism ranking the probability of each letter. How simple? Well, I looped through all of the possible words and counted each letter’s occurrence, then made a list of letters A-Z in ascending order by the amount of times they occurred in the set of words. That list ended up being:
['q', 'x', 'j', 'z', 'v', 'w', 'f', 'k', 'b', 'g', 'h', 'm', 'p', 'c', 'y', 'd', 'u', 'n', 't', 'l', 'i', 'r', 'o', 'a', 'e', 's']
Then to score a word I just summed up the indexes of each letter in the word. I used these scores to order the word list and make a more “educated” guess. With this scoring method, ‘aeros’, ‘arose’, and ‘soare’ are the three “best” starting words. I ran the simulation again with this new method and ‘arose’ as the starting word. It solved 2,276 out of the 2,315 words, which is only one more than before. However, when it did solve the word, it only took 3.79 guesses on average which is an improvement from 3.99. I was still a bit surprised that this tweak did not really improve the number of correct guesses. What could be causing this?
After examining some solutions, I realized this scoring system doesn’t take into account the position of the letter. This picture shows a good example of why this can cause issues.
The program had “_INCE” and guessed “Yince” (definition: A Scotch form of once) before “Mince” or “Wince”. It chose “Yince” because “Y” appears more in words than “M” or “W”. But “Y” is so popular because it occurs at the end of the words. “M” and “W” are much more common letters to start a word.
The program should have realized this and sorted “Mince” and “Wince” before “Yince”. To handle this I created a slightly more sophisticated scoring method.
For each word in the word list I compared it to every other word and kept a counter of how many matching letters there were at each index. For example, the counter for the word “Treat” looks like [814, 939, 881, 1073, 726]. Meaning “T” appears at the start of 814 other words, “R” is the second letter of 939 words, “E” is the third letter of 881 words and so on. In this new system “Yince” would be scored less than “Mince” and “Wince” because of how few words start with “Y”.
In this new example you can see the program did indeed guess “Wince” before “Yince” like we would expect. But I noticed another issue. The program had “_ _ NCE” after the 3rd guess, and then guessed “BUNCE”. This is a strange guess. According to the current logic “BU…” is more common than “MI…” or “WI…”.
However, there are more words with “_ INCE” than “_ UNCE”, so once the program knows “_ _ NCE”, it should guess a word with a second letter of “i”.
This all means that we need to be updating the scores as we trim the search space. Each loop we need to rescore with the words that are left!
With this new method, “Cares” is the highest scored word so that is what I will be starting with. This final program solved 2,303 out of the 2,315 words, on average it took 3.70 guesses to solve the word. This is ~9% more efficient than the original “dumb” algorithm. I was almost satisfied. Almost.
I was genuinely curious about the elusive 12 words the program could not guess:
[‘batty’, ‘folly’, ‘shade’, ‘gaunt’, ‘lower’, ‘boxer’, ‘hover’, ‘billy’, ‘bound’, ‘fight’, ‘batch’, ‘daunt’]
Is it possible to write an algorithm that is guaranteed to correctly guess all of the targets in only 6 guesses?
One thing popped out to me about all of these elusive words: very common endings. For example, 14 words end in “tty”, 22 end in “lly”, and unsurprisingly 141 end in “er”. Basically these words are tough to pin down because they are all similar to a lot of words. Thus, the search space is tougher to trim down. Let’s look at one specific example, “Fight”. The program took the following guesses:
- cares
- ponty
- wight
- tight
- might
- light
The program narrowed it down to “_ ight” in three guesses. The problem was there were 3 more guesses left and 4 possible words (tight, might, light, and fight). At this point the program should have realized that it would not have enough time to try them all, and instead it should have tried a random word that contained “m”, “f”, and “l” like “flame”. This guess would have been able to inform the program what the starting letter was.
So, yes, there probably is an algorithm that would guarantee a correct answer every time. At this point I went way further down the rabbit hole than I anticipated. Any further improvements would have required a fairly big lift, and probably some complicated combinatorics so…
Thanks for reading! Please let me know your strategy and favorite words.