I’ve recently written a series of blog posts about Gherkin, the Behaviour-driven development movement, and how Cucumber (the BDD tool of choice) failed to perform to expectations.
I wanted to showcase the BDD-inspired low-tech solution I came up with via a toy project, demonstrating a small but significant programming task, broken down as a series of design-implementation cycles.
Wordle is a perfect target: it’s a small codebase, with a half dozen features to string together into a useable game.
In order to document the process, the code is written via literate programming.
Literate programming is the art of writing code as if it was a novel (or blogpost), writing down what’s needed, explaining the reasoning, and weaving in code snippets that add up to the codebase as we grow in understanding. The result is a “story” which can be read, but also “tangled” back into a proper codebase that works normally.
For more context on the code repository (how to use, etc), please read the project readme.
See also the online, pretty rendered version of this document on my personal website: https://jiby.tech/project/literate_wordle/wordle.html
To get us started, let’s cover the very first behaviour Wordle has to do: pick a word that will become our secret answer.
As the first iteration in a test-driven project, it’s important that we set up all the components we’ll need going forwards.
First, let’s formalise our first requirement a little, using Gherkin Features. For context as to why/how we’re doing this, read my post on gathering requirements via Gherkin.
Feature: Pick an answer word
  As a Wordle game
  I need to pick a random 5 letter word
  In order to let players guess it
Right. That’s fairly straightforward, but the secret word can’t just be random characters, it needs to be a proper word. So we need to find a dictionary to pick from.
We want to write a test that validates that we can indeed pick a random word. But “Random” and “test” together should make anybody wince at the idea of non-deterministic testing.
We could write a test that picks a word, then confirms the word came from the dictionary file, but writing that test would mean re-implementing the entirety of the feature we’re testing, as well as relying on the internals of the implementation being correct. That’s very wrong.
A good alternative would be to pin down the randomness (making the test deterministic) by anchoring the random seed to a known value, allowing repeatable testing. But this is just the first test in a new project, so we want a simple check to start with. We compromise by making the assertion “is the picked random word five letters long?”
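For reference, the seeded alternative could look like the minimal sketch below. The `WORDS` list and the `pick_word` helper are hypothetical stand-ins, not part of this project’s code; the only point is that two generators built from the same seed make the same choice.

```python
import random

# Hypothetical stand-in for the real dictionary
WORDS = ["crane", "stink", "blank", "affix", "pearl"]


def pick_word(rng: random.Random) -> str:
    """Pick one word using the supplied random generator."""
    return rng.choice(WORDS)


def test_pick_word_is_repeatable() -> None:
    """Two generators seeded identically must pick the same word."""
    first = pick_word(random.Random(42))
    second = pick_word(random.Random(42))
    assert first == second
    assert len(first) == 5


test_pick_word_is_repeatable()
```

Passing the generator in as a parameter, rather than seeding the global `random` module, keeps the test from affecting other code that relies on randomness.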
So we write down a new test file, under the tests/ folder, starting with a file-level docstring that references the Gherkin feature this enforces.
"""Validates the Gherkin file features/pick_answer_word.feature:
Feature: Pick an answer word
  As a Wordle game
  I need to pick a random 5 letter word
  In order to let players guess it
"""

from literate_wordle.words import pick_answer_word


def test_pick_word_ok_length():
    """Confirm a wordle solution is of right size"""
    assert len(pick_answer_word()) == 5, "Picked wordle solution is wrong size!"
Of course, since that feature isn’t implemented (not even the module’s skeleton), running tests right now would crash as import errors, rather than give a red light.
So let’s implement the barest hint of the pick_answer_word function that returns the wrong thing, to make the test run and fail:
"""Dictionary features to back wordle solutions"""
In that module, let’s add the skeleton for our pick_answer_word function, but return an invalid result, to make the test explicitly fail:
def pick_answer_word() -> str:
    """Pick a Wordle solution/answer from wordle dictionary"""
    return ""  # Incorrect solution to get RED test
With our test ready, and a dummy function in place, let’s see the tests go red:
make test
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, datadir-1.3.1, clarity-1.0.1
collecting ... collected 2 items

tests/test_pick_word.py::test_pick_word_ok_length FAILED                 [ 50%]
tests/test_version.py::test_version PASSED                               [100%]

=================================== FAILURES ===================================
___________________________ test_pick_word_ok_length ___________________________

    def test_pick_word_ok_length():
        """Confirm a wordle solution is of right size"""
>       assert len(pick_answer_word()) == 5, "Picked wordle solution is wrong size!"
E       AssertionError: Picked wordle solution is wrong size!
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         0
E         5
E

tests/test_pick_word.py:13: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -
=========================== short test summary info ============================
FAILED tests/test_pick_word.py::test_pick_word_ok_length - AssertionError: Pi...
========================= 1 failed, 1 passed in 0.07s ==========================
make: *** [Makefile:16: test] Error 1
As pytest mentions, we should see a wordle solution of 5 letters, not zero. The test indeed failed as expected, so we can now make it pass by implementing the feature.
Taking a quick step back, think of how conveniently TDD lets us “dream up an API”, by describing functions and files that don’t need to exist yet.
Since we’re trying to match the Wordle website’s implementation, let’s reuse Wordle’s own dictionary. Someone helpfully uploaded it. Let’s download it:
wget \
--output-document "wordle_answers_dict.txt" \
"https://raw.githubusercontent.com/AllValley/WordleDictionary/6f14d2f03d01c36fe66e3ccc0929394251ab139d/wordle_solutions_alphabetized.txt"
Except an alphabetically sorted text file takes space for no good reason. Let’s compress it preventively.
While this can legitimately be seen as a premature optimization, we can see this as trying to “flatten” a static text file into a binary “asset” that can be packaged into the project’s package, like icons are part of webapps.
ANSWERS_FILE="wordle_answers_dict.txt"
# Get raw file size in kilobytes
du -k "${ANSWERS_FILE}"
# Compress the file (removes original)
gzip "$ANSWERS_FILE"
# Check size after compression
du -k "${ANSWERS_FILE}.gz"
16 wordle_answers_dict.txt
8 wordle_answers_dict.txt.gz
Sweet, we have cut down the filesize by half.
At first glance, the implementation of the function we want is simple, it looks roughly like this:
with open("my_dictionary.txt", "r") as fd:
    my_text = fd.read()
One just needs to find the right file path to open, plus a few sprinkles to deal with compression. Sure enough, that is fairly easy.
The issue is that we’re trying to write a Python package here, which means it could be downloaded via pip install and installed in an arbitrary location on someone’s computer. Our code needs to refer to the file as “the file XYZ inside the assets folder of our package”. We need to look up how to express that.
From Stackoverflow on reading static files from inside Python package, we can use the importlib.resources module, since our project requires Python 3.9 onwards.
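To make “the file XYZ inside the assets folder of our package” concrete, here is a self-contained sketch. It builds a throwaway package in a temporary folder (the names `demo_pkg` and `words.txt` are hypothetical) and reads the file through `importlib.resources.files`, which is available from Python 3.9. It illustrates the idea only; it is not this project’s code.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk so the example is self-contained.
# "demo_pkg" and "words.txt" are hypothetical names.
tmp_root = Path(tempfile.mkdtemp())
assets_dir = tmp_root / "demo_pkg" / "assets"
assets_dir.mkdir(parents=True)
(tmp_root / "demo_pkg" / "__init__.py").write_text("")
(assets_dir / "__init__.py").write_text("")
(assets_dir / "words.txt").write_bytes(b"crane\nstink\n")

# Make the package importable, wherever the temp folder happens to live
sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()
assets = importlib.import_module("demo_pkg.assets")

# Address the file relative to the package, not to a filesystem path
raw = importlib.resources.files(assets).joinpath("words.txt").read_bytes()
words = raw.decode("ascii").split()
print(words)
```

The calling code never mentions the temporary folder: the lookup goes through the imported module, which is what lets a pip-installed package find its own files.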
So we’ll move our gzipped dictionary file into a new module (folder) called assets, which will be a proper Python module that can be imported from:
mkdir -p src/literate_wordle/assets/
# A proper python module means an __init__.py: Give it a docstring
echo '"""Static binary assets (dictionaries) required to perform Wordle"""' > src/literate_wordle/assets/__init__.py
# Move the compressed dictionary into the assets module, where the code looks for it
mv wordle_answers_dict.txt.gz src/literate_wordle/assets/
With the file in the correct position, let’s redefine the words module we left empty, to provide the pick_answer_word function.
"""Dictionary features to back wordle solutions"""
import gzip
import importlib.resources as pkg_resources
from . import assets # Relative import of the assets/ folder
We need a convenience function to load the gzipped file into a list of strings.
def get_words_list() -> list[str]:
    """Decompress the wordle dictionary"""
    dict_compressed_bytes = pkg_resources.read_binary(
        assets, "wordle_answers_dict.txt.gz"
    )
    dict_string = gzip.decompress(dict_compressed_bytes).decode("ascii")
    # One word per line; skip blank lines such as the one after a trailing newline
    answer_word_list = [
        word.strip().lower() for word in dict_string.splitlines() if word.strip()
    ]
    return answer_word_list
Ideally we would write a dedicated test for this function, but our already-failing acceptance test is pretty much covering this entire feature, so it’s not worth it just now. This is one of those tradeoffs we make between toy projects and long-term maintainability of code as a team.
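For the curious, such a dedicated test could be sketched as follows. It assumes the parsing step is split into a hypothetical `parse_word_list` helper that takes the compressed bytes directly, so the test can feed it an in-memory fake instead of the packaged asset.

```python
import gzip


def parse_word_list(compressed: bytes) -> list[str]:
    """Decompress gzip bytes and return one lowercase word per non-empty line."""
    text = gzip.decompress(compressed).decode("ascii")
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def test_parse_word_list() -> None:
    """Mixed case, padding and a trailing newline are all normalised."""
    fake_asset = gzip.compress(b"Crane\n  stink \nBLANK\n")
    assert parse_word_list(fake_asset) == ["crane", "stink", "blank"]


test_parse_word_list()
```

Separating “fetch the bytes” from “parse the bytes” is the design choice that makes this testable without touching the package layout.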
With the word list in hand, writing out the pick function is trivial:
from random import choice


def pick_answer_word() -> str:
    """Pick a single word out of the dictionary of answers"""
    return choice(get_words_list())
With the function implemented, we can try it out in a Python REPL (Read Eval Print Loop, also known as an interactive interpreter):
poetry run python3
>>> from literate_wordle import words
>>> print(words.pick_answer_word())
stink
>>> print(words.pick_answer_word())
blank
Perfect! So the test should now pass, right?
make test
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, datadir-1.3.1, clarity-1.0.1
collecting ... collected 2 items

tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [ 50%]
tests/test_version.py::test_version PASSED                               [100%]

- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -
============================== 2 passed in 0.03s ===============================
Acceptance tests pass, and linters are happy (not pictured, use make to check).
Because the acceptance tests pass, the feature is ready to ship! That’s the BDD guarantee.
Of course, keen readers will notice sub-optimal code, like how we’re unzipping the entire solutions file on each requested answer. Because “picking a solution word” is something done on the order of once over the entire runtime of a Wordle session, we choose to leave this performance wart be.
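Should that wart ever matter, one common remedy is to memoise the loader. The sketch below is an assumption-laden illustration, not the project’s code: `FAKE_ASSET` stands in for the packaged file, and the `decompress_calls` counter exists only to show that the decompression runs once.

```python
import gzip
from functools import lru_cache

# Hypothetical in-memory stand-in for the packaged .gz asset
FAKE_ASSET = gzip.compress(b"crane\nstink\nblank\n")
decompress_calls = 0


@lru_cache(maxsize=None)
def get_words_cached() -> tuple[str, ...]:
    """Decompress the word list once; later calls reuse the cached tuple."""
    global decompress_calls
    decompress_calls += 1
    text = gzip.decompress(FAKE_ASSET).decode("ascii")
    return tuple(text.split())


get_words_cached()
get_words_cached()
print(decompress_calls)  # 1
```

Returning a tuple rather than a list keeps the cached value immutable, so one caller cannot accidentally alter what the next caller receives.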
We just completed our first loop: determine a small component that needs to be implemented to build towards the Wordle goal, spell it out with Gherkin features, make the feature explicit via an acceptance test, iterate on the new RED test until it becomes green, then ship the feature.
Common TDD workflow adds a refactor or “blue” component to the cycle, which is indeed necessary for production code, as it lends maintainability (the first draft of a codebase usually takes big shortcuts). But this project is meant as entertainment material, and proper refactoring would mean refactoring the wordle.org source file, which would drown out the nice narrative we’re building here, so let’s leave it here.
Along the way, the code blocks spelled out in this narrative-oriented file are tangled out into proper code paths, so that the Makefile can pick them up and validate the proper package-ness. We’ll see as we implement the next feature how such a weaving of code snippets works.
Now that we can pick secret words, we need to start processing guesses. The very first thing we need is validating guesses are proper words, and of the right size. This feature will give us a familiar context (dictionaries), while slowly ramping up the details of the Gherkin features:
Feature: Checking a guess is a valid word
  As a Wordle game
  I need to confirm each guessed word is valid
  So that I only accept real words, no kwyjibo
In practice, this means multiple things:
Scenario: Reject long words
  When guessing "affable"
  Then the guess is rejected
  And reason for rejection is "Guess too long"

Scenario: Reject short words
  When guessing "baby"
  Then the guess is rejected
  And reason for rejection is "Guess too short"

Scenario: Reject fake words via dictionary
  When guessing "vbpdj"
  Then the guess is rejected
  And reason for rejection is "Not a word from the dictionary"

Scenario: Accept five letter dictionary words
  When guessing "crane"
  Then the guess is accepted
So, with a feature covering these scenarios, we can start laying out acceptance tests.
Since I quite like to use the Gherkin feature file inside the docstrings of Python tests, I’m going to take advantage of having already written the feature above, to reference it, so I can template it out in code snippets:
"""Validates the Gherkin file features/checking_guess_valid_word.feature:
<<feature-check-valid-guess>>
"""
Just this once, I’ll show how the templating happens behind the scene:
"""Validates the Gherkin file features/checking_guess_valid_word.feature:
<<feature-check-valid-guess>>
<<scenario-check-valid-guess>>
"""
With the feature described, let’s import our hypothetical test code:
from literate_wordle.words import check_valid_word


def test_reject_long_words():
    """Scenario: Reject long words"""
    # When guessing "affable"
    guess = "affable"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is rejected
    assert not is_valid, "Overly long guess should have been rejected"
    # And reason for rejection is "Guess too long"
    assert reject_reason == "Guess too long"
Notice the pattern of referencing the Gherkin Scenario as comments inside the test. This practice is something I came up with on my own after being a bit disappointed with Cucumber. You can read more about it in my post on low-tech cucumber replacement.
def test_reject_overly_short_words():
    """Scenario: Reject short words"""
    # When guessing "baby"
    guess = "baby"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is rejected
    assert not is_valid, "Overly short guess should have been rejected"
    # And reason for rejection is "Guess too short"
    assert reject_reason == "Guess too short"
And finally, the dictionary checks:
def test_reject_nondict_words():
    """Scenario: Reject fake words via dictionary"""
    # When guessing "vbpdj"
    guess = "vbpdj"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is rejected
    assert not is_valid, "Word not in dictionary should have been rejected"
    # And reason for rejection is "Not a word from the dictionary"
    assert reject_reason == "Not a word from the dictionary"


def test_accept_dict_words():
    """Scenario: Accept five letter dictionary words"""
    # When guessing "crane"
    guess = "crane"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is accepted
    assert is_valid, "Correct length word in dictionary should have been accepted"
One tiny detail regarding this last example, which highlights why separating Gherkin from actual code is important: we describe in the positive scenario the need to accept a correct word in terms of “not rejecting”, which in code maps to the is_valid boolean. That’s sufficient to validate the original Gherkin scenario, which is what we think of when designing the software.
But as we see in the implementation, there’s also the matter of the reject_reason component, which we should check for emptiness. That emptiness is an implementation detail, which has no reason to be laid out in the original scenario, but is still valid to make assertions on as part of the implementation’s check. So we add the following line to the test:
assert reject_reason is None, "Accepted word should have no reason to be rejected"
With all these (high level) tests in hand, let’s write up some small implementation to get RED tests instead of a crash.
First up is defining the function’s signature. Simple enough: we take a string guess in, and return a boolean and a string for justification. Except sometimes (as seen in Listing reject-reason-none) the reason is None, so that’s more of an Optional string, which we’ll need to import.
from typing import Optional


def check_valid_word(guess: str) -> tuple[bool, Optional[str]]:
    """Pretends to check if guess is a valid word"""
    return False, "Not implemented"
All right, so we have tests, let’s see them fail!
make test 2>&1 || true
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 5 items

tests/test_checking_guess_valid_word.py::test_reject_long_words FAILED   [ 20%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words FAILED [ 40%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words FAILED [ 60%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words FAILED   [ 80%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [100%]

=================================== FAILURES ===================================
____________________________ test_reject_long_words ____________________________

    def test_reject_long_words():
        """Scenario: Reject long words"""
        # When guessing "affable"
        guess = "affable"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
        assert not is_valid, "Overly long guess should have been rejected"
        # And reason for rejection is "Guess too long"
>       assert reject_reason == "Guess too long"
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         Not implemented
E         Guess too long
E

tests/test_checking_guess_valid_word.py:39: AssertionError
________________________ test_reject_overly_short_words ________________________

    def test_reject_overly_short_words():
        """Scenario: Reject short words"""
        # When guessing "baby"
        guess = "baby"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
        assert not is_valid, "Overly short guess should have been rejected"
        # And reason for rejection is "Guess too short"
>       assert reject_reason == "Guess too short"
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         Not implemented
E         Guess too short
E

tests/test_checking_guess_valid_word.py:50: AssertionError
__________________________ test_reject_nondict_words ___________________________

    def test_reject_nondict_words():
        """Scenario: Reject fake words via dictionary"""
        # When guessing "vbpdj"
        guess = "vbpdj"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
        assert not is_valid, "Word not in dictionary should have been rejected"
        # And reason for rejection is "Not a word from the dictionary"
>       assert reject_reason == "Not a word from the dictionary"
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         Not implemented
E         Not a word from the dictionary
E

tests/test_checking_guess_valid_word.py:61: AssertionError
____________________________ test_accept_dict_words ____________________________

    def test_accept_dict_words():
        """Scenario: Accept five letter dictionary words"""
        # When guessing "crane"
        guess = "crane"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is accepted
>       assert is_valid, "Correct length word in dictionary should have been accepted"
E       AssertionError: Correct length word in dictionary should have been accepted
E       assert False

tests/test_checking_guess_valid_word.py:70: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/words.py                14      0   100%
------------------------------------------------------------
TOTAL                                       15      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml
=========================== short test summary info ============================
FAILED tests/test_checking_guess_valid_word.py::test_reject_long_words - asse...
FAILED tests/test_checking_guess_valid_word.py::test_reject_overly_short_words
FAILED tests/test_checking_guess_valid_word.py::test_reject_nondict_words - a...
FAILED tests/test_checking_guess_valid_word.py::test_accept_dict_words - Asse...
========================= 4 failed, 1 passed in 0.13s ==========================
make: *** [Makefile:16: test] Error 1
Test failure as expected, and enjoy that 100% coverage! (Coverage is admittedly a fuzzy metric that doesn’t guarantee much, but well-maintained code tends to have good coverage, because its features are well tested. It’s a correlation metric, nothing more. In our case, we’re doing TDD, with tests going first, and we push this further by making our user requirements explicit as acceptance tests, so it should be no surprise that the coverage is good.)
Let’s implement the proper feature. First of all, we replace the function stub’s body to do only guess-length checks, and run the tests against it. Since we implement half the feature (by Scenarios), we should be seeing half as many tests fail as before.
    """Check wordle guess length only, no dict checks"""
    answer_length = 5
    guess_length = len(guess)
    if guess_length < answer_length:
        return False, "Guess too short"
    if guess_length > answer_length:
        return False, "Guess too long"
    return True, None  # No dictionary check
make test 2>&1 || true
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 5 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED   [ 20%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 40%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words FAILED [ 60%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED   [ 80%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [100%]

=================================== FAILURES ===================================
__________________________ test_reject_nondict_words ___________________________

    def test_reject_nondict_words():
        """Scenario: Reject fake words via dictionary"""
        # When guessing "vbpdj"
        guess = "vbpdj"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
>       assert not is_valid, "Word not in dictionary should have been rejected"
E       AssertionError: Word not in dictionary should have been rejected
E       assert not True

tests/test_checking_guess_valid_word.py:59: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/words.py                19      0   100%
------------------------------------------------------------
TOTAL                                       20      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml
=========================== short test summary info ============================
FAILED tests/test_checking_guess_valid_word.py::test_reject_nondict_words - A...
========================= 1 failed, 4 passed in 0.11s ==========================
make: *** [Makefile:16: test] Error 1
Progress! Four of five tests pass, so we now need the dictionary. (A note on that count: the two remaining scenarios we didn’t implement code for each check the is_valid boolean, in opposite directions. Because our function now returns the same boolean for every five-letter guess, it’s normal that one of them passes spuriously: a broken clock is right twice a day.)
Note that in Wordle’s original implementation, the list of possible solutions is a subset of the word dictionary used for guess validation. We previously loaded the answers, now we need the larger set of accepted words. While it does mean there will be duplicate entries, we’re talking single-digit kilobytes, we can afford that.
We fetch the dictionary like before:
wget \
--output-document "src/literate_wordle/assets/wordle_accepted_words_dict.txt" \
"https://raw.githubusercontent.com/AllValley/WordleDictionary/6f14d2f03d01c36fe66e3ccc0929394251ab139d/wordle_complete_dictionary.txt"
And compress it too:
ANSWERS_FILE="src/literate_wordle/assets/wordle_accepted_words_dict.txt"
du -k "${ANSWERS_FILE}"
gzip "$ANSWERS_FILE"
du -k "${ANSWERS_FILE}.gz"
92 src/literate_wordle/assets/wordle_accepted_words_dict.txt
36 src/literate_wordle/assets/wordle_accepted_words_dict.txt.gz
This time is more like two thirds shaved off, sweet.
We reach to add a function for decompressing, but realize we wrote all this before, except for a different filename. So let’s edit the gzip extraction code to be more generic.
One way this can be more generic is returning a set of strings, instead of the previous list. This means we assume no ordering and use hash addressing, rather than strict string ordering. After all, we will not iterate through the list so much as randomly access entries, so the set will provide benefits down the line.
def get_asset_zip_as_set(asset_filename: str) -> set[str]:
    """Decompress a file in assets module into a set of words, separated by newline"""
    compressed_bytes = pkg_resources.read_binary(assets, asset_filename)
    string = gzip.decompress(compressed_bytes).decode("ascii")
    # One word per line; skip blank lines such as the one after a trailing newline
    return {word.strip().lower() for word in string.splitlines() if word.strip()}
In order to avoid hardcoded filenames, we yank out the file names and fetching of files:
ANSWERS_FILENAME = "wordle_answers_dict.txt.gz"
ACCEPTED_FILENAME = "wordle_accepted_words_dict.txt.gz"


def get_answers() -> set[str]:
    """Grab the Wordle answers as a set of string words"""
    return get_asset_zip_as_set(ANSWERS_FILENAME)


def get_accepted_words() -> set[str]:
    """Grab the Wordle accepted words dictionary as a set of string words"""
    return get_asset_zip_as_set(ACCEPTED_FILENAME)
And now we can use the dictionary as a set in our check_valid_word function:
    """Check a wordle guess is valid: length and in dictionary"""
    answer_length = 5
    guess_length = len(guess)
    if guess_length < answer_length:
        return False, "Guess too short"
    if guess_length > answer_length:
        return False, "Guess too long"
    valid_words_dict = get_accepted_words()
    if guess in valid_words_dict:
        return True, None
    return False, "Not a word from the dictionary"
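As an aside, the four Gherkin scenarios can also be read as a single table of inputs and expected outputs. The sketch below is self-contained for illustration only: `TINY_DICTIONARY` is a hypothetical three-word stand-in for the real asset, and the `check_valid_word` shown here is a simplified copy that uses it.

```python
from typing import Optional

# Hypothetical miniature dictionary, standing in for the real asset
TINY_DICTIONARY = {"crane", "stink", "blank"}


def check_valid_word(guess: str) -> tuple[bool, Optional[str]]:
    """Stand-in implementation, only here so the table below can run."""
    if len(guess) < 5:
        return False, "Guess too short"
    if len(guess) > 5:
        return False, "Guess too long"
    if guess not in TINY_DICTIONARY:
        return False, "Not a word from the dictionary"
    return True, None


# One row per Gherkin scenario: (guess, expected validity, expected reason)
SCENARIOS = [
    ("affable", False, "Guess too long"),
    ("baby", False, "Guess too short"),
    ("vbpdj", False, "Not a word from the dictionary"),
    ("crane", True, None),
]

for guess, expected_valid, expected_reason in SCENARIOS:
    assert check_valid_word(guess) == (expected_valid, expected_reason), guess
```

With pytest, the same table could feed `pytest.mark.parametrize`. The tradeoff is that the one-test-per-scenario layout used in this document keeps the Gherkin steps visible as comments, which a table hides.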
Small performance note: having a set of strings means the guess in answers_set membership check is O(1) on average (instead of O(n) in dictionary size for a list), because hash-based lookup in a set takes constant time on average. On a very long list of words, iterating through it could be expensive, hence using a set for lookup when we don’t need sequential access.
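To see the difference on your own machine, a rough measurement sketch follows. The word list is synthetic (hypothetical five-character strings, not the Wordle dictionary), and the timings depend on hardware, so no specific numbers are claimed here.

```python
import timeit

# Synthetic 5-character "words"; hypothetical data, not the Wordle dictionary
words_list = [f"w{number:04d}" for number in range(10_000)]
words_set = set(words_list)
needle = words_list[-1]  # worst case for the list: the last element

# Both containers give the same answer...
assert (needle in words_list) == (needle in words_set)

# ...but the list compares entries one by one, while the set hashes the needle
list_seconds = timeit.timeit(lambda: needle in words_list, number=200)
set_seconds = timeit.timeit(lambda: needle in words_set, number=200)
print(f"list: {list_seconds:.6f}s, set: {set_seconds:.6f}s")
```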
We change the invocation of pick_answer_word to use the new functions too:
def pick_answer_word() -> str:
    """Pick a single word out of the dictionary of answers"""
    return choice(list(get_answers()))
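One caveat if the seeded-randomness test idea from earlier is ever revisited: the iteration order of a set of strings can differ between interpreter runs, because Python randomises string hashes per process by default. Seeding the generator alone would then not pin down the pick. A possible workaround, sketched here with a hypothetical `answers` set and a generator passed in as a parameter, is to sort before choosing.

```python
import random

# Hypothetical stand-in for get_answers()
answers = {"crane", "stink", "blank", "affix", "pearl"}


def pick_answer_word(rng: random.Random) -> str:
    """Pick from a sorted copy so a seeded generator gives a stable result."""
    return rng.choice(sorted(answers))


# Same seed, same sorted order, same pick
assert pick_answer_word(random.Random(0)) == pick_answer_word(random.Random(0))
```

Sorting costs O(n log n) per pick, which is the same kind of once-per-game cost the narrative already accepts for decompression.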
And we’re done! Let’s run our system through make again, to spot test failures but also to get linters:
make
poetry install Installing dependencies from lock file No dependencies to install or update Installing the current project: literate_wordle (0.1.0) pre-commit run --all --all-files Emacs export org-mode file to static HTML................................Passed Trim Trailing Whitespace.................................................Passed Fix End of Files.........................................................Passed Check for added large files..............................................Passed Check that executables have shebangs.................(no files to check)Skipped Check for case conflicts.................................................Passed Check vcs permalinks.....................................................Passed Forbid new submodules....................................................Passed Mixed line ending........................................................Passed Check for merge conflicts................................................Passed Detect Private Key.......................................................Passed Check Toml...............................................................Passed Check Yaml...............................................................Passed Check JSON...........................................(no files to check)Skipped black....................................................................Passed isort (python)...........................................................Passed mypy.....................................................................Passed flake8...................................................................Passed cd docs && make html make[1]: Entering directory '/home/jiby/dev/ws/short/literate_wordle/docs' Running Sphinx v4.4.0 Read in collections ... wordle_html_export_filecopy: Initialised gherkin_features_foldercopy: Initialised gherkin_features_jinja: Initialised Clean collections ... 
gherkin_features_foldercopy: (CopyFolderDriver) Folder deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_features/ gherkin_features_jinja: (JinjaDriver) Cleaning 1 jinja Based file/s ... Executing collections ... wordle_html_export_filecopy: (CopyFileDriver) Copy file... gherkin_features_foldercopy: (CopyFolderDriver) Copy folder... gherkin_features_jinja: (JinjaDriver) Creating 1 file/s from Jinja template... loading pickled environment... done [autosummary] generating autosummary for: _collections/gherkin_feature.md, index.rst, readme.md, wordle.md, wordle_sources.md [AutoAPI] Reading files... [ 33%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/__init__.py [AutoAPI] Reading files... [ 66%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/words.py [AutoAPI] Reading files... [100%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/assets/__init__.py [AutoAPI] Mapping Data... [ 33%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/__init__.py [AutoAPI] Mapping Data... [ 66%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/words.py [AutoAPI] Mapping Data... [100%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/assets/__init__.py [AutoAPI] Rendering Data... [ 33%] literate_wordle [AutoAPI] Rendering Data... [ 66%] literate_wordle.words [AutoAPI] Rendering Data... 
[100%] literate_wordle.assets myst v0.15.2: MdParserConfig(renderer='sphinx', commonmark_only=False, enable_extensions=['dollarmath'], dmath_allow_labels=True, dmath_allow_space=True, dmath_allow_digits=True, dmath_double_inline=False, update_mathjax=True, mathjax_classes='tex2jax_process|mathjax_process|math|output_area', disable_syntax=[], url_schemes=['http', 'https', 'mailto', 'ftp'], heading_anchors=2, heading_slug_func=None, html_meta=[], footnote_transition=True, substitutions=[], sub_delimiters=['{', '}'], words_per_minute=200) building [mo]: targets for 0 po files that are out of date building [html]: targets for 5 source files that are out of date updating environment: 0 added, 7 changed, 0 removed reading sources... [ 14%] _collections/gherkin_feature reading sources... [ 28%] autoapi/index reading sources... [ 42%] autoapi/literate_wordle/assets/index reading sources... [ 57%] autoapi/literate_wordle/index reading sources... [ 71%] autoapi/literate_wordle/words/index reading sources... [ 85%] wordle reading sources... 
[100%] wordle_sources Copying static files for sphinx-needs datatables support.../home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/datatables_loader.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/datatables.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/sphinx_needs_collapse.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/datatables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/JSZip-2.5.0/jszip.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.print.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.flash.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.html5.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.colVis.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/dataTables.buttons.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.html5.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/css/common.scss /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/css/mixins.scss /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/css/buttons.dataTables.min.css 
/home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/swf/flashExport.swf /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/js/jquery.dataTables.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/css/jquery.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_asc.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_desc_disabled.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_asc_disabled.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_both.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_desc.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/ColReorder-1.4.1/js/dataTables.colReorder.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/ColReorder-1.4.1/css/colReorder.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedColumns-3.2.4/js/dataTables.fixedColumns.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedColumns-3.2.4/css/fixedColumns.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Scroller-1.4.4/js/dataTables.scroller.min.js 
/home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Scroller-1.4.4/css/scroller.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedHeader-3.1.3/js/dataTables.fixedHeader.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedHeader-3.1.3/css/fixedHeader.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Responsive-2.2.1/js/dataTables.responsive.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Responsive-2.2.1/css/responsive.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/pdfmake-0.1.32/pdfmake.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/pdfmake-0.1.32/vfs_fonts.js Copying static files for sphinx-needs custom style support...[ 25%] common.css Copying static files for sphinx-needs custom style support...[ 50%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/css/modern/layouts.css Copying static files for sphinx-needs custom style support...[ 75%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/css/modern/styles.css Copying static files for sphinx-needs custom style support...[100%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/css/modern/modern.css looking for now-outdated files... none found pickling environment... done checking consistency... /home/jiby/dev/ws/short/literate_wordle/docs/source/autoapi/index.rst: WARNING: document isn't included in any toctree done preparing documents... done writing output... [ 12%] _collections/gherkin_feature writing output... 
[ 25%] autoapi/index writing output... [ 37%] autoapi/literate_wordle/assets/index writing output... [ 50%] autoapi/literate_wordle/index writing output... [ 62%] autoapi/literate_wordle/words/index writing output... [ 75%] index writing output... [ 87%] wordle writing output... [100%] wordle_sources /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_feature.md:34: WARNING: Any IDs not assigned for table node generating indices... genindex py-modindex done highlighting module code... [ 50%] literate_wordle highlighting module code... [100%] literate_wordle.words writing additional pages... search done copying images... [ 50%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/images/feather_svg/arrow-down-circle.svg copying images... [100%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/images/feather_svg/arrow-right-circle.svg copying static files... done copying extra files... done dumping search index in English (code: en)... done dumping object inventory... done build succeeded, 2 warnings. The HTML pages are in build/html. Final clean of collections ... wordle_html_export_filecopy: (CopyFileDriver) File deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/_static/wordle.html gherkin_features_foldercopy: (CopyFolderDriver) Folder deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_features/ gherkin_features_jinja: (JinjaDriver) Cleaning 1 jinja Based file/s ... 
gherkin_features_jinja: (JinjaDriver) File deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_feature.md Checking sphinx-needs warnings make[1]: Leaving directory '/home/jiby/dev/ws/short/literate_wordle/docs' poetry run pytest ============================= test session starts ============================== platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python cachedir: .pytest_cache rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml plugins: cov-3.0.0, clarity-1.0.1 collecting ... collected 5 items tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED [ 20%] tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 40%] tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 60%] tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED [ 80%] tests/test_pick_word.py::test_pick_word_ok_length PASSED [100%] - generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml - ----------- coverage: platform linux, python 3.9.5-final-0 ----------- Name Stmts Miss Cover ------------------------------------------------------------ src/literate_wordle/__init__.py 1 0 100% src/literate_wordle/assets/__init__.py 0 0 100% src/literate_wordle/words.py 23 0 100% ------------------------------------------------------------ TOTAL 24 0 100% Coverage HTML written to dir test_results/coverage.html Coverage XML written to file test_results/coverage.xml ============================== 5 passed in 0.09s =============================== poetry build Building literate_wordle (0.1.0) - Building sdist - Built literate_wordle-0.1.0.tar.gz - Building wheel - Built literate_wordle-0.1.0-py3-none-any.whl
Tests pass, coverage stays strong, and linters are quiet: this is great!
We mentioned before that the whole dictionary gets unzipped on every request for assets. Now that we’re validating guessed words, we will be processing guesses quite often, certainly more often than we pick secret words!
What we want, to make all this fast, is to cache the unzipped dictionary, so that repeated calls to the function get_asset_zip_as_set don’t bother with opening and unzipping the file, and just serve the few hundred kilobytes of content from memory again. There’s a handy Python decorator that does the trick! Let’s add functools.cache on top of our slow function:
from functools import cache
@cache
After rerunning our tests, we now have a (theoretically) faster function, yey!
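To make the effect of the decorator concrete, here is a minimal sketch on a stand-in function. The name load_words, its fake delay and its tiny word list are assumptions for illustration, not the project's code; only functools.cache and its cache_info() helper are the real thing.

```python
import time
from functools import cache

call_count = 0  # counts how often the function body really runs


@cache
def load_words(asset_filename: str) -> frozenset[str]:
    """Hypothetical stand-in for a slow 'open and unzip' function"""
    global call_count
    call_count += 1
    time.sleep(0.01)  # pretend to read and decompress a file
    return frozenset({"crane", "brave", "abbey"})


first = load_words("answers.txt.gz")  # slow: body runs, result is stored
second = load_words("answers.txt.gz")  # fast: served from the cache
third = load_words("answers.txt.gz")
print(call_count)  # the body only ran once
print(load_words.cache_info())  # hits and misses recorded by the cache
```

One thing worth keeping in mind: the cache hands back the very same object on every call, so callers should treat the result as read-only. That's why this sketch returns a frozenset.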
Remember that we committed a couple of performance/optimization sins just then: we optimized prematurely (with no proof of slowness), and we optimized without profiling information, so we very likely just optimized something that isn’t our bottleneck. I’m fine with that, I just wanted to showcase this cool decorator, which functions like an unbounded memoizer. Let’s see quick performance numbers before and after:
poetry run python3 -m timeit -v -n 1000 --setup "from literate_wordle.words import pick_answer_word, check_valid_word" "check_valid_word(pick_answer_word())"
raw times: 2.75 sec, 2.72 sec, 2.73 sec, 2.73 sec, 2.72 sec
1000 loops, best of 5: 2.72 msec per loop
And after caching:
raw times: 17.1 msec, 12.8 msec, 12.6 msec, 12.8 msec, 12.4 msec
1000 loops, best of 5: 12.4 usec per loop
That’s a two orders of magnitude gain for a single line of code changed. Sweet.
Doing some exploration of the accepted/answer word sets, I noticed an issue:
from literate_wordle.words import get_answers, get_accepted_words
answer_lengths = [len(word) for word in list(get_answers())]
accepted_lengths = [len(word) for word in list(get_accepted_words())]
print(set(answer_lengths))
print(set(accepted_lengths))
{0, 5}
{0, 5}
Each set contains a 0-length word, in other words the empty string. This is likely a classic issue due to DOS line endings: the last line of the file is only a carriage return, which is technically whitespace, and the call to strip() removes it, leaving an empty string in the list.
If this was a proper production issue we just discovered, we would first turn the above snippet into a proper test case (asserting that no 0-length word exists, and seeing it go red), commit that, raise it as a bug, and work on a fix. But this code hasn’t reached production yet, and the bug itself is minor enough to not warrant that during our exploration phase.
We can fix this multiple ways. We could make the get_accepted_words and get_answers functions change their behaviour (either via set operations to remove the empty item from the set, returning set(words) - set([""]), or more likely by removing empty entries during iteration), but that wouldn’t prevent future users of the buggy function get_asset_zip_as_set from hitting the same issue.
So let’s fix it at the root, in the get_asset_zip_as_set function:
def get_asset_zip_as_set(asset_filename: str) -> set[str]:
    """Decompress a file in assets module into a set of words, separated by newline"""
    compressed_bytes = pkg_resources.read_binary(assets, asset_filename)
    string = gzip.decompress(compressed_bytes).decode("ascii")
    string_list = [word.strip().lower() for word in string.split("\n")]
    # Protect against whitespace-only lines during file-read causing empty stripped word
    non_empty_words = [word for word in string_list if len(word) != 0]
    return set(non_empty_words)
This was a good opportunity to play with List Comprehensions with filters, yey.
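For the curious, here is a small self-contained sketch of one way the empty string can sneak in, and of the filter removing it. The three-word list with DOS line endings is made up for illustration; the real asset files may differ in detail.

```python
import gzip

# Hypothetical word list with DOS line endings and a trailing newline
raw_text = "crane\r\nbrave\r\nabbey\r\n"
compressed_bytes = gzip.compress(raw_text.encode("ascii"))

text = gzip.decompress(compressed_bytes).decode("ascii")
lines = text.split("\n")
print(lines)  # each word keeps its "\r", and a last empty item appears

stripped = [word.strip().lower() for word in lines]
with_bug = set(stripped)  # what we had before the fix
fixed = {word for word in stripped if len(word) != 0}
print("" in with_bug, "" in fixed)
```

The filter is written here as a set comprehension, which skips the intermediate list, but it does the same job as the list comprehension above.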
The last section of each heading of this document is used for internal purposes. The Code snippets defined above are usually out of order, especially the imports, or functions defined once as stubs, then re-defined with proper implementation.
To avoid having nonsense python file ordering, with import-feature-import-feature
sequences, which formatters would go crazy over, we define below the reordered
code blocks as they should be output, using the noweb
feature of org-mode.
This lets us reference code blocks above by name, tangle out into the
proper files with proper ordering and spacing as one would expect a real
codebase to look like.
This means we need to manually weave the code blocks: instead of pointing them all to the same file and relying on the code snippets’ top-to-bottom order, we now have an explicit code block where we template out “add this bit, now 2 lines below add that snippet, and then…”. This isn’t super pretty, but it gives complete control over layout, like the number of blank lines between functions, which was blocking adoption of the formatter “black” in this repository.
First, fix the words.py imports being out of order in our narrative, by tangling them via noweb to weave the part 1 imports with the part 2 imports. This means isort (an import sorter, which sorts imports first by category, then alphabetically; the categories come in this order: stdlib, then third party packages, then local module imports) is now happy and won’t thrash these python files.
Also insert the cache decorator before the assets function, and substitute the check_valid_word function body with the real implementation instead of the dummy function defined initially.
<<choice-module-docstring>>
<<choice-stdlib>>
<<valid-cache-import>>
<<choice-stdlib2>>
<<valid-stdlib>>
<<choice-locallib>>
<<choice-magicstrings>>
<<choice-func-getdicts>>
<<valid-cache-decorator>>
<<choice-func-unzipdict-generic2>>
<<choice-func-pickanswer-generic>>
<<valid-func-proto>>
<<valid-func-len-dict>>
Now the same thing with the tests file, which indeed is in proper order already, but would benefit from two-lines-between-tests to guarantee formatting:
<<test-valid-import>>
<<test-valid-1>>
<<test-valid-2>>
<<test-valid-3>>
<<test-valid-4>>
<<reject-reason-none>>
We can pick answer words, and we can check if a guess is a valid word: we now have everything we need to score the guess! Let’s first define the overall feature:
Feature: Scoring guesses
As a Wordle game
I need to tell the player how good their guess is
In order to help them find the proper answer
This sounds simple, but implementing this feature is tricky, because of edge cases like multiple identical characters in the answer, which need to be colored appropriately (What’s the proper way to do that? No clue yet, but we need to pin it down in requirements!). So again we’ll define Gherkin Scenarios for that Feature, to give examples of how the feature works in practice. So we write out:
Scenario: Perfect guess gives perfect score
Given a wordle answer "crane"
When scoring the guess "crane"
Then score should be "🟩🟩🟩🟩🟩"
This seems easy enough, but we should notice that we’re assuming the guess is a valid word! We may want to just add another Given, like:
Given a guess that's a valid dictionary word
But this isn’t just a hypothesis from the current scenario, it’s valid for all scenarios of this feature: every scoring of a guess requires the guess to be a valid word. To avoid the tedious copying of that assumption in each Scenario, we can use a Gherkin Background for the feature:
Background:
Given a guess that's a valid dictionary word
Perfect, so now we’re assuming the guess is a valid word, which means a dependency on having implemented the previous feature, but we’re not specifying the guess word itself, which can still be scenario specific. This makes our initial “perfect guess” scenario valid again, so we can use it.
If we’ve got the perfect answer, let’s have the opposite:
Scenario: No character in common
Given a wordle answer "brave"
When scoring the guess "skill"
Then score should be "⬜⬜⬜⬜⬜"
Note that these scenarios make no assumption about how many attempts at Wordle we’re at, or about winning or losing. This is purely a hypothetical example, disjoint from the actual playing of a Wordle game. We can deal with the win/lose consequences later, once we have a proper scoring of guesses implemented.
At this point, we can conceivably start the implementation work: “let’s go, we have work to do!” And we can add the “🟨” scenario later once we have code that works.
The problem of “what to do now” is interesting, because we can continue thinking up scenarios in Gherkin for a while, or we could make a start writing test code to match these claims, fix the red tests, implement towards green tests, and add scenarios as we realize that our implementation is lacking compared to the original intent of the game. That can certainly be done!
But while it’s tempting to jump into code first, I strongly believe we as developers should instead fully scope out the problem-space first. Pin down the exact requirements (in that case via Gherkin features and scenarios), before starting to touch any code. My reasoning is that it’s very easy to get tunnel vision when writing code, getting excited about the programming problems, losing track of what the “user” wants. We should instead write down the exact user needs first, and have a proper “ritual” for switching our “User” hat to a “Developer” hat.
So, back to our gherkin scenarios, let’s add the yellow marker one:
Scenario: Character in wrong place
Given a wordle answer "rebus"
When scoring the guess "skull"
Then score should be "🟨⬜🟨⬜⬜"
And to have a good sample of cases to test with, let’s use a table of examples to confirm scoring works out in more cases:
Scenario Outline: Scoring guesses
Given a wordle <answer>
When scoring <guess>
Then score should be <score>
# Emoji (Unicode) character rendering is hard:
# Please forgive the table column alignment issues!
Examples: A few guesses and their score
| answer | guess | score |
| adage | adobe | 🟩🟩⬜⬜🟩 |
| serif | quiet | ⬜⬜🟨🟨⬜ |
| raise | radix | 🟩🟩⬜🟨⬜ |
Note how the “outline” system maps really well to the idea of “parametrized tests”. We can write the test case once, and have a decorator deal with the multiple instantiations with different data.
All right, that’s a few, moving on. But here is the most difficult to implement corner case, written out as examples of the previous scenario:
Examples: Multiple occurrences of same character
| answer | guess | score |
| abbey | kebab | ⬜🟨🟩🟨🟨 |
| abbey | babes | 🟨🟨🟩🟩⬜ |
| abbey | abyss | 🟩🟩🟨⬜⬜ |
| abbey | algae | 🟩⬜⬜⬜🟨 |
| abbey | keeps | ⬜🟨⬜⬜⬜ |
| abbey | abate | 🟩🟩⬜⬜🟨 |
Because this edge case was worrisome for accuracy, these sample answers and scores were taken from online example screenshots of the original Wordle website, thus considered accurate references.
Thinking about it, with “abbey” as reference, the “kebab” score seems logical, with the first “b” occurrence matching as green, and the second being in the wrong place. The surprise comes from “keeps”, where the first “e” counts, but the second doesn’t have an equivalent in the answer, hence is flagged as “no such character”. That makes sense, but that’s not how a naive implementation of the game would do it! Hence why it’s worth thinking about the full problem before rushing the implementation.
Out of curiosity, I wonder if there are any Wordle answers that contain three identical characters? Let’s see!
zgrep -i -E "([a-z]).*\1.*\1" \
src/literate_wordle/assets/wordle_answers_dict.txt.gz \
| wc -l
20
Really? 20? That’s harsh … show me one?
zgrep -i -E "([a-z]).*\1.*\1" \
src/literate_wordle/assets/wordle_answers_dict.txt.gz \
| head -n 1 \
| sed 's/\r//' # gets rid of CR characters in CRLF (DOS line endings)
bobby
Interesting. That must be hard to solve I imagine.
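The same check can be sketched in Python with the re module, reusing the pattern from the zgrep call. The helper name has_triple_letter and the sample word list are made up for illustration; they are not part of the project.

```python
import re

# Same pattern as the zgrep call: a letter, then that same letter twice more
TRIPLE_LETTER = re.compile(r"([a-z]).*\1.*\1", re.IGNORECASE)


def has_triple_letter(word: str) -> bool:
    """True if some letter occurs at least three times in the word"""
    return TRIPLE_LETTER.search(word) is not None


sample_words = ["bobby", "abbey", "crane", "eerie"]  # illustrative list only
print([word for word in sample_words if has_triple_letter(word)])
```

The backreference \1 only matches whatever the first group captured, which is what makes "the same letter again" expressible at all.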
With no more obvious pathological cases to cover in requirements, it’s time to switch to our developer hat, and write some (acceptance) tests!
def test_perfect_guess():
    """Scenario: Perfect guess gives perfect score"""
    # Given a wordle answer "crane"
    answer = "crane"
    # When scoring the guess "crane"
    our_guess = "crane"
    score = score_guess(our_guess, answer)
    # Then score should be "🟩🟩🟩🟩🟩"
    assert score == "🟩🟩🟩🟩🟩", "Perfect answer should give Perfect Score"
A score_guess function? Sounds reasonable. We’ll need to import it from a module…
from literate_wordle.guess import score_guess
This means we now need to create such a module.
"""Score guesses of Wordle game"""
We already defined most of the function (name, module, output), so let’s just write a stub that will make tests go red.
def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess"""
    return "⬜"
Now the test should fail appropriately. Let’s add a twist: we’ll mark the test function as expected to fail, because for now it’s not been implemented. This allows the test runner to mark all tests as OK despite known failures, and is perfect for known bugs being worked on, or new features being built. Imagine if every time we built new features via TDD, the commit that adds the test first made CI go red! No, we would rather have a nice “excuse” for this new test to fail, and have the build stay green, “with an expected failure”.
@pytest.mark.xfail(reason="Not implemented yet")
In the case of a known bug, the reason field would very likely be a bug identifier in the organisation’s bug tracker.
import pytest
Confirm these tests work, marked as xfail (“eXpected FAILure”):
make test
poetry run pytest ============================= test session starts ============================== platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python cachedir: .pytest_cache rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml plugins: cov-3.0.0, clarity-1.0.1 collecting ... collected 6 items tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED [ 16%] tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 33%] tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 50%] tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED [ 66%] tests/test_pick_word.py::test_pick_word_ok_length PASSED [ 83%] tests/test_scoring_guess.py::test_perfect_guess XFAIL (Not implement...) [100%] - generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml - ----------- coverage: platform linux, python 3.9.5-final-0 ----------- Name Stmts Miss Cover ------------------------------------------------------------ src/literate_wordle/__init__.py 1 0 100% src/literate_wordle/assets/__init__.py 0 0 100% src/literate_wordle/guess.py 2 0 100% src/literate_wordle/words.py 25 0 100% ------------------------------------------------------------ TOTAL 28 0 100% Coverage HTML written to dir test_results/coverage.html Coverage XML written to file test_results/coverage.xml ========================= 5 passed, 1 xfailed in 0.10s =========================
Note that we now have regular tests that pass, and this one test that fails as expected, and pytest, expecting it, doesn’t shout about the failure. Really handy.
Remember that “disabling” a test (marking it as pytest.mark.skip) is different from marking it as xfail: skipping a test avoids running it, while xfail tests do run, the assertion failure is just not marked as critical. There’s even a flag (strict=True on the marker) to make xpass (expected test failures that ended up being green) become an actual fatal testing error, for the cases where it’s important to track the failure itself.
Let’s implement the rest of the failing tests, so we can make it all red, then fix the implementation:
def test_no_common_character():
    """Scenario: No character in common"""
    # Given a wordle answer "brave"
    answer = "brave"
    # When scoring the guess "skill"
    our_guess = "skill"
    score = score_guess(our_guess, answer)
    # Then score should be "⬜⬜⬜⬜⬜"
    assert score == "⬜⬜⬜⬜⬜", "No character in common with answer should give 0 score"

def test_wrong_place():
    """Scenario: Character in wrong place"""
    # Given a wordle answer "rebus"
    answer = "rebus"
    # When scoring the guess "skull"
    our_guess = "skull"
    score = score_guess(our_guess, answer)
    # Then score should be "🟨⬜🟨⬜⬜"
    assert score == "🟨⬜🟨⬜⬜", "Characters are in the wrong place"
That covers the first three scenarios.
For the Scenario Outline, it’s interesting to notice that a pattern emerged, which allows the same test skeleton to be reused with different data. In Pytest, this can be done by “parametrizing” the test with multiple data entries.
This is done with the pytest.mark.parametrize decorator, and since we’re trying to group some of those test cases into different groups, we will label each case using the id argument of pytest.param.
def test_generic_score(answer, our_guess, expected_score):
    """Scenario Outline: Scoring guesses"""
    # Given a wordle <answer>
    # When scoring <guess>
    score = score_guess(our_guess, answer)
    # Then score should be <score>
    assert score == expected_score
Just need to fill in the parameters:
@pytest.mark.parametrize(
    "answer,our_guess,expected_score",
    [
        pytest.param("adage", "adobe", "🟩🟩⬜⬜🟩", id="normal_guess1"),
        pytest.param("serif", "quiet", "⬜⬜🟨🟨⬜", id="normal_guess2"),
        pytest.param("raise", "radix", "🟩🟩⬜🟨⬜", id="normal_guess3"),
        pytest.param("abbey", "kebab", "⬜🟨🟩🟨🟨", id="multi_occur1"),
        pytest.param("abbey", "babes", "🟨🟨🟩🟩⬜", id="multi_occur2"),
        pytest.param("abbey", "abyss", "🟩🟩🟨⬜⬜", id="multi_occur3"),
        pytest.param("abbey", "algae", "🟩⬜⬜⬜🟨", id="multi_occur4"),
        pytest.param("abbey", "keeps", "⬜🟨⬜⬜⬜", id="multi_occur5"),
        pytest.param("abbey", "abate", "🟩🟩⬜⬜🟨", id="multi_occur6"),
    ],
)
With the strong test harness we have, this scoring function can be done conveniently.
Let’s experiment with the solution, iterating over naive solutions and seeing how close they get to implementing the feature, measured by the number of failed tests. This isn’t required, we have already identified edge cases that make naive solutions break, but this is the fun experimenting part.
Before any actual code change, first we remove the “xfail” marker, so that test failures actually notify us as failures, as we’re actually implementing things now.
def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess naively"""
    NO = "⬜"
    OK = "🟩"
    response = ""
    for answer_char, guess_char in zip(answer, guess):
        if answer_char == guess_char:
            response += OK
        else:
            response += NO
    return response
That only passes 3 tests of the 12 we just defined, obviously because we don’t deal with misplaced characters at all. So let’s add keeping track of characters in the wrong places:
def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess a little less naively"""
    NO = "⬜"
    OK = "🟩"
    WRONG_PLACE = "🟨"
    answer_chars_set = set(answer)
    response = ""
    for answer_char, guess_char in zip(answer, guess):
        if answer_char == guess_char:
            response += OK
        elif guess_char in answer_chars_set:
            response += WRONG_PLACE
        else:
            response += NO
    return response
That version now passes 8 of 12 tests, with the issue being multiple occurrences of the same character being treated wrong, clearly an edge case we were fortunate to identify early.
Looking at the examples, it seems that our scoring function needs to keep track of how many occurrences of each character exist in the answer overall, and grade only the first occurrences of such characters as “wrong place”, reducing the counter each time.
Fortunately, Python’s standard library has a good Counter class which we can import:
from collections import Counter
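Before using it for scoring, a quick sketch of the Counter behaviours we are about to rely on. Nothing here is project code, it is just the standard library poked at with the word “abbey”.

```python
from collections import Counter

answer_chars = Counter("abbey")
print(answer_chars["b"])  # two occurrences of "b"
print(answer_chars["z"])  # missing letters count as 0, no KeyError

answer_chars["b"] -= 1  # "use up" one occurrence of "b"
answer_chars["e"] -= 1  # "use up" the only "e"
print("e" in answer_chars)  # the key is still there, with a count of 0
del answer_chars["e"]
print("e" in answer_chars)  # now the letter is really gone
```

A count dropping to zero doesn't remove the key by itself, so any membership check needs either an explicit del or a "greater than zero" comparison.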
We want something like this:
if guess_char in answer_chars and answer_chars[guess_char] > 0:
    response += WRONG_PLACE
    # Reduce occurrence since we "used" this one
    answer_chars[guess_char] -= 1
    # No more hits = pretend character isn't even seen (remove from dict)
    if answer_chars[guess_char] == 0:
        del answer_chars[guess_char]
So we try the Counter way:
def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    NO = "⬜"
    OK = "🟩"
    WRONG_PLACE = "🟨"
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    response = ""
    for answer_char, guess_char in zip(answer, guess):
        if answer_char == guess_char:
            response += OK
        elif guess_char in answer_chars and answer_chars[guess_char] > 0:
            response += WRONG_PLACE
            # Reduce occurrence since we "used" this one
            answer_chars[guess_char] -= 1
            # No more hits = pretend character isn't even seen (remove from dict)
            if answer_chars[guess_char] == 0:
                del answer_chars[guess_char]
        else:
            response += NO
    return response
But while this improves the score, we are still 3 tests from success! Turns out we only decremented the counter for yellows, not for greens as well. This needs a bit of reshuffling:
def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    NO = "⬜"
    OK = "🟩"
    WRONG_PLACE = "🟨"
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    response = ""
    for guess_char, answer_char in zip(guess, answer):
        if guess_char not in answer_chars:
            response += NO
            continue  # Early exit for this character, skip to next
        # From here on, we MUST have a char in common, regardless of place
        if answer_char == guess_char:
            response += OK
        elif answer_chars[guess_char] > 0:
            response += WRONG_PLACE
        # Either way, reduce occurrence counter since we "used" this occurrence
        answer_chars[guess_char] -= 1
        # No more hits = pretend character isn't even seen (remove from dict)
        if answer_chars[guess_char] == 0:
            del answer_chars[guess_char]
    return response
Now that we’re happy with this, we can refactor out the ugly hardcoded glyphs:
class CharacterScore(str, Enum):
    """A single character's score"""
    OK = "🟩"
    NO = "⬜"
    WRONG_PLACE = "🟨"
from enum import Enum
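Why inherit from both str and Enum? A small sketch of what the str mixin buys us, repeating the enum definition so the snippet runs on its own. The response variable is just a scratch value for the demonstration.

```python
from enum import Enum


class CharacterScore(str, Enum):
    """A single character's score"""

    OK = "🟩"
    NO = "⬜"
    WRONG_PLACE = "🟨"


response = ""
response += CharacterScore.OK  # concatenates like a plain string
response += CharacterScore.NO
print(response)
print(CharacterScore.WRONG_PLACE == "🟨")  # members compare equal to their value
print([score.value for score in CharacterScore])  # all allowed glyphs
```

Without the str mixin, the += lines would raise a TypeError, and we would have to write CharacterScore.OK.value everywhere instead.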
And to use it as part of our scoring function:
def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    response = ""
    for guess_char, answer_char in zip(guess, answer):
        if guess_char not in answer_chars:
            response += CharacterScore.NO
            continue  # Early exit for this character, skip to next
        # From here on, we MUST have a char in common, regardless of place
        if answer_char == guess_char:
            response += CharacterScore.OK
        elif answer_chars[guess_char] > 0:
            response += CharacterScore.WRONG_PLACE
        # Either way, reduce occurrence counter since we "used" this occurrence
        answer_chars[guess_char] -= 1
        # No more hits = pretend character isn't even seen (remove from dict)
        if answer_chars[guess_char] == 0:
            del answer_chars[guess_char]
    return response
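One case our examples don’t exercise: a letter guessed early in the wrong place, then again later in the right place. The single left-to-right pass spends the letter’s count on the early yellow, so the later exact match finds the letter already removed: scoring “eerie” against “crane” with the function above gives 🟨⬜🟨⬜⬜, losing the green on the final “e”. As far as I understand the original game, exact matches take priority. Here is a hedged sketch of a two-pass variant that settles greens first; the name score_guess_two_pass and the plain string constants are assumptions for illustration, not the code tangled into the project.

```python
from collections import Counter

OK, NO, WRONG_PLACE = "🟩", "⬜", "🟨"


def score_guess_two_pass(guess: str, answer: str) -> str:
    """Hypothetical two-pass variant: settle greens first, then yellows"""
    scores = []
    unmatched_answer_chars = Counter()
    # Pass 1: mark exact matches, and count answer letters left unmatched
    for guess_char, answer_char in zip(guess, answer):
        if guess_char == answer_char:
            scores.append(OK)
        else:
            scores.append(NO)
            unmatched_answer_chars[answer_char] += 1
    # Pass 2: upgrade non-green letters to yellow while the budget lasts
    for position, guess_char in zip(range(len(scores)), guess):
        if scores[position] == OK:
            continue
        if unmatched_answer_chars[guess_char] > 0:
            scores[position] = WRONG_PLACE
            unmatched_answer_chars[guess_char] -= 1
    return "".join(scores)


print(score_guess_two_pass("kebab", "abbey"))
print(score_guess_two_pass("eerie", "crane"))
```

This variant gives the same scores as our Gherkin examples, and only differs on guesses where a misplaced letter comes before a well-placed copy of the same letter.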
As before, we reorder the blocks from snippets above to export code in a way that keeps proper formatting.
<<scoring-guessmod-header>>
<<scoring-guessfunc-import>>
<<scoring-guess-enum-import>>
<<scoring-guess-enum>>
<<scoring-guessfunc-impl2>>
"""Validates the Gherkin file features/scoring_guess.feature:
<<scoring-feature>>
"""
<<scoring-test-import-pytest>>
<<scoring-test-import>>
<<scoring-test1>>
<<scoring-test2>>
<<scoring-test3>>
<<scoring-multi-parameters>>
<<scoring-multi-skeleton>>
With all the subfeatures we have, we can now almost play a round of Wordle: we’re missing only the “state” of the game board, and the interactivity of the game.
Feature: Track number of guesses
As a Wordle game
I need to track how many guesses were already given
In order to announce win or game over
There are a few obvious cases we want to see:
Scenario: First guess is allowed
Given a wordle answer
And I didn't guess before
When I guess the word
Then my guess is scored
Scenario: Sixth guess still allowed
Given a wordle answer
And I guessed 5 times
When I guess the word
Then my guess is scored
Scenario: Six failed guess is game over
Given a wordle answer
And I guessed 6 times already
When I guess the word
And my guess isn't the answer
Then my guess is scored
But game shows "Game Over"
And game shows the real answer
This feature shows us all the state we need to manage to track a Wordle game:
- an answer
- the number of previous guesses
- the previous guesses themselves? not needed after we print them
- the previous guesses’ scores? not needed after we print it either
So a Wordle Game is the aggregate of “answer” + “number of guesses”, nothing else.
Let’s write the test:
"""Validates the Gherkin file features/track_guesses.feature
<<track-guess-feat2>>
"""
def test_first_guess_allowed():
    """Scenario: First guess is allowed"""
    # Given a wordle answer
    answer = "orbit"
    # And I didn't guess before
    guess_number = 0
    game = WordleGame(answer=answer, guess_number=guess_number)
    # When I guess the word
    guess = "kebab"
    result = play_round(guess, game)
    # Then my guess is scored
    OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
    assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
    assert result.score is not None, "No score given as result"
    assert len(result.score) == 5, "Score of incorrect length"
    ALLOWED_CHARS = [score.value for score in Score]
    assert all(
        char in ALLOWED_CHARS for char in list(result.score)
    ), "Score doesn't match score's characters"
In the test above, I’ve done quite a bit of world-building:
- Used a new WordleGame structure keeping game state
- Used a new WordleMoveOutcome enumeration to describe outcomes
- Used a new play_round function that takes a game + guess
- Hinted, via the result variable, at a structure for the new game state after a move
from literate_wordle.game import WordleGame, WordleMoveOutcome, play_round
from literate_wordle.guess import CharacterScore as Score
This practice of calling an API that doesn’t exist yet is the coolest part of TDD, because the tests lend their power to help design what the software should feel like, even if we have no idea how to create the backend to that API yet. Focusing on how the feature is used, rather than on the usual engineering mindset of how we envision the backend, is very valuable.
All right, so with that in mind, let’s start actually building these data structures.
class WordleMoveOutcome(Enum):
    """Outcome of a single move"""

    GAME_OVER_LOST = 1
    GAME_WON = 2
    GUESS_SCORED_CONTINUE = 3

@dataclass
class WordleGame:
    """A Wordle game's internal state, before a move is played"""

    answer: str
    guess_number: int

@dataclass
class WordleMove:
    """A Wordle game state once a move is played"""

    game: WordleGame
    outcome: WordleMoveOutcome
    message: str
    score: Optional[str]
from dataclasses import dataclass
from enum import Enum
from typing import Optional
With the datastructures ready, we can define our function’s signature:
def play_round(guess: str, game: WordleGame) -> WordleMove:
    """Use guess on the given game, resulting in WordleMove"""
Before we finish implementing this function, let’s define the rest of the acceptance tests we settled on in Gherkin:
def test_sixth_guess_allowed():
    """Scenario: Sixth guess still allowed"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 5 times
    guess_number = 6
    game = WordleGame(answer=answer, guess_number=guess_number)
    # When I guess the word
    guess = "kebab"
    result = play_round(guess, game)
    # Then my guess is scored
    OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
    assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
    assert result.score is not None, "No score given as result"
    assert len(result.score) == 5, "Score of incorrect length"
    OK_CHARS = ["🟩", "🟨", "⬜"]
    assert all(
        char in OK_CHARS for char in list(result.score)
    ), "Score doesn't match score's characters"

def test_seventh_guess_fails_game():
    """Scenario: Sixth failed guess is game over"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 6 times already
    # Guessing 6 times BEFORE, using seventh now:
    guess_number = 7
    game = WordleGame(answer, guess_number)
    # When I guess the word
    # And my guess isn't the answer
    guess = "kebab"
    result = play_round(guess, game)
    # Then my guess isn't scored
    assert result.outcome == WordleMoveOutcome.GAME_OVER_LOST, "Should have lost game"
    # But game shows "Game Over"
    assert "game over" in result.message.lower(), "Should show game over message"
    # And game shows the real answer
    assert answer in result.message
As I write the test in Listing track-guess-test3, I notice there’s one case of the enum we haven’t covered (WordleMoveOutcome.GAME_WON), which means the play_round scenarios aren’t correct yet. Let’s add the scenario for winning the game!
Scenario: Winning guess
Given a wordle answer
And I guessed 3 times
When I guess the word
And my guess is the answer
Then my guess is scored
And score is perfect
And game shows "Game Won"
A little thought later, it seems we mixed up the requirements a little here (it happens!). When designing the Gherkin Feature, we wrote about exhausting the number of guesses; we weren’t thinking of win/lose conditions. But when writing a play_round function, it’s indeed very relevant, especially since the existing scenarios covered most of the cases already. Ideally, we could have added a separate Feature describing winning and losing, and dealt with it separately. In practice, here, it’s simpler to just expand the feature’s scope, even if it means the scope has crept out a little. This is what real engineering is about: aiming for perfection, but making compromises to match our imperfect world where deadlines and tired developers exist.
Let’s fill in our winning case test:
def test_winning_guess_wins():
    """Scenario: Winning guess"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 3 times
    guess_number = 3
    game = WordleGame(answer, guess_number)
    # When I guess the word
    # And my guess is the answer
    guess = answer
    result = play_round(guess, game)
    # Then my guess is scored
    assert result.score is not None, "Guess should be scored"
    # And the score is perfect
    assert result.score == "🟩🟩🟩🟩🟩"
    # And game shows "Game Won"
    assert result.outcome == WordleMoveOutcome.GAME_WON, "Should have won game"
    assert "game won" in result.message.lower()
With all the tests ready, we cobble together a stub for play_round to execute the tests and see them go red.
    result = WordleMoveOutcome.GAME_OVER_LOST
    return WordleMove(game=game, outcome=result, message="You suck!", score=None)
All right, the tests do fail, right?
poetry run pytest 2>&1 || true
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 21 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED [ 4%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 9%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 14%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED [ 19%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED [ 23%]
tests/test_scoring_guess.py::test_perfect_guess PASSED [ 28%]
tests/test_scoring_guess.py::test_no_common_character PASSED [ 33%]
tests/test_scoring_guess.py::test_wrong_place PASSED [ 38%]
tests/test_scoring_guess.py::test_generic_score[normal_guess1] PASSED [ 42%]
tests/test_scoring_guess.py::test_generic_score[normal_guess2] PASSED [ 47%]
tests/test_scoring_guess.py::test_generic_score[normal_guess3] PASSED [ 52%]
tests/test_scoring_guess.py::test_generic_score[multi_occur1] PASSED [ 57%]
tests/test_scoring_guess.py::test_generic_score[multi_occur2] PASSED [ 61%]
tests/test_scoring_guess.py::test_generic_score[multi_occur3] PASSED [ 66%]
tests/test_scoring_guess.py::test_generic_score[multi_occur4] PASSED [ 71%]
tests/test_scoring_guess.py::test_generic_score[multi_occur5] PASSED [ 76%]
tests/test_scoring_guess.py::test_generic_score[multi_occur6] PASSED [ 80%]
tests/test_track_guess_number.py::test_first_guess_allowed FAILED [ 85%]
tests/test_track_guess_number.py::test_sixth_guess_allowed FAILED [ 90%]
tests/test_track_guess_number.py::test_seventh_guess_fails_game FAILED [ 95%]
tests/test_track_guess_number.py::test_winning_guess_wins FAILED [100%]

=================================== FAILURES ===================================
___________________________ test_first_guess_allowed ___________________________

    def test_first_guess_allowed():
        """Scenario: First guess is allowed"""
        # Given a wordle answer
        answer = "orbit"
        # And I didn't guess before
        guess_number = 0
        game = WordleGame(answer=answer, guess_number=guess_number)
        # When I guess the word
        guess = "kebab"
        result = play_round(guess, game)
        # Then my guess is scored
        OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
>       assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
E       AssertionError: Game shouldn't be over yet
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         <WordleMoveOutcome.GAME_OVER_LOST: 1>
E         <WordleMoveOutcome.GUESS_SCORED_CONTINUE: 3>
E

tests/test_track_guess_number.py:25: AssertionError
___________________________ test_sixth_guess_allowed ___________________________

    def test_sixth_guess_allowed():
        """Scenario: Sixth guess still allowed"""
        # Given a wordle answer
        answer = "orbit"
        # And I guessed 5 times
        guess_number = 6
        game = WordleGame(answer=answer, guess_number=guess_number)
        # When I guess the word
        guess = "kebab"
        result = play_round(guess, game)
        # Then my guess is scored
        OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
>       assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
E       AssertionError: Game shouldn't be over yet
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         <WordleMoveOutcome.GAME_OVER_LOST: 1>
E         <WordleMoveOutcome.GUESS_SCORED_CONTINUE: 3>
E

tests/test_track_guess_number.py:46: AssertionError
_________________________ test_seventh_guess_fails_game _________________________

    def test_seventh_guess_fails_game():
        """Scenario: Sixth failed guess is game over"""
        # Given a wordle answer
        answer = "orbit"
        # And I guessed 6 times already
        # Guessing 6 times BEFORE, using seventh now:
        guess_number = 7
        game = WordleGame(answer, guess_number)
        # When I guess the word
        # And my guess isn't the answer
        guess = "kebab"
        result = play_round(guess, game)
        # Then my guess isn't scored
        assert result.outcome == WordleMoveOutcome.GAME_OVER_LOST, "Should have lost game"
        # But game shows "Game Over"
>       assert "game over" in result.message.lower(), "Should show game over message"
E       AssertionError: Should show game over message
E       assert in failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         game over
E         you suck!
E

tests/test_track_guess_number.py:69: AssertionError
___________________________ test_winning_guess_wins ____________________________

    def test_winning_guess_wins():
        """Scenario: Winning guess"""
        # Given a wordle answer
        answer = "orbit"
        # And I guessed 3 times
        guess_number = 3
        game = WordleGame(answer, guess_number)
        # When I guess the word
        # And my guess is the answer
        guess = answer
        result = play_round(guess, game)
        # Then my guess is scored
>       assert result.score is not None, "Guess should be scored"
E       AssertionError: Guess should be scored
E       assert is not failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         None
E

tests/test_track_guess_number.py:86: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/game.py                 20      0   100%
src/literate_wordle/guess.py                19      0   100%
src/literate_wordle/words.py                25      0   100%
------------------------------------------------------------
TOTAL                                       65      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

=========================== short test summary info ============================
FAILED tests/test_track_guess_number.py::test_first_guess_allowed - Assertion...
FAILED tests/test_track_guess_number.py::test_sixth_guess_allowed - Assertion...
FAILED tests/test_track_guess_number.py::test_sixth_guess_fails_game - Assert...
FAILED tests/test_track_guess_number.py::test_winning_guess_wins - AssertionE...
========================= 4 failed, 17 passed in 0.18s =========================
All right, let’s implement this.
First, if we have too many guesses already (before this one), we return game lost. This means we decide to fail not at the end of the failed sixth guess, but at the beginning of the seventh.
    if game.guess_number >= 7:
        message = f"Too many guesses: Game Over. Answer was: {game.answer}"
        outcome = WordleMoveOutcome.GAME_OVER_LOST
        return WordleMove(game=game, outcome=outcome, message=message, score=None)
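This boundary is easy to get wrong by one, so here is a minimal sketch of the same check in isolation. It assumes that guess_number is the 1-based number of the guess about to be played, which is what the starting value used later in new_game suggests. MAX_GUESSES and out_of_guesses are hypothetical names, not part of the codebase.

```python
MAX_GUESSES = 6  # hypothetical constant: Wordle allows six guesses


def out_of_guesses(guess_number: int) -> bool:
    """True when the guess about to be played exceeds the allowance.

    Assumes guess_number is 1-based: 1 is the first guess, 7 would be a seventh.
    """
    # Equivalent to the `guess_number >= 7` condition used in play_round
    return guess_number > MAX_GUESSES


# Guess numbers that may still be played, out of 1..8
allowed = [number for number in range(1, 9) if not out_of_guesses(number)]
```

Naming the limit makes the "fail at the start of the seventh guess" decision visible in one place rather than hidden in a literal.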
In order to count a guess, it needs to be a valid word. This means importing some of our package’s functions.
from literate_wordle.guess import score_guess
from literate_wordle.words import check_valid_word
As we write the code to check if the guess is a valid word, we notice that if the word isn’t valid, we can’t return GUESS_SCORED_CONTINUE, because an invalid-word guess shouldn’t be counted against the player! So we again revise the WordleMoveOutcome enum, and because it’s a new enum case, we will need to add a test for it to cover all grounds! Let’s put a pin in that, and finish implementing this first.
    GUESS_NOTVALID_CONTINUE = 4
To compensate for having this enum defined all out of order, we’ll again use the noweb feature to weave code back into the enum, in the subsection below, inserting this fourth possibility in the correct place, so the code looks like it should.
    valid, validity_msg = check_valid_word(guess)
    if not valid and validity_msg is not None:
        outcome = WordleMoveOutcome.GUESS_NOTVALID_CONTINUE
        return WordleMove(game=game, outcome=outcome, message=validity_msg, score=None)
Now we’ve gotten rid of the cases where the guess was invalid.
    # Guess now guaranteed to be valid: count it
    game.guess_number += 1
    score = score_guess(guess, game.answer)
    if score == "🟩🟩🟩🟩🟩":
        outcome = WordleMoveOutcome.GAME_WON
        message = f"Correct! Game won in {game.guess_number - 1} guesses"
        return WordleMove(game=game, outcome=outcome, message=message, score=score)
Hmm, but wouldn’t it be nice to avoid this hardcoded blob?
Let’s extend the CharacterScore to give this.
    @classmethod
    @property
    def perfect_score(cls) -> str:
        """All-good Wordle score for perfect guess"""
        return "".join([cls.OK] * 5)
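One caveat worth knowing: stacking @classmethod on top of @property is only supported on Python 3.9 and 3.10; it was deprecated in 3.11 and removed in 3.13. Below is a sketch of an equivalent that does not rely on it. It assumes the enum's members map to the three emoji seen in the tests; CharacterScoreSketch and the length parameter are illustrative names, not the project's code.

```python
from enum import Enum


class CharacterScoreSketch(str, Enum):
    """Illustrative stand-in for CharacterScore (member values are assumed)"""

    OK = "🟩"
    WRONG_PLACE = "🟨"
    NO = "⬜"

    @classmethod
    def perfect_score(cls, length: int = 5) -> str:
        """All-good Wordle score for a perfect guess of the given length"""
        return cls.OK.value * length


# Called as a regular class method instead of being read as a property
perfect = CharacterScoreSketch.perfect_score()
```

The trade-off is a pair of parentheses at the call site, in exchange for code that keeps working on newer interpreters.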
    if score == CharacterScore.perfect_score:
        outcome = WordleMoveOutcome.GAME_WON
        message = f"Correct! Game won in {game.guess_number - 1} guesses"
        return WordleMove(game=game, outcome=outcome, message=message, score=score)
from literate_wordle.guess import CharacterScore, score_guess
from literate_wordle.words import check_valid_word
    # Only case left is "try another guess"
    outcome = WordleMoveOutcome.GUESS_SCORED_CONTINUE
    message = f"Try again! Guess number {game.guess_number - 1}. Score is: {score}"
    return WordleMove(game=game, outcome=outcome, message=message, score=score)
Note that throughout this codebase, we made a lot of assumptions and repetitions around the length of a Wordle answer/guess, and this translates to repeated hardcoded-ness like the emoji string above. These could have been addressed right away during implementation, as we just did for the perfect score, but it’s important to consider whether the scope increase is worth it: generalizing Wordle to N characters isn’t super interesting to me, as it would require cutting new dictionaries, etc, and I’m just not that into Wordle. This is the kind of technical design decision we can make by having a firm grasp on project scope, another advantage of a deep understanding of project requirements.
Back to the implementation: tests should all pass now, make is happy, but there’s an interesting issue:
----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/game.py                 38      2    95%
src/literate_wordle/guess.py                19      0   100%
src/literate_wordle/words.py                25      0   100%
------------------------------------------------------------
TOTAL                                       83      2    98%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml
We lowered coverage, nooo! Exploring the coverage HTML file in a browser, we see that the lines in question that aren’t covered are:
    if not valid and validity_msg is not None:
        outcome = WordleMoveOutcome.GUESS_NOTVALID_CONTINUE
        return WordleMove(game=game, outcome=outcome, message=validity_msg, score=None)
Oh! That’s the test case we put a pin in! Right, so we’re back to writing that test. I wonder if we should write a whole scenario to back it up? It’s not really obvious!
If this test case spins out of an edge case of our implementation, it’s not really coming from a business requirement, so it’s probably not worth writing a Gherkin Scenario alongside the other ones. If it is indeed an overlooked requirement, then yes, add it to the requirements pile and write a feature.
Hmm, let’s write the test first, and see if the scenario that emerges is a requirement.
def test_invalid_guess_not_counted():
    """Scenario: Invalid guess isn't counted"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 3 times
    guess_number = 3
    game = WordleGame(answer=answer, guess_number=guess_number)
    # When I guess the word
    # But my guess isn't a dictionary word
    guess = "xolfy"
    result = play_round(guess, game)
    # Then my guess is rejected as invalid word
    OUTCOME_BADWORD = WordleMoveOutcome.GUESS_NOTVALID_CONTINUE
    assert result.outcome == OUTCOME_BADWORD, "Guess should have been rejected"
    # And my guess is not scored
    assert result.score is None, "No score should be given on bad word"
Hmm, after some thought, it seems that the function we implemented, compared to the feature being described in Gherkin, is indeed different!
As mentioned before, the Gherkin feature was about tracking a specific number of guesses, but we increased scope to consider the wider win scenario, using the “play round” feature. Now we are expanding the feature again to cover more than just how many guesses were made: it also needs to understand whether the guess is a valid word or not.
So for the specific purpose of tracking guesses as a feature, we’re already covered by existing scenarios. But we are missing edge cases of the implementation, as we saw in coverage metrics, and those belong to the wider feature that a “play a round” Feature would cover.
This game’s implementation being so very near completion, I am not interested in creating another feature file. I’ll just expand the original feature a bit, to be about playing a whole round, wins and losses included, just to keep this narrative barely on track.
Feature: Playing a round
As a Wordle game
I need to track how many guesses were already given, stating wins/losses
In order to play the game
Scenario: Invalid guess isn't counted
Given a wordle answer
And I guessed 3 times
When I guess the word
But my guess isn't a dictionary word
Then my guess is rejected as invalid word
And my guess is not scored
And with this new test, we’re back to passing tests and 100% coverage!
The feature first:
<<track-guess-feat2>>
<<track-guess-scenario1>>
<<track-guess-scenario2>>
<<track-guess-scenario3>>
<<track-guess-scenario4>>
<<track-guess-scenario5>>
The tests:
<<track-guess-test-docs>>
<<track-guess-test-import>>
<<track-guess-test1>>
<<track-guess-test2>>
<<track-guess-test3>>
<<track-guess-test4>>
# Case covered by existing gherkin feature:
# Intentional, see wordle.org for reasoning
<<track-guess-test5>>
"""Wordle game's state and playing rounds"""
<<track-guess-import-dataclass>>
<<track-guess-import-module>>
<<track-guess-gamestate1>>
<<track-guess-enum4>>
<<track-guess-gamestate2>>
<<track-guess-proto>>
<<track-guess-impl1>>
<<track-guess-impl2>>
<<track-guess-impl3>>
<<track-guess-impl4>>
<<track-guess-impl5>>
<<track-guess-impl6>>
And remember that we had to expand the CharacterScore, so we need to re-tangle it here:
<<scoring-guessmod-header>>
<<scoring-guessfunc-import>>
<<scoring-guess-enum-import>>
<<scoring-guess-enum>>
<<track-guess-perfectscore>>
<<scoring-guessfunc-impl2>>
We have assembled lego bricks into an almost finished product, as we have enough to play a single round. Let’s give this project a shell command to invoke, tying together all the other disjointed features.
Feature: Pywordle shell command
As a Wordle game
I need a shell command to launch the game
In order to give convenient entrypoint for players
I don’t think it’s necessary to give specific scenarios, because we’ve thoroughly tested the underlying implementation of the game, we just need to assemble it into a shell command.
So let’s define an entrypoint for the game, generating a new one:
def new_game() -> WordleGame:
    """Generate a new WordleGame"""
    return WordleGame(answer=pick_answer_word(), guess_number=1)
And how to play until we win or lose, printing to stdout as we go:
def play_game(game: WordleGame, guess_fetcher: Callable, response_logger: Callable):
    """Plays the given WordleGame until completion.

    Asks guess_fetcher for guess, and sends response to response_logger
    """
    outcome = WordleMoveOutcome.GUESS_SCORED_CONTINUE  # Gotta start somehow
    while outcome not in {WordleMoveOutcome.GAME_WON, WordleMoveOutcome.GAME_OVER_LOST}:
        guess = guess_fetcher()
        result = play_round(guess=guess, game=game)
        response_logger(result.message)
        game = result.game
        outcome = result.outcome
Pepper in the few imports we need:
from typing import Callable
from literate_wordle.game import WordleGame, WordleMoveOutcome, play_round
from literate_wordle.words import pick_answer_word
Now we can add command line argument parsing in a separate file:
def parse_args(raw_args: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse given command line arguments"""
    description = "Wordle implementation in Python, as literate programming"
    # Bit overkill since there is no real argument to parse yet
    parser = argparse.ArgumentParser(prog="pywordle", description=description)
    return parser.parse_args(raw_args)
import argparse
from typing import Optional, Sequence
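Taking raw_args as a parameter, instead of reading sys.argv inside the function, is what keeps this parser callable from a test. A small self-contained sketch of that usage follows, repeating the parser from above so it runs on its own; the flag --no-such-flag is made up for the demonstration.

```python
import argparse
from typing import Optional, Sequence


def parse_args(raw_args: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Same parser as above, repeated so the sketch is self-contained"""
    description = "Wordle implementation in Python, as literate programming"
    parser = argparse.ArgumentParser(prog="pywordle", description=description)
    return parser.parse_args(raw_args)


# An explicit empty list means "no arguments", regardless of sys.argv
args = parse_args([])

# argparse exits the interpreter on unknown flags: catch that to inspect the code
try:
    parse_args(["--no-such-flag"])
    exit_code = None
except SystemExit as exc:
    exit_code = exc.code
```

With no arguments defined yet, the namespace comes back empty, and an unknown flag ends in argparse's usual usage-error exit.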
def play_game_args(raw_args: Optional[Sequence[str]] = None):
    """Play a standard Wordle game from stdin to stdout, given args"""
    _ = parse_args(raw_args)
    game = new_game()
    play_game(game=game, guess_fetcher=input, response_logger=print)

def main():
    """Pass sys.argv to the play_game_args function"""
    play_game_args(sys.argv[1:])
import sys
from literate_wordle.main import new_game, play_game
Since both main and cli are interactive entrypoints that we don’t intend to test, it’s a bit unfair to compute coverage over them. Let’s exclude these two files, preventing them from weighing down the coverage metric.
[run]
omit =
    # Don't compute coverage for these 2 manual invocation files
    src/literate_wordle/main.py
    src/literate_wordle/cli.py
"""Entrypoint for pywordle"""
<<cli-main-import-std>>
<<cli-main-import-mod>>
<<cli-main1>>
<<cli-main2>>
<<cli-main3>>
"""Command line entrypoint for pywordle"""
<<cli-pargs-import-std1>>
<<cli-pargs-import-std3>>
<<cli-pargs-import-std2>>
<<cli-pargs-import-mod>>
<<cli-pargs1>>
<<cli-pargs2>>
<<cli-pargs3>>
In Python, when using Poetry like we are, the package is defined in pyproject.toml. To define a new command, this means using the tool.poetry.scripts key:
[tool.poetry.scripts]
pywordle = "literate_wordle.cli:main"
So we can now manually invoke this tool. And for the given argument parser, a help message should be available:
poetry run pywordle --help
usage: pywordle [-h]

Wordle implementation in Python, as literate programming

optional arguments:
  -h, --help  show this help message and exit
And we can play a round!
$ poetry shell
$ pywordle
hello
Try again! Guess number 1. Score is: ⬜🟨🟨⬜🟨
lobes
Try again! Guess number 2. Score is: 🟨🟩⬜🟩⬜
cranes
Guess too long
crane
Try again! Guess number 3. Score is: ⬜⬜⬜🟨🟨
novel
Correct! Game won in 4 guesses
Taking a step back, we’ve got command line launch of the game, and we can play with it. We’re done here, especially for a short experimental project.
But if this codebase were to be maintained, extended, reused, the bar for “acceptable” test coverage would be much higher.
For instance, we have no overall test on the game loop of guess input/output, despite all the layers below being pretty well covered. So I’d want tests that call the play_game function with scripted inputs and log the outputs, taking advantage of the dependency injection we set up to make proper UI-oriented tests. These would reveal, for instance, that when launching the game, there is nothing greeting us, no prompt for a guess, which is a usability issue.
In our case, that’s an exercise left for the reader.
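As a rough starting point for that exercise, here is a sketch of what scripted input and captured output can look like. To stay self-contained it does not import the project: play_game_standin is a hypothetical, much simplified stand-in for play_game (it only compares the guess to the answer), and scripted_fetcher is an illustrative helper name.

```python
from typing import Callable, Iterable, List


def scripted_fetcher(guesses: Iterable[str]) -> Callable[[], str]:
    """Return a zero-argument callable handing out each scripted guess in turn"""
    iterator = iter(guesses)
    return lambda: next(iterator)


def play_game_standin(
    answer: str,
    guess_fetcher: Callable[[], str],
    response_logger: Callable[[str], None],
    max_guesses: int = 6,
) -> None:
    """Hypothetical stand-in for play_game: same injected callables, no scoring"""
    for number in range(1, max_guesses + 1):
        if guess_fetcher() == answer:
            response_logger(f"Correct! Game won in {number} guesses")
            return
        response_logger(f"Try again! Guess number {number}.")
    response_logger(f"Too many guesses: Game Over. Answer was: {answer}")


# A list's append method is enough of a "logger" to capture what the player sees
transcript: List[str] = []
play_game_standin("orbit", scripted_fetcher(["kebab", "orbit"]), transcript.append)
```

In a real test, the stand-in would be replaced by the project's play_game, called with a WordleGame built around a fixed answer, and the assertions would be made on the captured transcript.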
Remember that testing’s primary goal is to increase our trust in the system we build.
In that vein, because we’ve got feature acceptance tests covered for every layer, the biggest source of uncertainty in the system is the implementation itself: we’re just not shaking out the code very much, beyond what a normal usage would look like. This calls for exploring the edge cases that code may have, regardless of intended features. Every string parameter should be tried with empty string, uppercase vs lowercase, different encoding, etc.
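To illustrate the kind of edge-case table this suggests, here is a sketch run against a hypothetical stand-in validator. check_valid_word_sketch, its tiny dictionary, the "Guess too short" and dictionary messages, and its case-insensitive behaviour are all assumptions made for the example; the project's real check_valid_word may decide differently on each of them.

```python
from typing import Dict, Optional, Tuple

WORDS = {"orbit", "kebab"}  # hypothetical two-word dictionary


def check_valid_word_sketch(guess: str) -> Tuple[bool, Optional[str]]:
    """Stand-in validator returning (is_valid, rejection message if any)"""
    if len(guess) > 5:
        return False, "Guess too long"
    if len(guess) < 5:
        return False, "Guess too short"
    if guess.lower() not in WORDS:  # assumption: uppercase guesses are accepted
        return False, "Not a word from the dictionary"
    return True, None


# Edge cases worth trying on every string parameter, with the expected verdict
EDGE_CASES: Dict[str, bool] = {
    "": False,  # empty string
    "ORBIT": True,  # uppercase version of a valid word
    "orbit ": False,  # trailing whitespace makes it six characters
    "\u00f3rbit": False,  # accented first letter, still five characters
}
results = {guess: check_valid_word_sketch(guess)[0] for guess in EDGE_CASES}
```

Writing the table down forces a decision for each case, which is exactly the uncertainty this kind of testing is meant to remove.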
We just walked through building a simple wordle program from scratch, using literate programming to weave a novel’s worth of explanations and reasoning, with code blocks that export to the proper project code locations.
The project uses modern Python tooling (poetry, pytest) and uses formatters (black, isort), linters (flake8 with plugins), type checkers (mypy), and the project generates its own general documentation (including this page, if you’re reading it in a browser) and API reference (Sphinx with myst_parser for Markdown support), enforcing compliance of every tool via make and pre-commit.
The code was written in a Test-driven (TDD) way, as the tests always came before the feature itself, guiding what the implementation looks like, all the way to having 100% test coverage (whatever that means).
More importantly in my eyes, we only built what was strictly necessary, by using Behaviour-driven development (BDD, also called acceptance-test-driven development) to guide what subfeature to build next based on our needs. These specifications were encoded as Gherkin Features, available in a dedicated features/ folder, and thanks to the magic of Sphinx documentation, each of those is collected into a list of requirements in a dedicated Requirements page of the docs.
Since all of the feature files have associated acceptance tests that match the phrasing of the Gherkin features, future automation work could look at linking the requirements in Sphinx to the associated test file, so as to finally get full traceability from requirements, through specifications, to implementation and finally acceptance tests that pass.
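A first step towards that automation could be as small as checking that every scenario title in a feature file is repeated verbatim in some test docstring. The sketch below works on plain strings to stay self-contained; scenario_titles and untested_scenarios are hypothetical helper names, and a real version would read the features/ folder and the test modules instead.

```python
import re
from typing import Iterable, Set

# Matches "Scenario: <title>" and "Scenario Outline: <title>" at line start
SCENARIO = re.compile(r"^\s*Scenario(?: Outline)?:\s*(.+?)\s*$", re.MULTILINE)


def scenario_titles(text: str) -> Set[str]:
    """Collect every scenario title found in the given text"""
    return set(SCENARIO.findall(text))


def untested_scenarios(feature_text: str, test_docstrings: Iterable[str]) -> Set[str]:
    """Scenario titles of the feature that no test docstring repeats verbatim"""
    tested: Set[str] = set()
    for docstring in test_docstrings:
        tested |= scenario_titles(docstring)
    return scenario_titles(feature_text) - tested


feature = """
Scenario: First guess is allowed
  Given a wordle answer
Scenario: Sixth guess still allowed
  Given a wordle answer
"""
docstrings = [
    "Scenario: First guess is allowed",
    "Scenario: Fifth guess still allowed",  # deliberate mismatch with the feature
]
missing = untested_scenarios(feature, docstrings)
```

Even this crude string match would flag a docstring that drifted away from the wording of its scenario.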
This project was my first foray into literate programming at this scale, an attempt to bring together all the good ideas of TDD, modern Python development, Gherkin usage for requirements traceability purposes (without overly zealous extremes of Cucumber automation). All these ideas were until now scattered, implemented each without the others in different places, and this project fuses them into something I hope is more valuable than the sum of its parts.
If you like what you see here, have a look at my other writings, available on my blog: https://jiby.tech.
A few weeks after initial release of the project, reader @gpiancastelli helpfully reported a major bug relating to guess scoring via GitHub. In this post-script note, I want to report the process of investigating the bug, present how dissecting the issue made the fix emerge, and reflect on how such a bug could sneak in despite our careful approach.
I’m painfully aware of the ironic (and embarrassing) aspect of writing a whole novel about “programming using best practices” only to get such a crucial point very, very wrong. It would be easy to hide this bug, retroactively change the narrative above, and pretend we got it right the first time. Instead, I believe there’s a lesson worth learning and sharing in there.
The original bug report states (slightly abridged):
There’s a bug in your score_guess function. If the guess contains two copies of a letter, and that letter is present only once in the answer, and the second copy in the guess matches that letter in the answer, the first copy will be marked as WRONG_PLACE, while the second copy will be marked as NO.
[…] Let’s say we have A__A_ as our guess and ___A_ as the answer. Your score_guess function will return 🟨__⬜_ instead of ⬜__🟩_.
Incorrect scoring function sounds very serious indeed, so the first step is confirming the issue with a good testcase. Can we find words that match the rule:
# Pick an answer word ending with "n"
zgrep -iE "n\b" ./src/literate_wordle/assets/wordle_answers_dict.txt.gz
# Pick a guess-word ending with "n", and with another "n"
zgrep -iE "n.*n\b" ./src/literate_wordle/assets/wordle_accepted_words_dict.txt.gz
From the many results (those regular expressions are fairly vague), I manually chose the answer train and the guess xenon.
We want to show that score_guess is wrong, which is best done by adding a case to test_generic_score:
@pytest.mark.parametrize(
    "answer,our_guess,expected_score",
    [
        pytest.param("adage", "adobe", "🟩🟩⬜⬜🟩", id="normal_guess1"),
        pytest.param("serif", "quiet", "⬜⬜🟨🟨⬜", id="normal_guess2"),
        pytest.param("raise", "radix", "🟩🟩⬜🟨⬜", id="normal_guess3"),
        pytest.param("abbey", "kebab", "⬜🟨🟩🟨🟨", id="multi_occur1"),
        pytest.param("abbey", "babes", "🟨🟨🟩🟩⬜", id="multi_occur2"),
        pytest.param("abbey", "abyss", "🟩🟩🟨⬜⬜", id="multi_occur3"),
        pytest.param("abbey", "algae", "🟩⬜⬜⬜🟨", id="multi_occur4"),
        pytest.param("abbey", "keeps", "⬜🟨⬜⬜⬜", id="multi_occur5"),
        pytest.param("abbey", "abate", "🟩🟩⬜⬜🟨", id="multi_occur6"),
        pytest.param("train", "xenon", "⬜⬜⬜⬜🟩", id="multi_occur_issue1"),
    ],
)
Let’s run the tests to see the result:
make test
poetry run pytest
============ test session starts =============
platform linux -- Python 3.9.5, pytest-7.1.2, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collected 23 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED [ 4%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 8%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 13%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED [ 17%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED [ 21%]
tests/test_scoring_guess.py::test_perfect_guess PASSED [ 26%]
tests/test_scoring_guess.py::test_no_common_character PASSED [ 30%]
tests/test_scoring_guess.py::test_wrong_place PASSED [ 34%]
tests/test_scoring_guess.py::test_generic_score[normal_guess1] PASSED [ 39%]
tests/test_scoring_guess.py::test_generic_score[normal_guess2] PASSED [ 43%]
tests/test_scoring_guess.py::test_generic_score[normal_guess3] PASSED [ 47%]
tests/test_scoring_guess.py::test_generic_score[multi_occur1] PASSED [ 52%]
tests/test_scoring_guess.py::test_generic_score[multi_occur2] PASSED [ 56%]
tests/test_scoring_guess.py::test_generic_score[multi_occur3] PASSED [ 60%]
tests/test_scoring_guess.py::test_generic_score[multi_occur4] PASSED [ 65%]
tests/test_scoring_guess.py::test_generic_score[multi_occur5] PASSED [ 69%]
tests/test_scoring_guess.py::test_generic_score[multi_occur6] PASSED [ 73%]
tests/test_scoring_guess.py::test_generic_score[multi_occur_issue1] FAILED [ 78%]
tests/test_track_guess_number.py::test_first_guess_allowed PASSED [ 82%]
tests/test_track_guess_number.py::test_sixth_guess_allowed PASSED [ 86%]
tests/test_track_guess_number.py::test_seventh_guess_fails_game PASSED [ 91%]
tests/test_track_guess_number.py::test_winning_guess_wins PASSED [ 95%]
tests/test_track_guess_number.py::test_invalid_guess_not_counted PASSED [100%]

================== FAILURES ==================
___ test_generic_score[multi_occur_issue1] ___

answer = 'train', our_guess = 'xenon'
expected_score = '⬜⬜⬜⬜🟩'

    @pytest.mark.parametrize(
        "answer,our_guess,expected_score",
        [
            pytest.param("adage", "adobe", "🟩🟩⬜⬜🟩", id="normal_guess1"),
            pytest.param("serif", "quiet", "⬜⬜🟨🟨⬜", id="normal_guess2"),
            pytest.param("raise", "radix", "🟩🟩⬜🟨⬜", id="normal_guess3"),
            pytest.param("abbey", "kebab", "⬜🟨🟩🟨🟨", id="multi_occur1"),
            pytest.param("abbey", "babes", "🟨🟨🟩🟩⬜", id="multi_occur2"),
            pytest.param("abbey", "abyss", "🟩🟩🟨⬜⬜", id="multi_occur3"),
            pytest.param("abbey", "algae", "🟩⬜⬜⬜🟨", id="multi_occur4"),
            pytest.param("abbey", "keeps", "⬜🟨⬜⬜⬜", id="multi_occur5"),
            pytest.param("abbey", "abate", "🟩🟩⬜⬜🟨", id="multi_occur6"),
            pytest.param("train", "xenon", "⬜⬜⬜⬜🟩", id="multi_occur_issue1"),
        ],
    )
    def test_generic_score(answer, our_guess, expected_score):
        """Scenario Outline: Scoring guesses"""
        # Given a wordle <answer>
        # When scoring <guess>
        score = score_guess(our_guess, answer)
        # Then score should be <score>
>       assert score == expected_score
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         ⬜⬜🟨⬜⬜
E         ⬜⬜⬜⬜🟩
E

tests/test_scoring_guess.py:68: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              0      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/game.py                 38      0   100%
src/literate_wordle/guess.py                25      0   100%
src/literate_wordle/words.py                32      0   100%
------------------------------------------------------------
TOTAL                                       95      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

========== short test summary info ===========
FAILED tests/test_scoring_guess.py::test_generic_score[multi_occur_issue1]
======== 1 failed, 22 passed in 0.15s ========
make: *** [Makefile:16: test] Error 1
Bug confirmed! Whoops.
If necessary, we can step through the example code to figure out what’s wrong, and I did. But overall, it seems that our approach of scoring by looking at each character in a single pass is at fault.
The approach falls down with the example we were given, because we don’t first detect the second n of xenon as matching the last n of train, which would score it OK (🟩), and then, in another pass, mark the remaining first n as non-matching (⬜). Instead, we run over the characters in order, detect an n in the wrong place, score it as wrong-place (🟨), and, because that decreases the occurrence counter, the next n is counted as non-matching (⬜), hence the bad score.
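To make the failure concrete, here is a minimal sketch of a single-pass scorer behaving as described above. It is a hypothetical reconstruction for illustration only, not the code from the earlier sections: the name single_pass_score and the plain string constants are assumptions of mine.

```python
from collections import Counter

# Plain string constants standing in for the project's score values (assumption)
OK, WRONG_PLACE, NO = "🟩", "🟨", "⬜"


def single_pass_score(guess: str, answer: str) -> str:
    """Hypothetical single-pass scorer, written to reproduce the reported bug"""
    remaining = Counter(answer)  # answer letters not yet "used" by the guess
    response = ""
    for guess_char, answer_char in zip(guess, answer):
        if remaining[guess_char] <= 0:
            # Letter absent from the answer, or all its occurrences already used
            response += NO
            continue
        remaining[guess_char] -= 1
        response += OK if guess_char == answer_char else WRONG_PLACE
    return response


# The first n of xenon uses up the only n of train, so the last n scores ⬜
print(single_pass_score("xenon", "train"))  # ⬜⬜🟨⬜⬜
```

The printed score is the same wrong one pytest reported: the exact match in last position never gets a chance, because the earlier n already consumed the counter.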
Thinking it through, it means that the single-pass scoring approach just cannot work, as we need to “look ahead”, already knowing the OK-ness of all guess characters before scoring the wrong-place-ness. Interesting!
So we will re-write this algorithm to work in two passes: First, detect exact matches of guess/answer character pairs, recording those as perfect scores. Then, a second pass over the guess looks for wrong-place scores, leaving everything else at the mismatch “zero” score.
In order to score “out of order” (in two passes), the response needs to change from the original empty string being built, to some random-access structure: a list.
In designing the fix, we realise that a zero score, aka all-mismatch (⬜⬜⬜⬜⬜), is the “default” case of scoring. That is, we “start” from that score, and score “up” by marking individual characters as matching.
We reflect that in the list initialisation, starting with the worst score as it means we avoid having to “detect” it anymore. That’s a tiny optimization of the code. But more importantly, this list is now randomly accessible, as we can now “peek ahead” when we couldn’t before.
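As a tiny illustration of that idea, separate from the real function, the sketch below starts from the all-mismatch list and upgrades a single position out of order. The plain string constants are stand-ins I am assuming for the project's score values.

```python
# Stand-in score values (assumption, not the project's enum)
NO, OK = "⬜", "🟩"

# Start from the worst score: every position is a mismatch by default
response = [NO] * 5

# A list lets us update any position, in any order, which a string being
# built left to right cannot do
response[4] = OK

score = "".join(response)
print(score)  # ⬜⬜⬜⬜🟩
```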
def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    # NO is the default score, no need to detect it explicitly
    response: list[str] = [CharacterScore.NO] * len(answer)
    # First pass to detect perfect scores
    for char_index, (guess_char, answer_char) in enumerate(zip(guess, answer)):
        if answer_char == guess_char:
            response[char_index] = CharacterScore.OK
            # Exact match "uses" one occurrence of this character
            answer_chars[guess_char] -= 1
    # Second pass for the yellows
    for char_num, (guess_char, existing_score) in enumerate(zip(guess, response)):
        if existing_score == CharacterScore.OK:
            continue  # It's already green: skip
        if answer_chars[guess_char] > 0:
            response[char_num] = CharacterScore.WRONG_PLACE
            # Reduce occurrence counter since we "used" this occurrence
            answer_chars[guess_char] -= 1
    return "".join(response)
Note another minor change: we removed the check for guess_char in answer_chars. This was previously there to catch the case where the answer_chars dictionary didn’t have an entry for this guess_char, which meant trying to access it would raise a KeyError, so we’d protect against that.
But as @gpiancastelli also pointed out, a collections.Counter isn’t a regular dictionary. The documentation says: “Counter objects have a dictionary interface except that they return a zero count for missing items.” This helpful divergence from regular dictionaries already protects us from that missing key issue, so the code can flow just a little more smoothly.
Had this been a raw dict, not a Counter, we could have used the get method to supply a default value on a missing key, in the form answer_chars.get(guess_char, 0). We’d be trading off clarity for briefness. Not as elegant as what Counter allows!
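For the curious, the difference between the two lookups can be seen in a few lines. This is only a side-by-side sketch using the standard library; the variable names are mine and not part of the project.

```python
from collections import Counter

answer_chars = Counter("train")
plain_dict = dict(answer_chars)  # same counts, but regular dict behaviour

# Counter: a missing key reads as zero, and the lookup does not add the key
missing_count = answer_chars["x"]
key_was_added = "x" in answer_chars

# Regular dict: the same lookup raises KeyError
try:
    plain_dict["x"]
    dict_raised = False
except KeyError:
    dict_raised = True

# Regular dict with an explicit default, as mentioned above
default_count = plain_dict.get("x", 0)

print(missing_count, key_was_added, dict_raised, default_count)  # 0 False True 0
```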
Still, the bug is fixed, as attested by tests going green again. We also check linters are happy and coverage is good (they are, it is). All is well!
We just re-defined (overwrote) a few code blocks from previous sections, so we need to re-weave them together into a real file.
If we just “fixed” the tangled blocks above in place, the story would no longer be in order and wouldn’t make sense.
So we redefine a few files here:
Examples: Reported bug: multiple occurrences of same character in guess
  | answer | guess | score |
  | train  | xenon | ⬜⬜⬜⬜🟩 |
"""Validates the Gherkin file features/scoring_guess.feature:
<<scoring-feature>>
"""
<<scoring-test-import-pytest>>
<<scoring-test-import>>
<<scoring-test1>>
<<scoring-test2>>
<<scoring-test3>>
<<scoring-multi-parameters2>>
<<scoring-multi-skeleton>>
<<scoring-guessmod-header>>
<<scoring-guessfunc-import>>
<<scoring-guess-enum-import>>
<<scoring-guess-enum>>
<<track-guess-perfectscore>>
<<scoring-guessfunc-impl3>>
We just found a bug, and fixed it. But why didn’t we catch it earlier!? Are TDD and BDD at fault? Can we just go back to coding without tests!?
I like to think that the process didn’t fail as much as my imagination did.
First, note how the Gherkin features, requirements gathering and so on did their job: we adequately planned for features, defined scenarios that made sense, and implemented those correctly. So the BDD side delivered its value!
Purely TDD-wise, all the tests we defined were valid, and covered reasonable aspects of the features to help design the new functions’ shapes. Nothing to say there either.
The failing was in the (lack of) diversity of scores used as examples: we didn’t cover a broad enough set of score samples to find issues like this one.
But finding this bug isn’t obvious: unless you already knew about it (by reading the source code and spotting a really non-obvious flaw), finding it would require playing this implementation of the game at random until you hit a bad score (which could take minutes or hours, due to the randomness involved), then reproducing the example and reporting it. This is likely what the bug reporter did: played around and found a bad case.
As a developer, I didn’t have any particular reason to suspect this specific scoring issue, so I didn’t develop a test case with it.
But I like to think that I was so close!
As you see in sections above, I was worried about scoring for multiple letters, as shown in the scoring example table. I remember this being a concern, because any naive implementation of Wordle could miss the nuance of “the real Wordle”. I even broke out screenshots from the real Wordle website to make up some references, because I couldn’t explain to myself how the scoring should happen.
Unfortunately, my attention was on multiple identical characters in the answer, not in the guess.
So, again, I was close enough to look for similar bugs, but didn’t quite find a diverse enough set of sample scores to unearth this particular issue.
Before we go, I want to flip the narrative around this bug:
The way I see it, I built a fun implementation of Wordle to play with Python, TDD and BDD. I spent a reasonable amount of time on “due diligence research” around edge cases (seen in above section) to feel good about the solution.
Isolating the bug (by adding a single line to the tests), and fixing it (a few paragraphs, one function) was a minuscule amount of additional effort, thanks to our strong test harness.
Avoiding the bug in the first place would have cost a lot more time, doing research into 100% compatibility with existing Wordle implementations, likely having to connect someone else’s Wordle code to ours to compare (with all the associated issues to deal with), for comparatively minor benefits.
This isn’t NASA, which has a single chance to send a rocket, and (comparatively) infinite engineering time to plan it. In our case, the cost of making the system robust can be prohibitive. The discipline of Engineering is about balancing acceptable risks against the costs of reliability.
So, despite having to issue a rectification to this narrative, I still believe the amount of pre-production research was sufficient: We did nothing wrong here.
This bugfix also showcases the iterative nature of software development: Earlier sections demonstrated feature addition as incremental changes, but we see here that refining the solution when it’s subtly wrong is an iterative process too!
So, yeah, building code to be correct the first time is hard. Or maybe almost impossible. Or even not the best course of action for you!
The best way to build code is to “make it work, make it right, then make it fast” in that order.