Wordle

I’ve recently written a series of blog posts about Gherkin, the Behaviour-driven development movement, and how Cucumber (the BDD tool of choice) failed to perform to expectations.

I wanted to showcase the BDD-inspired low-tech solution I came up with via a toy project, demonstrating a small but significant programming task, broken down as a series of design-implementation cycles.

Wordle is a perfect target: it’s a small codebase, with a half dozen features to string together into a useable game.

In order to document the process, the code is written via literate programming.

Literate programming is the art of writing code as if it was a novel (or blogpost), writing down what’s needed, explaining the reasoning, and weaving in code snippets that add up to the codebase as we grow in understanding. The result is a “story” which can be read, but also “tangled” back into a proper codebase that works normally.

For more context on the code repository (how to use, etc), please read the project readme.

See also the online, pretty rendered version of this document on my personal website: https://jiby.tech/project/literate_wordle/wordle.html

Mise en bouche: picking an answer

To get us started, let’s cover the very first behaviour Wordle has to do: pick a word that will become our secret answer.

As the first iteration in a test-driven project, it’s important that we set up all the components we’ll need going forwards.

First, let’s formalise our first requirement a little, using Gherkin Features. For context as to why/how we’re doing this, read my post on gathering requirements via Gherkin.

Feature: Pick an answer word
  As a Wordle game
  I need to pick a random 5 letter word
  In order to let players guess it

Right. That’s fairly straightforward, but the secret word can’t just be random characters, it needs to be a proper word. So we need to find a dictionary to pick from.

TDD for picking word functionality

We want to write a test that validates that we can indeed pick a random word. But “Random” and “test” together should make anybody wince at the idea of non-deterministic testing.

We could write a test that picks a word, then confirms the word came from the dictionary file, but writing that test would mean re-implementing the entirety of the feature we’re testing, as well as relying on the internals of the implementation being correct. That’s very wrong.

A good alternative would be to pin down the randomness (making the test deterministic) by anchoring the random seed to a known value, allowing repeatable testing. But this is just the first test in a new project, so we want a simple check to start with. We compromise by making the assertion “is the picked random word five letters long?”
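
For later reference, here is a minimal sketch of that seed-pinning idea. The word list and the pick_word helper are hypothetical stand-ins, not part of this project; the only point is that two identically seeded random.Random instances make identical picks:

```python
import random

# Hypothetical stand-in for the real dictionary
WORDS = ["crane", "stink", "blank", "abbey", "pride"]


def pick_word(rng: random.Random) -> str:
    """Pick one word using the given random number generator"""
    return rng.choice(WORDS)


# Two generators seeded with the same value make the same pick
first_pick = pick_word(random.Random(42))
second_pick = pick_word(random.Random(42))
assert first_pick == second_pick
```

Passing the generator in as a parameter (rather than seeding the global random module) keeps the determinism local to the test.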

So we write down a new test file, under the tests/ folder, starting with a file-level docstring that references the Gherkin feature this enforces.

"""Validates the Gherkin file features/pick_answer_word.feature:

Feature: Pick an answer word
  As a Wordle game
  I need to pick a random 5 letter word
  In order to let players guess it
"""

from literate_wordle.words import pick_answer_word


def test_pick_word_ok_length():
    """Confirm a wordle solution is of right size"""
    assert len(pick_answer_word()) == 5, "Picked wordle solution is wrong size!"

Of course, since that feature isn’t implemented (not even the module’s skeleton), running tests right now would crash with import errors, rather than give a red light.

So let’s implement the barest hint of the pick_answer_word function that returns the wrong thing, to make the test run and fail:

"""Dictionary features to back wordle solutions"""

In that module, let’s add the skeleton for our pick_answer_word function, but return an invalid result, to make the test explicitly fail:

def pick_answer_word() -> str:
    """Pick a Wordle solution/answer from wordle dictionary"""
    return ""  # Incorrect solution to get RED test

With our test ready, and a dummy function in place, let’s see the tests go red:

make test
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, datadir-1.3.1, clarity-1.0.1
collecting ... collected 2 items

tests/test_pick_word.py::test_pick_word_ok_length FAILED                 [ 50%]
tests/test_version.py::test_version PASSED                               [100%]

=================================== FAILURES ===================================
___________________________ test_pick_word_ok_length ___________________________

    def test_pick_word_ok_length():
        """Confirm a wordle solution is of right size"""
>       assert len(pick_answer_word()) == 5, "Picked wordle solution is wrong size!"
E       AssertionError: Picked wordle solution is wrong size!
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         0
E         5
E

tests/test_pick_word.py:13: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -
=========================== short test summary info ============================
FAILED tests/test_pick_word.py::test_pick_word_ok_length - AssertionError: Pi...
========================= 1 failed, 1 passed in 0.07s ==========================
make: *** [Makefile:16: test] Error 1

As pytest mentions, we should see a wordle solution of 5 letters, not zero. The test indeed failed as expected, so we can now make it pass by implementing the feature.

Taking a quick step back, think of how conveniently TDD lets us “dream up an API”, by describing functions and files that don’t need to exist yet.

Solutions dictionary file

Since we’re trying to match the Wordle website’s implementation, let’s reuse Wordle’s own dictionary. Someone helpfully uploaded it. Let’s download it:

wget \
    --output-document "wordle_answers_dict.txt" \
    "https://raw.githubusercontent.com/AllValley/WordleDictionary/6f14d2f03d01c36fe66e3ccc0929394251ab139d/wordle_solutions_alphabetized.txt"

Except an alphabetically sorted text file takes space for no good reason. Let’s compress it preemptively.

While this can legitimately be seen as a premature optimization, we can see this as trying to “flatten” a static text file into a binary “asset” that can be packaged into the project’s package, like icons are part of webapps.

ANSWERS_FILE="wordle_answers_dict.txt"
# Get raw file size in kilobytes
du -k "${ANSWERS_FILE}"
# Compress the file (removes original)
gzip "$ANSWERS_FILE"
# Check size after compression
du -k "${ANSWERS_FILE}.gz"
16	wordle_answers_dict.txt
8	wordle_answers_dict.txt.gz

Sweet, we have cut down the filesize by half.
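
The same measurement can be sketched in Python with the standard gzip module. The word list below is synthetic (generated, not the real Wordle answers), so the ratio it prints will differ from the du output above:

```python
import gzip
from itertools import product

# Synthetic, alphabetically sorted five-letter "words" (not real Wordle data)
words = ["".join(letters) for letters in product("abcde", repeat=5)]
raw = "\n".join(words).encode("ascii")

compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")

# Decompressing gives back the original bytes
assert gzip.decompress(compressed) == raw
```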

Importing dictionary: static/packaged asset file read

At first glance, the implementation of the function we want is simple, it looks roughly like this:

with open("my_dictionary.txt", "r") as fd:
    my_text = fd.read()

One just needs to find the right file path to open, then add some sprinkles to deal with compression. Sure enough, that is fairly easy.

The issue is that we’re trying to write a python package here, which means it could be downloaded via pip install and installed in an arbitrary location on someone’s computer. Our code needs to refer to the file as “the file XYZ inside the assets folder of our package”. We need to look up how to express that.

From Stackoverflow on reading static files from inside Python package, we can use the importlib.resources module, since our project requires Python 3.9 onwards.

So we’ll move our dictionary zip file into a new module (folder) called assets, which will be a proper python module that can be imported from:

mkdir -p src/literate_wordle/assets/
# A proper python module means an __init__.py: Give it a docstring
echo '"""Static binary assets (dictionaries) required to perform Wordle"""' > src/literate_wordle/assets/__init__.py
mv wordle_answers_dict.txt.gz src/literate_wordle/assets/
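
To see the “file inside a package” lookup in isolation, here is a self-contained sketch. It builds a throwaway package in a temporary folder (demo_pkg and words.txt.gz are made-up names for illustration), then reads the asset back through importlib.resources. Note it uses the files() API, which also exists from Python 3.9, rather than the read_binary call used elsewhere in this document:

```python
import gzip
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_pkg/assets/words.txt.gz
root = Path(tempfile.mkdtemp())
assets_dir = root / "demo_pkg" / "assets"
assets_dir.mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("")
(assets_dir / "__init__.py").write_text('"""Static assets"""\n')
(assets_dir / "words.txt.gz").write_bytes(gzip.compress(b"crane\nstink\n"))

# Make the throwaway package importable, then import its assets module
sys.path.insert(0, str(root))
importlib.invalidate_caches()
assets = importlib.import_module("demo_pkg.assets")

# Ask for "words.txt.gz inside the assets package", wherever it got installed
raw = importlib.resources.files(assets).joinpath("words.txt.gz").read_bytes()
words = gzip.decompress(raw).decode("ascii").split()
print(words)
```

The code never spells out an absolute path to the asset: the package itself is the anchor, which is what lets the lookup survive a pip install into an arbitrary location.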

With the file in correct position, let’s redefine the words module we left empty, to provide the pick_answer_word function.

"""Dictionary features to back wordle solutions"""
import gzip
import importlib.resources as pkg_resources
from . import assets  # Relative import of the assets/ folder

We need a convenience function to load the zip file into a list of strings.

def get_words_list() -> list[str]:
    """Decompress the wordle dictionary"""
    dict_compressed_bytes = pkg_resources.read_binary(
        assets, "wordle_answers_dict.txt.gz"
    )
    dict_string = gzip.decompress(dict_compressed_bytes).decode("ascii")
    # Normalise each line, skipping blank ones (e.g. from a trailing newline)
    answer_word_list = [
        word.strip().lower() for word in dict_string.splitlines() if word.strip()
    ]
    return answer_word_list

Ideally we would write a dedicated test for this function, but our already-failing acceptance test is pretty much covering this entire feature, so it’s not worth it just now. This is one of those tradeoffs we make between toy projects and long-term maintainability of code as a team.
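
For the record, such a dedicated test could be sketched without touching the packaged asset at all, by round-tripping a tiny in-memory gzip payload. The parse_word_list helper is hypothetical: it mirrors the parsing step of get_words_list, but it is not a function of this project:

```python
import gzip


def parse_word_list(compressed: bytes) -> list[str]:
    """Decompress gzip bytes into lowercase words, skipping blank lines"""
    text = gzip.decompress(compressed).decode("ascii")
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def test_parse_word_list():
    """Parsing copes with mixed case and a trailing newline"""
    payload = gzip.compress(b"Crane\nSTINK\nblank\n")
    assert parse_word_list(payload) == ["crane", "stink", "blank"]


test_parse_word_list()
```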

With the word list in hand, writing out the pick function is trivial:

from random import choice
def pick_answer_word() -> str:
    """Pick a single word out of the dictionary of answers"""
    return choice(get_words_list())

With the function implemented, we can try it out in a Python REPL (Read Eval Print Loop, also known as an interactive interpreter):

poetry run python3
>>> from literate_wordle import words
>>> print(words.pick_answer_word())
stink
>>> print(words.pick_answer_word())
blank

Perfect! So the test should now pass, right?

make test
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, datadir-1.3.1, clarity-1.0.1
collecting ... collected 2 items

tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [ 50%]
tests/test_version.py::test_version PASSED                               [100%]

- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -
============================== 2 passed in 0.03s ===============================

Acceptance tests pass, and linters are happy (not pictured, use make to check).

Because the acceptance tests pass, that means the feature is ready to ship! That’s the BDD guarantee.

Of course, keen readers will notice sub-optimal code, like how we’re unzipping the entire solutions file on each requested answer. Because “picking a solution word” is something done on the order of once over the entire runtime of a Wordle session, we choose to leave this performance wart be.
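
Should that wart ever matter, one possible fix (not applied in this project) is to memoise the loader with functools.lru_cache. In this sketch, load_words is a hypothetical stand-in for the real decompression, with a counter to show that it only runs once:

```python
from functools import lru_cache
from random import choice

load_count = 0  # Counts how often the "expensive" load actually runs


@lru_cache(maxsize=None)
def load_words() -> tuple[str, ...]:
    """Stand-in for decompressing the dictionary; cached after first call"""
    global load_count
    load_count += 1
    return ("crane", "stink", "blank")


def pick_answer_word() -> str:
    """Pick a word; repeated calls reuse the cached word tuple"""
    return choice(load_words())


for _ in range(3):
    pick_answer_word()
print(load_count)  # 1: the loader ran only once
```

Returning an immutable tuple avoids callers accidentally mutating the cached value.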

Debriefing on the method

We just completed our first loop: determine a small component that needs to be implemented to build towards the Wordle goal, spell it out with Gherkin features, make the feature explicit via an acceptance test, and iterate on the new RED test until it becomes green, then ship the feature.

Common TDD workflow adds a refactor or “blue” component to the cycle, which is indeed necessary for production code, as it lends maintainability (the first draft of a codebase usually takes big shortcuts). But this project is meant as entertainment material, and proper refactoring would mean refactoring the wordle.org source file, which would drown out the nice narrative we’re building here, so let’s leave it here.

Along the way, the code blocks spelled out in this narrative-oriented file are tangled out into proper code paths, so that the Makefile can pick them up and validate the proper package-ness. We’ll see as we implement the next feature how such a weaving of code snippets works.

Confirming guess is a valid word

Now that we can pick secret words, we need to start processing guesses. The very first thing we need is to validate that guesses are proper words of the right size. This feature will give us a familiar context (dictionaries), while slowly ramping up the details of the Gherkin features:

Feature: Checking a guess is a valid word
  As a Wordle game
  I need to confirm each guessed word is valid
  So that I only accept real words, no kwyjibo

In practice, this means multiple things:

Scenario: Reject long words
  When guessing "affable"
  Then the guess is rejected
  And reason for rejection is "Guess too long"

Scenario: Reject short words
  When guessing "baby"
  Then the guess is rejected
  And reason for rejection is "Guess too short"

Scenario: Reject fake words via dictionary
  When guessing "vbpdj"
  Then the guess is rejected
  And reason for rejection is "Not a word from the dictionary"

Scenario: Accept five letter dictionary words
  When guessing "crane"
  Then the guess is accepted

So, with a feature covering these scenarios, we can start laying out acceptance tests.

Since I quite like to use the Gherkin feature file inside the docstrings of Python tests, I’m going to take advantage of having already written the feature above, to reference it, so I can template it out in code snippets:

"""Validates the Gherkin file features/checking_guess_valid_word.feature:

<<feature-check-valid-guess>>
"""

Just this once, I’ll show how the templating happens behind the scenes:

"""Validates the Gherkin file features/checking_guess_valid_word.feature:

<<feature-check-valid-guess>>

<<scenario-check-valid-guess>>
"""

Test setup

With the feature described, let’s import our hypothetical test code

from literate_wordle.words import check_valid_word
def test_reject_long_words():
    """Scenario: Reject long words"""
    # When guessing "affable"
    guess = "affable"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is rejected
    assert not is_valid, "Overly long guess should have been rejected"
    # And reason for rejection is "Guess too long"
    assert reject_reason == "Guess too long"

Notice the pattern of referencing the Gherkin Scenario as comments inside the test. This practice is something I came up with on my own after being a bit disappointed with Cucumber. You can read more about it in my post on low-tech cucumber replacement.

def test_reject_overly_short_words():
    """Scenario: Reject short words"""
    # When guessing "baby"
    guess = "baby"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is rejected
    assert not is_valid, "Overly short guess should have been rejected"
    # And reason for rejection is "Guess too short"
    assert reject_reason == "Guess too short"

And finally, the dictionary checks:

def test_reject_nondict_words():
    """Scenario: Reject fake words via dictionary"""
    # When guessing "vbpdj"
    guess = "vbpdj"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is rejected
    assert not is_valid, "Word not in dictionary should have been rejected"
    # And reason for rejection is "Not a word from the dictionary"
    assert reject_reason == "Not a word from the dictionary"


def test_accept_dict_words():
    """Scenario: Accept five letter dictionary words"""
    # When guessing "crane"
    guess = "crane"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is accepted
    assert is_valid, "Correct length word in dictionary should have been accepted"

One tiny detail regarding this last example, which highlights why separating Gherkin from actual code is important: We describe in the positive scenario the need to accept a correct word in terms of “not rejecting”, which in code maps to the is_valid boolean. That’s sufficient to validate the original Gherkin scenario, which is what we think of when designing the software.

But as we see in the implementation, there’s also the matter of the reject_reason component, which we should check for emptiness. That emptiness is an implementation detail, which has no reason to be laid out in the original scenario, but is still valid to make assertions on as part of the implementation’s check. So we add the following line to the test:

assert reject_reason is None, "Accepted word should have no reason to be rejected"

With all these (high level) tests in hand, let’s write up some small implementation to get RED tests instead of a crash.

First up is defining the function’s signature: Simple enough, we take a string guess in, and return a boolean and a string for justification. Except sometimes (as seen in Listing reject-reason-none) the reason is None, so that’s more of an Optional string, which we’ll need to import.

from typing import Optional


def check_valid_word(guess: str) -> tuple[bool, Optional[str]]:
    """Pretends to check if guess is a valid word"""
    return False, "Not implemented"

All right, so we have tests, let’s see them fail!

make test 2>&1 || true
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 5 items

tests/test_checking_guess_valid_word.py::test_reject_long_words FAILED   [ 20%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words FAILED [ 40%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words FAILED [ 60%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words FAILED   [ 80%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [100%]

=================================== FAILURES ===================================
____________________________ test_reject_long_words ____________________________

    def test_reject_long_words():
        """Scenario: Reject long words"""
        # When guessing "affable"
        guess = "affable"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
        assert not is_valid, "Overly long guess should have been rejected"
        # And reason for rejection is "Guess too long"
>       assert reject_reason == "Guess too long"
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         Not implemented
E         Guess too long
E

tests/test_checking_guess_valid_word.py:39: AssertionError
________________________ test_reject_overly_short_words ________________________

    def test_reject_overly_short_words():
        """Scenario: Reject short words"""
        # When guessing "baby"
        guess = "baby"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
        assert not is_valid, "Overly short guess should have been rejected"
        # And reason for rejection is "Guess too short"
>       assert reject_reason == "Guess too short"
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         Not implemented
E         Guess too short
E

tests/test_checking_guess_valid_word.py:50: AssertionError
__________________________ test_reject_nondict_words ___________________________

    def test_reject_nondict_words():
        """Scenario: Reject fake words via dictionary"""
        # When guessing "vbpdj"
        guess = "vbpdj"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
        assert not is_valid, "Word not in dictionary should have been rejected"
        # And reason for rejection is "Not a word from the dictionary"
>       assert reject_reason == "Not a word from the dictionary"
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         Not implemented
E         Not a word from the dictionary
E

tests/test_checking_guess_valid_word.py:61: AssertionError
____________________________ test_accept_dict_words ____________________________

    def test_accept_dict_words():
        """Scenario: Accept five letter dictionary words"""
        # When guessing "crane"
        guess = "crane"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is accepted
>       assert is_valid, "Correct length word in dictionary should have been accepted"
E       AssertionError: Correct length word in dictionary should have been accepted
E       assert False

tests/test_checking_guess_valid_word.py:70: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/words.py                14      0   100%
------------------------------------------------------------
TOTAL                                       15      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

=========================== short test summary info ============================
FAILED tests/test_checking_guess_valid_word.py::test_reject_long_words - asse...
FAILED tests/test_checking_guess_valid_word.py::test_reject_overly_short_words
FAILED tests/test_checking_guess_valid_word.py::test_reject_nondict_words - a...
FAILED tests/test_checking_guess_valid_word.py::test_accept_dict_words - Asse...
========================= 4 failed, 1 passed in 0.13s ==========================
make: *** [Makefile:16: test] Error 1

Test failure as expected, and enjoy that 100% coverage![fn::Obviously the coverage metric is a very fuzzy number which doesn’t guarantee much, but most well maintained code tends to have good coverage, because the features are well tested. It’s a correlation metric, nothing more. In our case, we’re doing TDD (the test goes first indeed) and we’re pushing this even further by making our user requirements explicit as acceptance tests, so it should be no surprise the coverage is good.]

Implementing the feature, one test at a time

Let’s implement the proper feature. First of all, we replace the function stub’s body to do only guess-length checks, and run the tests against it. Since we implement half the feature (by Scenarios), we should be seeing half as many tests fail as before.

"""Check wordle guess length only, no dict checks"""
answer_length = 5
guess_length = len(guess)
if guess_length < answer_length:
    return False, "Guess too short"
if guess_length > answer_length:
    return False, "Guess too long"
return True, None  # No dictionary check

make test 2>&1 || true
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 5 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED   [ 20%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 40%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words FAILED [ 60%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED   [ 80%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [100%]

=================================== FAILURES ===================================
__________________________ test_reject_nondict_words ___________________________

    def test_reject_nondict_words():
        """Scenario: Reject fake words via dictionary"""
        # When guessing "vbpdj"
        guess = "vbpdj"
        is_valid, reject_reason = check_valid_word(guess)
        # Then the guess is rejected
>       assert not is_valid, "Word not in dictionary should have been rejected"
E       AssertionError: Word not in dictionary should have been rejected
E       assert not True

tests/test_checking_guess_valid_word.py:59: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/words.py                19      0   100%
------------------------------------------------------------
TOTAL                                       20      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

=========================== short test summary info ============================
FAILED tests/test_checking_guess_valid_word.py::test_reject_nondict_words - A...
========================= 1 failed, 4 passed in 0.11s ==========================
make: *** [Makefile:16: test] Error 1

Progress! Four of five tests pass[fn::We only wrote code for two of the four scenarios. The remaining two tests each expect a different value of the is_valid boolean, so it’s normal that one of them passes spuriously: our function returns the same boolean for every five-letter guess, and a broken clock is right twice a day.], so we now need the dictionary.

Note that in Wordle’s original implementation, the list of possible solutions is a subset of the word dictionary used for guess validation. We previously loaded the answers, now we need the larger set of accepted words. While it does mean there will be duplicate entries, we’re talking single-digit kilobytes, we can afford that.

We fetch the dictionary like before:

wget \
    --output-document "src/literate_wordle/assets/wordle_accepted_words_dict.txt" \
    "https://raw.githubusercontent.com/AllValley/WordleDictionary/6f14d2f03d01c36fe66e3ccc0929394251ab139d/wordle_complete_dictionary.txt"

And compress it too

ANSWERS_FILE="src/literate_wordle/assets/wordle_accepted_words_dict.txt"
du -k "${ANSWERS_FILE}"
gzip "$ANSWERS_FILE"
du -k "${ANSWERS_FILE}.gz"
92	src/literate_wordle/assets/wordle_accepted_words_dict.txt
36	src/literate_wordle/assets/wordle_accepted_words_dict.txt.gz

This time roughly 60% is shaved off (92 kB down to 36 kB), sweet.

We reach to add a function for decompressing, but realize we wrote all this before, except for a different filename. So let’s edit the zip extraction code to be more generic.

One way this can be more generic is returning a set of strings, instead of the previous list. This means we assume no ordering and use hash addressing, rather than strict string ordering. After all, we will not so much iterate through the words as look up individual entries, so the set will provide benefits down the line.

def get_asset_zip_as_set(asset_filename: str) -> set[str]:
    """Decompress a file in assets module into a set of words, separated by newline"""
    compressed_bytes = pkg_resources.read_binary(assets, asset_filename)
    string = gzip.decompress(compressed_bytes).decode("ascii")
    # Normalise each line, skipping blank ones (e.g. from a trailing newline)
    string_list = [
        word.strip().lower() for word in string.splitlines() if word.strip()
    ]
    return set(string_list)

In order to avoid hardcoded filenames, we yank out the file names and fetching of files:

ANSWERS_FILENAME = "wordle_answers_dict.txt.gz"
ACCEPTED_FILENAME = "wordle_accepted_words_dict.txt.gz"
def get_answers() -> set[str]:
    """Grab the Wordle answers as a set of string words"""
    return get_asset_zip_as_set(ANSWERS_FILENAME)


def get_accepted_words() -> set[str]:
    """Grab the Wordle accepted words dictionary as a set of string words"""
    return get_asset_zip_as_set(ACCEPTED_FILENAME)

And now we can use the dictionary as a set in our check_valid_word function:

"""Check a wordle guess is valid: length and in dictionary"""
answer_length = 5
guess_length = len(guess)
if guess_length < answer_length:
    return False, "Guess too short"
if guess_length > answer_length:
    return False, "Guess too long"
valid_words_dict = get_accepted_words()
if guess in valid_words_dict:
    return True, None
return False, "Not a word from the dictionary"
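
Put together with the signature from earlier, the whole function can be sketched as a standalone snippet. To keep it runnable on its own, the ACCEPTED_WORDS set below is a tiny hypothetical stand-in for get_accepted_words(), not the real dictionary:

```python
from typing import Optional

# Hypothetical stand-in for get_accepted_words(), to keep the sketch standalone
ACCEPTED_WORDS = {"crane", "stink", "blank"}


def check_valid_word(guess: str) -> tuple[bool, Optional[str]]:
    """Check a wordle guess is valid: length and in dictionary"""
    answer_length = 5
    if len(guess) < answer_length:
        return False, "Guess too short"
    if len(guess) > answer_length:
        return False, "Guess too long"
    if guess in ACCEPTED_WORDS:
        return True, None
    return False, "Not a word from the dictionary"


print(check_valid_word("affable"))  # (False, 'Guess too long')
print(check_valid_word("baby"))  # (False, 'Guess too short')
print(check_valid_word("vbpdj"))  # (False, 'Not a word from the dictionary')
print(check_valid_word("crane"))  # (True, None)
```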

Small performance note: Having a set of strings means the guess in valid_words_dict membership check is O(1) on average (instead of O(n) on dictionary size for a list), because the hash-addressing of a set is an O(1) operation. On a very long list of words, iterating through it could be expensive, hence using a set for lookup if we don’t need sequential access.
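
A rough way to see the difference for yourself is to time both lookups with the standard timeit module. The words below are synthetic (generated, not the real dictionary), and the printed timings depend entirely on the machine, so take this as an illustration rather than a benchmark:

```python
import timeit
from itertools import product
from string import ascii_lowercase

# Synthetic three-letter "words" (not the real dictionary): 26**3 = 17576 entries
words_list = ["".join(letters) for letters in product(ascii_lowercase, repeat=3)]
words_set = set(words_list)

missing = "zzzz"  # Absent from both: the list has to scan every entry to say no

list_time = timeit.timeit(lambda: missing in words_list, number=200)
set_time = timeit.timeit(lambda: missing in words_set, number=200)
print(f"list: {list_time:.5f}s, set: {set_time:.5f}s")
```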

We change the implementation of pick_answer_word to use the new functions too:

def pick_answer_word() -> str:
    """Pick a single word out of the dictionary of answers"""
    return choice(list(get_answers()))

And we’re done! Let’s run our system through make again, to spot test failures but also to get linters:

make
poetry install
Installing dependencies from lock file

No dependencies to install or update

Installing the current project: literate_wordle (0.1.0)
pre-commit run --all --all-files
Emacs export org-mode file to static HTML................................Passed
Trim Trailing Whitespace.................................................Passed
Fix End of Files.........................................................Passed
Check for added large files..............................................Passed
Check that executables have shebangs.................(no files to check)Skipped
Check for case conflicts.................................................Passed
Check vcs permalinks.....................................................Passed
Forbid new submodules....................................................Passed
Mixed line ending........................................................Passed
Check for merge conflicts................................................Passed
Detect Private Key.......................................................Passed
Check Toml...............................................................Passed
Check Yaml...............................................................Passed
Check JSON...........................................(no files to check)Skipped
black....................................................................Passed
isort (python)...........................................................Passed
mypy.....................................................................Passed
flake8...................................................................Passed
cd docs && make html
make[1]: Entering directory '/home/jiby/dev/ws/short/literate_wordle/docs'
Running Sphinx v4.4.0
Read in collections ...
  wordle_html_export_filecopy: Initialised
  gherkin_features_foldercopy: Initialised
  gherkin_features_jinja: Initialised
Clean collections ...
  gherkin_features_foldercopy: (CopyFolderDriver) Folder deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_features/
  gherkin_features_jinja: (JinjaDriver) Cleaning 1 jinja Based file/s ...
Executing collections ...
  wordle_html_export_filecopy: (CopyFileDriver) Copy file...
  gherkin_features_foldercopy: (CopyFolderDriver) Copy folder...
  gherkin_features_jinja: (JinjaDriver) Creating 1 file/s from Jinja template...
loading pickled environment... done
[autosummary] generating autosummary for: _collections/gherkin_feature.md, index.rst, readme.md, wordle.md, wordle_sources.md
[AutoAPI] Reading files... [ 33%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/__init__.py
[AutoAPI] Reading files... [ 66%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/words.py
[AutoAPI] Reading files... [100%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/assets/__init__.py

[AutoAPI] Mapping Data... [ 33%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/__init__.py
[AutoAPI] Mapping Data... [ 66%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/words.py
[AutoAPI] Mapping Data... [100%] /home/jiby/dev/ws/short/literate_wordle/src/literate_wordle/assets/__init__.py

[AutoAPI] Rendering Data... [ 33%] literate_wordle
[AutoAPI] Rendering Data... [ 66%] literate_wordle.words
[AutoAPI] Rendering Data... [100%] literate_wordle.assets

myst v0.15.2: MdParserConfig(renderer='sphinx', commonmark_only=False, enable_extensions=['dollarmath'], dmath_allow_labels=True, dmath_allow_space=True, dmath_allow_digits=True, dmath_double_inline=False, update_mathjax=True, mathjax_classes='tex2jax_process|mathjax_process|math|output_area', disable_syntax=[], url_schemes=['http', 'https', 'mailto', 'ftp'], heading_anchors=2, heading_slug_func=None, html_meta=[], footnote_transition=True, substitutions=[], sub_delimiters=['{', '}'], words_per_minute=200)
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 5 source files that are out of date
updating environment: 0 added, 7 changed, 0 removed
reading sources... [ 14%] _collections/gherkin_feature
reading sources... [ 28%] autoapi/index
reading sources... [ 42%] autoapi/literate_wordle/assets/index
reading sources... [ 57%] autoapi/literate_wordle/index
reading sources... [ 71%] autoapi/literate_wordle/words/index
reading sources... [ 85%] wordle
reading sources... [100%] wordle_sources

Copying static files for sphinx-needs datatables support.../home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/datatables_loader.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/datatables.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/sphinx_needs_collapse.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/datatables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/JSZip-2.5.0/jszip.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.print.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.flash.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.html5.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.colVis.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/dataTables.buttons.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/js/buttons.html5.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/css/common.scss /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/css/mixins.scss /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/css/buttons.dataTables.min.css 
/home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Buttons-1.5.1/swf/flashExport.swf /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/js/jquery.dataTables.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/css/jquery.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_asc.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_desc_disabled.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_asc_disabled.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_both.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/DataTables-1.10.16/images/sort_desc.png /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/ColReorder-1.4.1/js/dataTables.colReorder.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/ColReorder-1.4.1/css/colReorder.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedColumns-3.2.4/js/dataTables.fixedColumns.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedColumns-3.2.4/css/fixedColumns.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Scroller-1.4.4/js/dataTables.scroller.min.js 
/home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Scroller-1.4.4/css/scroller.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedHeader-3.1.3/js/dataTables.fixedHeader.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/FixedHeader-3.1.3/css/fixedHeader.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Responsive-2.2.1/js/dataTables.responsive.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/Responsive-2.2.1/css/responsive.dataTables.min.css /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/pdfmake-0.1.32/pdfmake.min.js /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/libs/html/pdfmake-0.1.32/vfs_fonts.js
Copying static files for sphinx-needs custom style support...[ 25%] common.css
Copying static files for sphinx-needs custom style support...[ 50%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/css/modern/layouts.css
Copying static files for sphinx-needs custom style support...[ 75%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/css/modern/styles.css
Copying static files for sphinx-needs custom style support...[100%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/css/modern/modern.css

looking for now-outdated files... none found
pickling environment... done
checking consistency... /home/jiby/dev/ws/short/literate_wordle/docs/source/autoapi/index.rst: WARNING: document isn't included in any toctree
done
preparing documents... done
writing output... [ 12%] _collections/gherkin_feature
writing output... [ 25%] autoapi/index
writing output... [ 37%] autoapi/literate_wordle/assets/index
writing output... [ 50%] autoapi/literate_wordle/index
writing output... [ 62%] autoapi/literate_wordle/words/index
writing output... [ 75%] index
writing output... [ 87%] wordle
writing output... [100%] wordle_sources

/home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_feature.md:34: WARNING: Any IDs not assigned for table node
generating indices... genindex py-modindex done
highlighting module code... [ 50%] literate_wordle
highlighting module code... [100%] literate_wordle.words

writing additional pages... search done
copying images... [ 50%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/images/feather_svg/arrow-down-circle.svg
copying images... [100%] /home/jiby/dev/ws/short/literate_wordle/.venv/lib/python3.9/site-packages/sphinxcontrib/needs/images/feather_svg/arrow-right-circle.svg

copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build succeeded, 2 warnings.

The HTML pages are in build/html.
Final clean of collections ...
  wordle_html_export_filecopy: (CopyFileDriver) File deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/_static/wordle.html
  gherkin_features_foldercopy: (CopyFolderDriver) Folder deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_features/
  gherkin_features_jinja: (JinjaDriver) Cleaning 1 jinja Based file/s ...
  gherkin_features_jinja: (JinjaDriver)   File deleted: /home/jiby/dev/ws/short/literate_wordle/docs/source/_collections/gherkin_feature.md

Checking sphinx-needs warnings
make[1]: Leaving directory '/home/jiby/dev/ws/short/literate_wordle/docs'
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 5 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED   [ 20%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 40%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 60%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED   [ 80%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [100%]

- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/words.py                23      0   100%
------------------------------------------------------------
TOTAL                                       24      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

============================== 5 passed in 0.09s ===============================
poetry build
Building literate_wordle (0.1.0)
  - Building sdist
  - Built literate_wordle-0.1.0.tar.gz
  - Building wheel
  - Built literate_wordle-0.1.0-py3-none-any.whl

Tests pass, coverage stays strong, and linters are quiet. This is great!

Performance trick

We mentioned before that the whole dictionary gets unzipped on every request for assets. Now that we’re validating guessed words, we can expect to process guesses quite often, certainly more often than we pick secret words!

To make all this fast, we want to cache the unzipped dictionary, so that repeated calls to the function get_asset_zip_as_set don’t bother with the file open and unzip, and just serve the few hundred kilobytes of content from memory again. There’s a handy Python decorator that does the trick! Let’s add functools.cache on top of our slow function:

from functools import cache
@cache

After rerunning our tests, we now have a (theoretically) faster function, yey!
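To see what the decorator does in isolation, here is a minimal sketch with a stand-in function. The names slow_load and calls are hypothetical, not part of the project, and the sketch assumes Python 3.9 or later, where functools.cache is available.

```python
from functools import cache

calls = {"count": 0}  # hypothetical counter to observe real invocations


@cache
def slow_load(name: str) -> frozenset[str]:
    """Stand-in for an expensive load; the body runs once per distinct argument."""
    calls["count"] += 1
    return frozenset({"crane", "abbey"})


first = slow_load("answers")
second = slow_load("answers")  # served from the cache, body not re-run
```

Note that the cache hands back the very same object on every call, so returning something immutable (a frozenset here) avoids one caller accidentally modifying what the next caller receives.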

Remember that we committed a couple of performance/optimization sins just then: we optimized prematurely (with no proof of slowness), and we did so without profiling information to guide us, so we very likely just optimized something that isn’t our bottleneck. I’m fine with that, I just wanted to showcase this cool decorator, which functions like an unbounded memoizer. Let’s see quick performance numbers before and after:

poetry run python3 -m timeit -v -n 1000 --setup "from literate_wordle.words import pick_answer_word, check_valid_word" "check_valid_word(pick_answer_word())"
raw times: 2.75 sec, 2.72 sec, 2.73 sec, 2.73 sec, 2.72 sec

1000 loops, best of 5: 2.72 msec per loop

And after caching:

raw times: 17.1 msec, 12.8 msec, 12.6 msec, 12.8 msec, 12.4 msec

1000 loops, best of 5: 12.4 usec per loop

That’s a gain of two orders of magnitude for a single line of code changed. Sweet.

Bug!

Doing some exploration of the accepted/answer word sets, I noticed an issue:

from literate_wordle.words import get_answers, get_accepted_words

answer_lengths = [len(word) for word in list(get_answers())]
accepted_lengths = [len(word) for word in list(get_accepted_words())]

print(set(answer_lengths))
print(set(accepted_lengths))
{0, 5}
{0, 5}

Each set contains a 0-length word, in other words, the empty string.

This is likely a classic issue due to DOS line endings: the last line of the file is only a carriage return, which is technically whitespace, and the call to strip() removes it, leaving an empty string item in the list.
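A small sketch of how this can happen, using a made-up two-word string instead of the real asset file: splitting on the newline character leaves one trailing piece after the final newline, and strip() turns any whitespace-only piece into the empty string.

```python
raw = "crane\r\nabbey\r\n"  # hypothetical two-word file with DOS line endings

pieces = raw.split("\n")  # the last piece is what follows the final newline
stripped = [piece.strip() for piece in pieces]
lengths = {len(word) for word in stripped}
```

The set of lengths ends up holding both 0 and 5, which matches what the exploration above printed.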

If this was a proper production issue we had just discovered, we would first turn the above snippet into a proper test case (asserting that no 0-length word exists, and seeing it go red), commit that, raise it as a bug, and work on a fix. But this code hasn’t reached production yet, and the bug itself is minor enough to not warrant that during our exploration phase.

We can fix this in multiple ways. We could change the behaviour of the get_accepted_words and get_answers functions (either via set operations to remove the empty item from the set, returning set(words) - set([""]), or more likely by removing empty entries during iteration), but that wouldn’t prevent future users of the buggy function get_asset_zip_as_set from hitting the same issue.

So let’s fix it at the root, the get_asset_zip_as_set function:

def get_asset_zip_as_set(asset_filename: str) -> set[str]:
    """Decompress a file in assets module into a set of words, separated by newline"""
    compressed_bytes = pkg_resources.read_binary(assets, asset_filename)
    string = gzip.decompress(compressed_bytes).decode("ascii")
    string_list = [word.strip().lower() for word in string.split("\n")]
    # Protect against whitespace-only lines during file-read causing empty stripped word
    non_empty_words = [word for word in string_list if len(word) != 0]
    return set(non_empty_words)

This was a good opportunity to play with List Comprehensions with filters, yey.
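To try the fix without touching the packaged assets, here is a self-contained sketch that round-trips a few words through gzip in memory. The helper name words_from_gzip_bytes and the sample data are hypothetical; only the decompress, strip and filter steps mirror the function above.

```python
import gzip


def words_from_gzip_bytes(compressed: bytes) -> set[str]:
    """Hypothetical helper: decompress newline-separated words, dropping blanks."""
    text = gzip.decompress(compressed).decode("ascii")
    cleaned = (line.strip().lower() for line in text.split("\n"))
    # Whitespace-only lines become "" after strip(), which is falsy
    return {word for word in cleaned if word}


sample = gzip.compress(b"Crane\r\nabbey\r\n\r\n")
words = words_from_gzip_bytes(sample)
```

The set comprehension with an if clause does the same filtering job as the list comprehension, just without the intermediate list.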

Tangle out all the code

The last section of each heading of this document is used for internal purposes. The code snippets defined above are usually out of order, especially the imports, or functions defined once as stubs, then re-defined with a proper implementation.

To avoid nonsensical Python file ordering, with import-feature-import-feature sequences that formatters would go crazy over, we define below the reordered code blocks as they should be output, using the noweb feature of org-mode. This lets us reference the code blocks above by name, and tangle them out into the proper files with the ordering and spacing one would expect of a real codebase.

This means we need to weave the code blocks manually. Instead of pointing them all to the same file and relying on the snippets’ top-to-bottom order, we now have an explicit code block where we template out “add this bit, now 2 lines below add that snippet, and then…”. This isn’t super pretty, but it gives complete control over layout, like the number of blank lines between functions, which was blocking adoption of the formatter “black” in this repository.

First, fix words.py imports being out of order in our narrative by tangling them via noweb, weaving the part 1 imports with the part 2 ones. This means isort (the import sorter, which sorts imports first by category, then alphabetically, with categories in this order: stdlib, then third-party packages, then local module imports) is now happy and won’t thrash these Python files. Also insert the cache decorator before the assets function, and substitute the check_valid_word function body with the real implementation instead of the dummy function defined initially.

<<choice-module-docstring>>

<<choice-stdlib>>
<<valid-cache-import>>
<<choice-stdlib2>>
<<valid-stdlib>>

<<choice-locallib>>

<<choice-magicstrings>>


<<choice-func-getdicts>>


<<valid-cache-decorator>>
<<choice-func-unzipdict-generic2>>


<<choice-func-pickanswer-generic>>


<<valid-func-proto>>
    <<valid-func-len-dict>>

Now the same thing with the tests file, which indeed is in proper order already, but would benefit from two-lines-between-tests to guarantee formatting:

<<test-valid-import>>


<<test-valid-1>>


<<test-valid-2>>


<<test-valid-3>>


<<test-valid-4>>
<<reject-reason-none>>

Calculating guessed word’s score

We can pick answer words, and we can check if a guess is a valid word, so now we have everything we need to score the guess! Let’s first define the overall feature:

Feature: Scoring guesses
  As a Wordle game
  I need to tell the player how good their guess is
  In order to help them find the proper answer

This sounds simple, but implementing this feature is tricky, because of edge cases like multiple identical characters in the answer, which need to be colored appropriately (What’s the proper way to do that? No clue yet, but we need to pin it down in requirements!). So again we’ll define Gherkin Scenarios for that Feature, to give examples of how the feature works in practice. So we write out:

Scenario: Perfect guess gives perfect score
  Given a wordle answer "crane"
  When scoring the guess "crane"
  Then score should be "🟩🟩🟩🟩🟩"

This seems easy enough, but we should notice that we’re assuming the guess is a valid word! We may want to just add another Given, like:

Given a guess that's a valid dictionary word

But this isn’t just a hypothesis from the current scenario, it’s valid for all scenarios of this feature: every scoring of a guess requires the guess to be a valid word. To avoid the tedious copying of that assumption in each Scenario, we can use a Gherkin Background for the feature:

Background:
  Given a guess that's a valid dictionary word

Perfect, so now we’re assuming the guess is a valid word, which means a dependency on having implemented the previous feature. We’re not specifying the guess word itself, though, which can still be scenario-specific. This makes our initial “perfect guess” scenario valid again, so we can use it.

If we’ve got the perfect answer, let’s have the opposite:

Scenario: No character in common
  Given a wordle answer "brave"
  When scoring the guess "skill"
  Then score should be "⬜⬜⬜⬜⬜"

Note that these scenarios don’t make any assumption about how many attempts at Wordle we’re at, or about winning or losing. This is purely a hypothetical example, disjoint from the actual playing of a Wordle game. We can deal with the win/lose consequences later, once we have a proper scoring of guesses implemented.

Can we start coding yet?

At this point, we can conceivably start the implementation work: “let’s go, we have work to do!” And we can add the “🟨” scenario later once we have code that works.

The problem of “what to do now” is interesting, because we can continue thinking up scenarios in Gherkin for a while, or we could make a start writing test code to match these claims, fix the red tests, implement towards green tests, and add scenarios as we realize that our implementation is lacking compared to the original intent of the game. That can certainly be done!

But while it’s tempting to jump into code first, I strongly believe we as developers should instead fully scope out the problem-space first. Pin down the exact requirements (in this case via Gherkin features and scenarios) before starting to touch any code. My reasoning is that it’s very easy to get tunnel vision when writing code, getting excited about the programming problems and losing track of what the “user” wants. We should instead write down the exact user needs first, and have a proper “ritual” for switching from our “User” hat to our “Developer” hat.

Finalizing the scoring scenarios

So, back to our gherkin scenarios, let’s add the yellow marker one:

Scenario: Character in wrong place
  Given a wordle answer "rebus"
  When scoring the guess "skull"
  Then score should be "🟨⬜🟨⬜⬜"

And to have a good sample of cases to test with, let’s use a table of examples to confirm scoring works out in more cases:

Scenario Outline: Scoring guesses
  Given a wordle <answer>
  When scoring <guess>
  Then score should be <score>

# Emoji (Unicode) character rendering is hard:
# Please forgive the table column alignment issues!
  Examples: A few guesses and their score
    | answer  | guess	| score		|
    | adage   | adobe	| 🟩🟩⬜⬜🟩	|
    | serif   | quiet	| ⬜⬜🟨🟨⬜	|
    | raise   | radix	| 🟩🟩⬜🟨⬜	|

Note how the “outline” system maps really well to the idea of “parametrized tests”. We can write the test case once, and have a decorator deal with the multiple instantiations with different data.

All right, that’s a few, moving on. But here is the corner case that is hardest to implement, written out as examples of the previous scenario:

Examples: Multiple occurrences of same character
  | answer | guess	| score		|
  | abbey  | kebab	| ⬜🟨🟩🟨🟨	|
  | abbey  | babes	| 🟨🟨🟩🟩⬜	|
  | abbey  | abyss	| 🟩🟩🟨⬜⬜	|
  | abbey  | algae	| 🟩⬜⬜⬜🟨	|
  | abbey  | keeps	| ⬜🟨⬜⬜⬜	|
  | abbey  | abate	| 🟩🟩⬜⬜🟨	|

Because this edge case was worrisome for accuracy, these sample answers and scores were taken from online example screenshots of the original Wordle website, and are thus considered accurate references.

Thinking about it, with “abbey” as reference, the score for “kebab” seems logical, with the first “b” occurrence matching as green, and the second being in the wrong place. The surprise comes from “keeps” where the first “e” counts, but the second doesn’t have an equivalent in the answer, hence flagged as “no such character”. That makes sense, but that’s not how a naive implementation of the game would do it! Hence why it’s worth thinking about the full problem before rushing the implementation.
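The rule behind these rows can be phrased as an invariant: for any letter, the number of colored tiles (green or yellow) in the guess never exceeds how often that letter occurs in the answer. Here is a small sketch that checks this against a few rows of the table. The function name is hypothetical, and this is only a necessary condition on a score, not a way to compute one.

```python
from collections import Counter


def colored_tiles_within_budget(answer: str, guess: str, score: str) -> bool:
    """Check no letter gets more colored tiles than it has occurrences in answer."""
    budget = Counter(answer)
    # Count guess letters whose tile is green or yellow (anything but white)
    used = Counter(char for char, tile in zip(guess, score) if tile != "⬜")
    return all(used[char] <= budget[char] for char in used)


rows = [
    ("abbey", "kebab", "⬜🟨🟩🟨🟨"),
    ("abbey", "keeps", "⬜🟨⬜⬜⬜"),
    ("abbey", "algae", "🟩⬜⬜⬜🟨"),
]
results = [colored_tiles_within_budget(*row) for row in rows]

# What a naive scorer might answer for "keeps": two yellows for a single "e"
naive_keeps_ok = colored_tiles_within_budget("abbey", "keeps", "⬜🟨🟨⬜⬜")
```

The reference rows all respect the invariant, while the made-up naive score for “keeps” breaks it, because “abbey” only has one “e” to give.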

Out of curiosity, I wonder if there are any wordle answers that contain three identical characters? Let’s see!

zgrep -i -E "([a-z]).*\1.*\1" \
    src/literate_wordle/assets/wordle_answers_dict.txt.gz \
    | wc -l
20

Really? 20? That’s harsh … show me one?

zgrep -i -E "([a-z]).*\1.*\1" \
    src/literate_wordle/assets/wordle_answers_dict.txt.gz \
    | head -n 1 \
    | sed 's/\r//'  # gets rid of CR characters in CRLF (DOS line endings)
bobby

Interesting. That must be hard to solve I imagine.
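For reference, the same backreference pattern works in Python’s re module. This is a minimal sketch; the helper name has_triple_letter is hypothetical and not part of the codebase.

```python
import re

# Same pattern as the zgrep call: one letter, then the same letter twice more
TRIPLE_LETTER = re.compile(r"([a-z]).*\1.*\1", re.IGNORECASE)


def has_triple_letter(word: str) -> bool:
    """Return True when some letter occurs at least three times in word."""
    return TRIPLE_LETTER.search(word) is not None


bobby_matches = has_triple_letter("bobby")
abbey_matches = has_triple_letter("abbey")
```

Here “bobby” matches thanks to its three “b”s, while “abbey” only has two and doesn’t.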

Writing up acceptance tests

With no more obvious pathological cases to cover in requirements, it’s time to switch to our developer hat, and write some (acceptance) tests!

def test_perfect_guess():
    """Scenario: Perfect guess gives perfect score"""
    # Given a wordle answer "crane"
    answer = "crane"
    # When scoring the guess "crane"
    our_guess = "crane"
    score = score_guess(our_guess, answer)
    # Then score should be "🟩🟩🟩🟩🟩"
    assert score == "🟩🟩🟩🟩🟩", "Perfect answer should give Perfect Score"

A score_guess function? Sounds reasonable. We’ll need to import it from a module…

from literate_wordle.guess import score_guess

This means we now need to create such a module.

"""Score guesses of Wordle game"""

We already defined most of the function (name, module, output), so let’s just write a stub that will make tests go red.

def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess"""
    return "⬜"

Now the test should fail appropriately. Let’s add a twist: we’ll mark the test function as expected to fail, because for now the feature hasn’t been implemented. This allows the test runner to mark all tests as OK despite known failures, and is perfect for known bugs being worked on, or new features being built. Imagine if every time we built new features via TDD, the commit that adds the test first made CI go red! No, we would rather have a nice “excuse” for this new test to fail, and have the build stay green, “with an expected failure”.

@pytest.mark.xfail(reason="Not implemented yet")

In the case of a known bug, the reason field would very likely be a bug identifier in the organisation’s bug tracker.

import pytest

Confirm these tests work, marked as xfail (“eXpected FAILure”):

make test
poetry run pytest
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 6 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED   [ 16%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [ 33%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 50%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED   [ 66%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [ 83%]
tests/test_scoring_guess.py::test_perfect_guess XFAIL (Not implement...) [100%]

- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/guess.py                 2      0   100%
src/literate_wordle/words.py                25      0   100%
------------------------------------------------------------
TOTAL                                       28      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

========================= 5 passed, 1 xfailed in 0.10s =========================

Note that we now have regular tests that pass, and this one test that fails as expected, and pytest, expecting it, doesn’t shout about the failure. Really handy.

Remember that “disabling” a test (marking it as pytest.mark.skip) is different from marking it as xfail: skipping a test avoids running it, while xfail tests do run, the assertion failure is just not reported as a failure. There’s even a flag (strict=True on the xfail marker, or the xfail_strict configuration option) to make xpass (tests expected to fail that ended up green) an actual test failure, for the cases where it’s important to track the failure itself.

More tests

Let’s implement the rest of the failing tests, so we can make it all red, then fix the implementation:

def test_no_common_character():
    """Scenario: No character in common"""
    # Given a wordle answer "brave"
    answer = "brave"
    # When scoring the guess "skill"
    our_guess = "skill"
    score = score_guess(our_guess, answer)
    # Then score should be "⬜⬜⬜⬜⬜"
    assert score == "⬜⬜⬜⬜⬜", "No character in common with answer should give 0 score"


def test_wrong_place():
    """Scenario: Character in wrong place"""
    # Given a wordle answer "rebus"
    answer = "rebus"
    # When scoring the guess "skull"
    our_guess = "skull"
    score = score_guess(our_guess, answer)
    # Then score should be "🟨⬜🟨⬜⬜"
    assert score == "🟨⬜🟨⬜⬜", "Characters are in the wrong place"

That covers the first three scenarios.

For the Scenario Outline, it’s interesting to notice that a pattern emerged, which allows the same test skeleton to be reused with different data. In Pytest, this can be done by “parametrizing” the test with multiple data entries.

Parametrizing is done via a decorator that attaches the data to the test. Since we’re also trying to label those tests as belonging to different groups, we will wrap each entry in pytest.param and use its id argument.

def test_generic_score(answer, our_guess, expected_score):
    """Scenario Outline: Scoring guesses"""
    # Given a wordle <answer>
    # When scoring <guess>
    score = score_guess(our_guess, answer)
    # Then score should be <score>
    assert score == expected_score

Just need to fill in the parameters:

@pytest.mark.parametrize(
    "answer,our_guess,expected_score",
    [
        pytest.param("adage", "adobe", "🟩🟩⬜⬜🟩", id="normal_guess1"),
        pytest.param("serif", "quiet", "⬜⬜🟨🟨⬜", id="normal_guess2"),
        pytest.param("raise", "radix", "🟩🟩⬜🟨⬜", id="normal_guess3"),
        pytest.param("abbey", "kebab", "⬜🟨🟩🟨🟨", id="multi_occur1"),
        pytest.param("abbey", "babes", "🟨🟨🟩🟩⬜", id="multi_occur2"),
        pytest.param("abbey", "abyss", "🟩🟩🟨⬜⬜", id="multi_occur3"),
        pytest.param("abbey", "algae", "🟩⬜⬜⬜🟨", id="multi_occur4"),
        pytest.param("abbey", "keeps", "⬜🟨⬜⬜⬜", id="multi_occur5"),
        pytest.param("abbey", "abate", "🟩🟩⬜⬜🟨", id="multi_occur6"),
    ],
)

Implementing the feature

With the strong test harness we have, this scoring function can be done conveniently.

Let’s experiment with the solution, iterating over naive solutions and seeing how close they get to implementing the feature, measured by the number of failed tests. This isn’t required, since we have already identified edge cases that make naive solutions break, but this is the fun experimenting part.

Before any actual code change, first we remove the “xfail” marker, so that test failures actually notify us as failures, as we’re actually implementing things now.

def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess naively"""
    NO = "⬜"
    OK = "🟩"
    response = ""
    for answer_char, guess_char in zip(answer, guess):
        if answer_char == guess_char:
            response += OK
        else:
            response += NO
    return response

That only passes 3 tests of the 12 we just defined, obviously because we don’t deal with incorrect characters at all. So let’s add keeping track of characters in the wrong places:

def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess a little less naively"""
    NO = "⬜"
    OK = "🟩"
    WRONG_PLACE = "🟨"
    answer_chars_set = set(list(answer))
    response = ""
    for answer_char, guess_char in zip(answer, guess):
        if answer_char == guess_char:
            response += OK
        elif guess_char in answer_chars_set:
            response += WRONG_PLACE
        else:
            response += NO
    return response

That version now passes 8 of 12 tests, with the issue being that multiple occurrences of the same character in the answer are treated wrong, clearly an edge case we were fortunate to identify early.

Looking at the examples, it seems that our scoring function needs to keep track of how many occurrences of each character exist in the answer overall, and grade a guessed character as “wrong place” only while occurrences remain, reducing the counter each time.

Fortunately, Python’s standard library provides a good Counter class which we can import:

from collections import Counter
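A quick sketch of the Counter behaviour the scoring logic leans on, with made-up variable names: counts can be decremented in place, entries can be deleted, and reading an absent key gives 0 instead of raising.

```python
from collections import Counter

remaining = Counter("abbey")  # counts: a=1, b=2, e=1, y=1

remaining["b"] -= 1  # "use up" one of the two b's
remaining["e"] -= 1  # use the only e...
if remaining["e"] == 0:
    del remaining["e"]  # ...and forget it entirely

e_still_known = "e" in remaining
missing_count = remaining["z"]  # absent keys read as 0, no KeyError
```

Deleting exhausted entries is what lets a plain membership test double as a check for remaining occurrences.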

We want something like this:

if guess_char in answer_chars and answer_chars[guess_char] > 0:
    response += WRONG_PLACE
    # Reduce occurrence since we "used" this one
    answer_chars[guess_char] -= 1
    # No more hits = pretend character isn't even seen (remove from dict)
    if answer_chars[guess_char] == 0:
        del answer_chars[guess_char]

So we try the Counter way:

def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    NO = "⬜"
    OK = "🟩"
    WRONG_PLACE = "🟨"
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    response = ""
    for answer_char, guess_char in zip(answer, guess):
        if answer_char == guess_char:
            response += OK
        elif guess_char in answer_chars and answer_chars[guess_char] > 0:
            response += WRONG_PLACE
            # Reduce occurrence since we "used" this one
            answer_chars[guess_char] -= 1
            # No more hits = pretend character isn't even seen (remove from dict)
            if answer_chars[guess_char] == 0:
                del answer_chars[guess_char]
        else:
            response += NO
    return response

But while this improves the pass count, we are still 3 tests from success! It turns out we only reduced the counter for yellows, not for greens as well. This needs a bit of reshuffling:

def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    NO = "⬜"
    OK = "🟩"
    WRONG_PLACE = "🟨"
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    response = ""
    for guess_char, answer_char in zip(guess, answer):
        if guess_char not in answer_chars:
            response += NO
            continue  # Early exit for this character, skip to next
        # From here on, we MUST have a char in common, regardless of place
        if answer_char == guess_char:
            response += OK
        elif answer_chars[guess_char] > 0:
            response += WRONG_PLACE
        # Either way, reduce occurrence counter since we "used" this occurrence
        answer_chars[guess_char] -= 1
        # No more hits = pretend character isn't even seen (remove from dict)
        if answer_chars[guess_char] == 0:
            del answer_chars[guess_char]
    return response

Now that we’re happy with this, we can refactor out the ugly hardcoded glyphs:

class CharacterScore(str, Enum):
    """A single character's score"""

    OK = "🟩"
    NO = "⬜"
    WRONG_PLACE = "🟨"

from enum import Enum

And to use it as part of our scoring function:

def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    response = ""
    for guess_char, answer_char in zip(guess, answer):
        if guess_char not in answer_chars:
            response += CharacterScore.NO
            continue  # Early exit for this character, skip to next
        # From here on, we MUST have a char in common, regardless of place
        if answer_char == guess_char:
            response += CharacterScore.OK
        elif answer_chars[guess_char] > 0:
            response += CharacterScore.WRONG_PLACE
        # Either way, reduce occurrence counter since we "used" this occurrence
        answer_chars[guess_char] -= 1
        # No more hits = pretend character isn't even seen (remove from dict)
        if answer_chars[guess_char] == 0:
            del answer_chars[guess_char]
    return response
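
One caveat, offered as an aside rather than as part of the implementation tangled below: because the loop walks the guess left to right, an early misplaced letter can use up the counter before a later exact match of that same letter is reached. For instance, guessing "bobby" against "orbit" marks the first "b" as 🟨 and then scores the correctly placed third "b" as ⬜. The twelve tests above don’t hit this case. A possible two-pass variant, sketched here under the hypothetical name score_guess_two_pass, counts only the answer letters that were not matched exactly:

```python
from collections import Counter

NO = "⬜"
OK = "🟩"
WRONG_PLACE = "🟨"


def score_guess_two_pass(guess: str, answer: str) -> str:
    """Hypothetical variant: exact matches are set aside before counting"""
    # First pass: count only answer letters NOT matched in place
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    response = []
    # Second pass: greens are decided by position, yellows by leftovers
    for guess_char, answer_char in zip(guess, answer):
        if guess_char == answer_char:
            response.append(OK)
        elif remaining[guess_char] > 0:
            response.append(WRONG_PLACE)
            remaining[guess_char] -= 1  # this leftover copy is now "used"
        else:
            response.append(NO)
    return "".join(response)


print(score_guess_two_pass("bobby", "orbit"))  # ⬜🟨🟩⬜⬜
```

Whether this matters depends on how closely the scoring should follow the original game; the rest of this document keeps the single-pass version.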

Tangle it all out

As before, we reorder the blocks from snippets above to export code in a way that keeps proper formatting.

<<scoring-guessmod-header>>


<<scoring-guessfunc-import>>
<<scoring-guess-enum-import>>


<<scoring-guess-enum>>


<<scoring-guessfunc-impl2>>
"""Validates the Gherkin file features/scoring_guess.feature:

<<scoring-feature>>
"""

<<scoring-test-import-pytest>>

<<scoring-test-import>>


<<scoring-test1>>


<<scoring-test2>>


<<scoring-test3>>


<<scoring-multi-parameters>>
<<scoring-multi-skeleton>>

Playing a round of Wordle

With all the subfeatures we have, we can almost play a round of Wordle: we’re missing only the “state” of the game board, and the interactivity of the game.

Feature: Track number of guesses
  As a Wordle game
  I need to track how many guesses were already given
  In order to announce win or game over

There are a few obvious cases we want to see:

Scenario: First guess is allowed
  Given a wordle answer
  And I didn't guess before
  When I guess the word
  Then my guess is scored

Scenario: Sixth guess still allowed
  Given a wordle answer
  And I guessed 5 times
  When I guess the word
  Then my guess is scored

Scenario: Sixth failed guess is game over
  Given a wordle answer
  And I guessed 6 times already
  When I guess the word
  And my guess isn't the answer
  Then my guess is scored
  But game shows "Game Over"
  And game shows the real answer

This feature shows us all the state we need to manage to track a Wordle game:

  • an answer
  • the number of previous guesses
  • the previous guesses themselves? not needed after we print them
  • the previous guesses’ scores? not needed after we print them either

So a Wordle Game is the aggregate of “answer” + “number of guesses”, nothing else.

Let’s write the test:

"""Validates the Gherkin file features/track_guesses.feature

<<track-guess-feat2>>
"""

def test_first_guess_allowed():
    """Scenario: First guess is allowed"""
    # Given a wordle answer
    answer = "orbit"
    # And I didn't guess before
    guess_number = 0
    game = WordleGame(answer=answer, guess_number=guess_number)
    # When I guess the word
    guess = "kebab"
    result = play_round(guess, game)
    # Then my guess is scored
    OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
    assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
    assert result.score is not None, "No score given as result"
    assert len(result.score) == 5, "Score of incorrect length"
    ALLOWED_CHARS = [score.value for score in Score]
    assert all(
        char in ALLOWED_CHARS for char in list(result.score)
    ), "Score doesn't match score's characters"

In the test above, I’ve done quite a bit of world-building:

  • Used a new WordleGame structure keeping game state
  • Used a new WordleMoveOutcome enumeration to describe outcomes
  • Used a new play_round function that takes a game + guess
  • Implied, through the result variable, a structure holding the new game state after a move

from literate_wordle.game import WordleGame, WordleMoveOutcome, play_round
from literate_wordle.guess import CharacterScore as Score

This practice of calling an API that doesn’t exist yet is the coolest part of TDD, because the tests help design what the software should feel like, even if we have no idea how to create the backend to that API yet. Focusing on how the feature is used, rather than the usual engineering mindset of how we envision the backend, is very valuable.

All right, so with that in mind, let’s start actually building these data structures.

class WordleMoveOutcome(Enum):
    """Outcome of a single move"""

    GAME_OVER_LOST = 1
    GAME_WON = 2
    GUESS_SCORED_CONTINUE = 3

@dataclass
class WordleGame:
    """A Wordle game's internal state, before a move is played"""

    answer: str
    guess_number: int


@dataclass
class WordleMove:
    """A Wordle game state once a move is played"""

    game: WordleGame
    outcome: WordleMoveOutcome
    message: str
    score: Optional[str]

from dataclasses import dataclass
from enum import Enum
from typing import Optional

With the datastructures ready, we can define our function’s signature:

def play_round(guess: str, game: WordleGame) -> WordleMove:
    """Use guess on the given game, resulting in WordleMove"""

Before we finish implementing this function, let’s define the rest of the acceptance tests we settled on in Gherkin:

def test_sixth_guess_allowed():
    """Scenario: Sixth guess still allowed"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 5 times
    guess_number = 6
    game = WordleGame(answer=answer, guess_number=guess_number)
    # When I guess the word
    guess = "kebab"
    result = play_round(guess, game)
    # Then my guess is scored
    OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
    assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
    assert result.score is not None, "No score given as result"
    assert len(result.score) == 5, "Score of incorrect length"
    OK_CHARS = ["🟩", "🟨", "⬜"]
    assert all(
        char in OK_CHARS for char in list(result.score)
    ), "Score doesn't match score's characters"

def test_seventh_guess_fails_game():
    """Scenario: Sixth failed guess is game over"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 6 times already
    # Guessing 6 times BEFORE, using seventh now:
    guess_number = 7
    game = WordleGame(answer, guess_number)
    # When I guess the word
    # And my guess isn't the answer
    guess = "kebab"
    result = play_round(guess, game)
    # Then my guess isn't scored
    assert result.outcome == WordleMoveOutcome.GAME_OVER_LOST, "Should have lost game"
    # But game shows "Game Over"
    assert "game over" in result.message.lower(), "Should show game over message"
    # And game shows the real answer
    assert answer in result.message

As I write the test in Listing track-guess-test3, I notice there’s one case of the enum we haven’t covered (WordleMoveOutcome.GAME_WON), which means the play_round scenarios aren’t complete yet. Let’s add the scenario for winning the game!

Scenario: Winning guess
  Given a wordle answer
  And I guessed 3 times
  When I guess the word
  And my guess is the answer
  Then my guess is scored
  And score is perfect
  And game shows "Game Won"

A little thought later, it seems we mixed up the requirements a little here (it happens!). When designing the Gherkin Feature, we wrote about exhausting the number of guesses; we weren’t thinking of win/lose conditions. But when writing a play_round function, those conditions are very relevant, especially since the existing scenarios covered most of the cases already. Ideally, we would have added a separate Feature describing winning and losing, and dealt with it separately. In practice, here, it’s simpler to just expand the feature’s scope, even if it means the scope has crept a little. This is what real engineering is about: aiming for perfection, but making compromises to match our imperfect world where deadlines and tired developers exist.

Let’s fill in our winning case test:

def test_winning_guess_wins():
    """Scenario: Winning guess"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 3 times
    guess_number = 3
    game = WordleGame(answer, guess_number)
    # When I guess the word
    # And my guess is the answer
    guess = answer
    result = play_round(guess, game)
    # Then my guess is scored
    assert result.score is not None, "Guess should be scored"
    # And the score is perfect
    assert result.score == "🟩🟩🟩🟩🟩"
    # And game shows "Game Won"
    assert result.outcome == WordleMoveOutcome.GAME_WON, "Should have won game"
    assert "game won" in result.message.lower()

With all the tests ready, we cobble together a stub for play_round to execute the tests and see them go red.

result = WordleMoveOutcome.GAME_OVER_LOST
return WordleMove(game=game, outcome=result, message="You suck!", score=None)

All right, the tests do fail, right?

poetry run pytest 2>&1 || true
============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collecting ... collected 21 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED   [  4%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [  9%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 14%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED   [ 19%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED                 [ 23%]
tests/test_scoring_guess.py::test_perfect_guess PASSED                   [ 28%]
tests/test_scoring_guess.py::test_no_common_character PASSED             [ 33%]
tests/test_scoring_guess.py::test_wrong_place PASSED                     [ 38%]
tests/test_scoring_guess.py::test_generic_score[normal_guess1] PASSED    [ 42%]
tests/test_scoring_guess.py::test_generic_score[normal_guess2] PASSED    [ 47%]
tests/test_scoring_guess.py::test_generic_score[normal_guess3] PASSED    [ 52%]
tests/test_scoring_guess.py::test_generic_score[multi_occur1] PASSED     [ 57%]
tests/test_scoring_guess.py::test_generic_score[multi_occur2] PASSED     [ 61%]
tests/test_scoring_guess.py::test_generic_score[multi_occur3] PASSED     [ 66%]
tests/test_scoring_guess.py::test_generic_score[multi_occur4] PASSED     [ 71%]
tests/test_scoring_guess.py::test_generic_score[multi_occur5] PASSED     [ 76%]
tests/test_scoring_guess.py::test_generic_score[multi_occur6] PASSED     [ 80%]
tests/test_track_guess_number.py::test_first_guess_allowed FAILED        [ 85%]
tests/test_track_guess_number.py::test_sixth_guess_allowed FAILED        [ 90%]
tests/test_track_guess_number.py::test_seventh_guess_fails_game FAILED     [ 95%]
tests/test_track_guess_number.py::test_winning_guess_wins FAILED         [100%]

=================================== FAILURES ===================================
___________________________ test_first_guess_allowed ___________________________

    def test_first_guess_allowed():
        """Scenario: First guess is allowed"""
        # Given a wordle answer
        answer = "orbit"
        # And I didn't guess before
        guess_number = 0
        game = WordleGame(answer=answer, guess_number=guess_number)
        # When I guess the word
        guess = "kebab"
        result = play_round(guess, game)
        # Then my guess is scored
        OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
>       assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
E       AssertionError: Game shouldn't be over yet
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         <WordleMoveOutcome.GAME_OVER_LOST: 1>
E         <WordleMoveOutcome.GUESS_SCORED_CONTINUE: 3>
E

tests/test_track_guess_number.py:25: AssertionError
___________________________ test_sixth_guess_allowed ___________________________

    def test_sixth_guess_allowed():
        """Scenario: Sixth guess still allowed"""
        # Given a wordle answer
        answer = "orbit"
        # And I guessed 5 times
        guess_number = 6
        game = WordleGame(answer=answer, guess_number=guess_number)
        # When I guess the word
        guess = "kebab"
        result = play_round(guess, game)
        # Then my guess is scored
        OUTCOME_CONTINUE = WordleMoveOutcome.GUESS_SCORED_CONTINUE
>       assert result.outcome == OUTCOME_CONTINUE, "Game shouldn't be over yet"
E       AssertionError: Game shouldn't be over yet
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         <WordleMoveOutcome.GAME_OVER_LOST: 1>
E         <WordleMoveOutcome.GUESS_SCORED_CONTINUE: 3>
E

tests/test_track_guess_number.py:46: AssertionError
_________________________ test_seventh_guess_fails_game _________________________

    def test_seventh_guess_fails_game():
        """Scenario: Sixth failed guess is game over"""
        # Given a wordle answer
        answer = "orbit"
        # And I guessed 6 times already
        # Guessing 6 times BEFORE, using seventh now:
        guess_number = 7
        game = WordleGame(answer, guess_number)
        # When I guess the word
        # And my guess isn't the answer
        guess = "kebab"
        result = play_round(guess, game)
        # Then my guess isn't scored
        assert result.outcome == WordleMoveOutcome.GAME_OVER_LOST, "Should have lost game"
        # But game shows "Game Over"
>       assert "game over" in result.message.lower(), "Should show game over message"
E       AssertionError: Should show game over message
E       assert in failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         game over
E         you suck!
E

tests/test_track_guess_number.py:69: AssertionError
___________________________ test_winning_guess_wins ____________________________

    def test_winning_guess_wins():
        """Scenario: Winning guess"""
        # Given a wordle answer
        answer = "orbit"
        # And I guessed 3 times
        guess_number = 3
        game = WordleGame(answer, guess_number)
        # When I guess the word
        # And my guess is the answer
        guess = answer
        result = play_round(guess, game)
        # Then my guess is scored
>       assert result.score is not None, "Guess should be scored"
E       AssertionError: Guess should be scored
E       assert is not failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         None
E

tests/test_track_guess_number.py:86: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/game.py                 20      0   100%
src/literate_wordle/guess.py                19      0   100%
src/literate_wordle/words.py                25      0   100%
------------------------------------------------------------
TOTAL                                       65      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

=========================== short test summary info ============================
FAILED tests/test_track_guess_number.py::test_first_guess_allowed - Assertion...
FAILED tests/test_track_guess_number.py::test_sixth_guess_allowed - Assertion...
FAILED tests/test_track_guess_number.py::test_seventh_guess_fails_game - Asse...
FAILED tests/test_track_guess_number.py::test_winning_guess_wins - AssertionE...
========================= 4 failed, 17 passed in 0.18s =========================

All right, let’s implement this.

Implementing the feature

First, if we have too many guesses already (before this one), we return game lost. This means we decide to fail not at the end of the failed sixth guess, but at the beginning of the seventh.

if game.guess_number >= 7:
    message = f"Too many guesses: Game Over. Answer was: {game.answer}"
    outcome = WordleMoveOutcome.GAME_OVER_LOST
    return WordleMove(game=game, outcome=outcome, message=message, score=None)

In order to count a guess, it needs to be a valid word. This means importing some of our package’s functions.

from literate_wordle.guess import score_guess
from literate_wordle.words import check_valid_word

As we write the code to check if the guess is a valid word, we notice that if the word isn’t valid, we can’t return GUESS_SCORED_CONTINUE, because an invalid-word guess shouldn’t be counted against the player! So we again revise the WordleMoveOutcome enum, and because it’s a new enum case, we will need to add a test for it to cover all bases! Let’s put a pin in that, and finish implementing this first.

GUESS_NOTVALID_CONTINUE = 4

To compensate for having this enum case defined out of order, we’ll again use the noweb feature to weave the code back into the enum in the subsection below, inserting this fourth possibility in the correct place so the code looks like it should.

valid, validity_msg = check_valid_word(guess)
if not valid and validity_msg is not None:
    outcome = WordleMoveOutcome.GUESS_NOTVALID_CONTINUE
    return WordleMove(game=game, outcome=outcome, message=validity_msg, score=None)

Now we’ve gotten rid of the cases where the guess was invalid.

# Guess now guaranteed to be valid: count it
game.guess_number += 1
score = score_guess(guess, game.answer)
if score == "🟩🟩🟩🟩🟩":
    outcome = WordleMoveOutcome.GAME_WON
    message = f"Correct! Game won in {game.guess_number - 1} guesses"
    return WordleMove(game=game, outcome=outcome, message=message, score=score)

Hmm, but wouldn’t it be nice to avoid this hardcoded blob? Let’s extend the CharacterScore to give this.

@classmethod
@property
def perfect_score(cls) -> str:
    """All-good Wordle score for perfect guess"""
    return "".join([cls.OK] * 5)

if score == CharacterScore.perfect_score:
    outcome = WordleMoveOutcome.GAME_WON
    message = f"Correct! Game won in {game.guess_number - 1} guesses"
    return WordleMove(game=game, outcome=outcome, message=message, score=score)

from literate_wordle.guess import CharacterScore, score_guess
from literate_wordle.words import check_valid_word

# Only case left is "try another guess"
outcome = WordleMoveOutcome.GUESS_SCORED_CONTINUE
message = f"Try again! Guess number {game.guess_number - 1}. Score is: {score}"
return WordleMove(game=game, outcome=outcome, message=message, score=score)
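
A portability aside on perfect_score: stacking @classmethod on top of @property works on the Python 3.9 used here, but that chaining was deprecated in Python 3.11 and removed in 3.13. If the code ever needs to run on newer interpreters, one possible alternative (a sketch, not what this project tangles out) is a plain classmethod, at the cost of call sites needing parentheses:

```python
from enum import Enum


class CharacterScore(str, Enum):
    """A single character's score (copy of the enum above, for illustration)"""

    OK = "🟩"
    NO = "⬜"
    WRONG_PLACE = "🟨"

    @classmethod
    def perfect_score(cls) -> str:
        """All-good Wordle score for perfect guess"""
        # Plain classmethod: called as CharacterScore.perfect_score()
        return cls.OK.value * 5


perfect = CharacterScore.perfect_score()
```

With this variant, the comparison in play_round would read score == CharacterScore.perfect_score() instead.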

Note that throughout this codebase, we made a lot of assumptions and repetitions around the length of a Wordle answer/guess, and this translates to repeated hardcoded-ness like the emojis above. These could have been addressed right away during implementation, and in places they were, but it’s important to consider whether the scope increase is worth it: generalizing Wordle to N characters isn’t super interesting to me, as it would require cutting new dictionaries, etc, and I’m just not that into Wordle. This is the kind of technical design decision we can make by having a firm grasp on project scope, another advantage of a deep understanding of project requirements.

Back to the implementation: tests should all pass now, make is happy, but there’s an interesting issue:

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              1      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/game.py                 38      2    95%
src/literate_wordle/guess.py                19      0   100%
src/literate_wordle/words.py                25      0   100%
------------------------------------------------------------
TOTAL                                       83      2    98%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

We lowered coverage, nooo! Exploring the coverage HTML file in a browser, we see that the lines in question that aren’t covered are:

if not valid and validity_msg is not None:
    outcome = WordleMoveOutcome.GUESS_NOTVALID_CONTINUE
    return WordleMove(game=game, outcome=outcome, message=validity_msg, score=None)

Oh! That’s the test case we put a pin in! Right, so we’re back to writing that test. I wonder if we should write a whole scenario to back it up? It’s not really obvious!

If this test case spins out of an edge case of our implementation, it’s not really coming from a business requirement, so it’s probably not worth writing a Gherkin Scenario alongside the other ones. If it is indeed an overlooked requirement, then yes, add it to the requirements pile and write a feature.

Hmm, let’s write the test first, and see if the scenario that emerges is a requirement.

def test_invalid_guess_not_counted():
    """Scenario: Invalid guess isn't counted"""
    # Given a wordle answer
    answer = "orbit"
    # And I guessed 3 times
    guess_number = 3
    game = WordleGame(answer=answer, guess_number=guess_number)
    # When I guess the word
    # But my guess isn't a dictionary word
    guess = "xolfy"
    result = play_round(guess, game)
    # Then my guess is rejected as invalid word
    OUTCOME_BADWORD = WordleMoveOutcome.GUESS_NOTVALID_CONTINUE
    assert result.outcome == OUTCOME_BADWORD, "Guess should have been rejected"
    # And my guess is not scored
    assert result.score is None, "No score should be given on bad word"

Hmm, after some thought, it seems that the function we implemented is indeed different from the feature described in Gherkin!

As mentioned before, the Gherkin feature was about tracking a specific number of guesses, but we increased scope to consider the wider win scenario, using the “play round” feature. Expanding the feature again to cover more than just how many guesses were made, it now needs to understand whether the guess is a valid word or not.

So for the specific purpose of tracking guesses as a feature, we’re already covered by existing scenarios. But we are missing edge cases of the implementation, as we saw in the coverage metrics, and those belong to the wider behaviour that a “play a round” Feature would cover.

With this game’s implementation so near completion, I am not interested in creating another feature file. I’ll just expand the original feature a bit to be about playing a whole round, wins and losses included, just to keep this narrative barely on track.

Feature: Playing a round
  As a Wordle game
  I need to track how many guesses were already given, stating wins/losses
  In order to play the game

Scenario: Invalid guess isn't counted
  Given a wordle answer
  And I guessed 3 times
  When I guess the word
  But my guess isn't a dictionary word
  Then my guess is rejected as invalid word
  And my guess is not scored

And with this new test, we’re back to passing tests and 100% coverage!

Tangling out the whole thing

The feature first:

<<track-guess-feat2>>

<<track-guess-scenario1>>

<<track-guess-scenario2>>

<<track-guess-scenario3>>

<<track-guess-scenario4>>

<<track-guess-scenario5>>

The tests:

<<track-guess-test-docs>>


<<track-guess-test-import>>


<<track-guess-test1>>


<<track-guess-test2>>


<<track-guess-test3>>


<<track-guess-test4>>


# Case covered by existing gherkin feature:
# Intentional, see wordle.org for reasoning
<<track-guess-test5>>
"""Wordle game's state and playing rounds"""


<<track-guess-import-dataclass>>

<<track-guess-import-module>>


<<track-guess-gamestate1>>
    <<track-guess-enum4>>


<<track-guess-gamestate2>>


<<track-guess-proto>>
    <<track-guess-impl1>>
    <<track-guess-impl2>>
    <<track-guess-impl3>>
    <<track-guess-impl4>>
    <<track-guess-impl5>>
    <<track-guess-impl6>>

And remember that we had to expand the CharacterScore, so we need to re-tangle it here:

<<scoring-guessmod-header>>


<<scoring-guessfunc-import>>
<<scoring-guess-enum-import>>


<<scoring-guess-enum>>

    <<track-guess-perfectscore>>


<<scoring-guessfunc-impl2>>

Final round: command line interface

We have assembled lego bricks into an almost finished product, as we have enough to play a single round. Let’s give this project a shell command to invoke, tying together all the other disjointed features.

Feature: Pywordle shell command
  As a Wordle game
  I need a shell command to launch the game
  In order to give convenient entrypoint for players

I don’t think it’s necessary to give specific scenarios, because we’ve thoroughly tested the underlying implementation of the game, we just need to assemble it into a shell command.

So let’s define an entrypoint for the game, generating a new one:

def new_game() -> WordleGame:
    """Generate a new WordleGame"""
    return WordleGame(answer=pick_answer_word(), guess_number=1)

And how to play until the game is won or lost, reporting each response as we go:

def play_game(game: WordleGame, guess_fetcher: Callable, response_logger: Callable):
    """Plays the given WordleGame until completion.

    Asks guess_fetcher for guess, and sends response to response_logger
    """
    outcome = WordleMoveOutcome.GUESS_SCORED_CONTINUE  # Gotta start somehow
    while outcome not in {WordleMoveOutcome.GAME_WON, WordleMoveOutcome.GAME_OVER_LOST}:
        guess = guess_fetcher()
        result = play_round(guess=guess, game=game)
        response_logger(result.message)
        game = result.game
        outcome = result.outcome

Pepper in the few imports we need:

from typing import Callable
from literate_wordle.game import WordleGame, WordleMoveOutcome, play_round
from literate_wordle.words import pick_answer_word

Now we can add command line argument parsing in a separate file:

def parse_args(raw_args: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse given command line arguments"""
    description = "Wordle implementation in Python, as literate programming"
    # Bit overkill since there is no real argument to parse yet
    parser = argparse.ArgumentParser(prog="pywordle", description=description)
    return parser.parse_args(raw_args)

import argparse
from typing import Optional, Sequence

def play_game_args(raw_args: Optional[Sequence[str]] = None):
    """Play a standard Wordle game from stdin to stdout, given args"""
    _ = parse_args(raw_args)
    game = new_game()
    play_game(game=game, guess_fetcher=input, response_logger=print)

def main():
    """Pass sys.argv to the play_game_args function"""
    play_game_args(sys.argv[1:])

import sys
from literate_wordle.main import new_game, play_game

Since both our main and cli modules are the interactive entrypoint and aren’t meant to be unit tested, it’s a bit unfair to compute coverage over them. Let’s exclude these two files, so they don’t weigh down the coverage metric.

[run]
omit =
    # Don't compute coverage for these 2 manual invocation files
    src/literate_wordle/main.py
    src/literate_wordle/cli.py

Tangling it out

"""Entrypoint for pywordle"""


<<cli-main-import-std>>

<<cli-main-import-mod>>


<<cli-main1>>


<<cli-main2>>


<<cli-main3>>
"""Command line entrypoint for pywordle"""


<<cli-pargs-import-std1>>
<<cli-pargs-import-std3>>
<<cli-pargs-import-std2>>

<<cli-pargs-import-mod>>


<<cli-pargs1>>


<<cli-pargs2>>


<<cli-pargs3>>

Launching as CLI

In Python, when using Poetry like we are, the package is defined in pyproject.toml. Defining a new command means using the tool.poetry.scripts table:

[tool.poetry.scripts]
pywordle = "literate_wordle.cli:main"

So we can now manually invoke this tool. And for the given argument parser, a help message should be available:

poetry run pywordle --help
usage: pywordle [-h]

Wordle implementation in Python, as literate programming

optional arguments:
  -h, --help  show this help message and exit

And we can play a round!

$ poetry shell
$ pywordle
hello
Try again! Guess number 1. Score is: ⬜🟨🟨⬜🟨
lobes
Try again! Guess number 2. Score is: 🟨🟩⬜🟩⬜
cranes
Guess too long
crane
Try again! Guess number 3. Score is: ⬜⬜⬜🟨🟨
novel
Correct! Game won in 4 guesses

Taking a step back, we’ve got command line launch of the game, and we can play with it. We’re done here, especially for a short experimental project.

But if this codebase were to be maintained, extended, or reused, the bar for “acceptable” test coverage would be much higher.

For instance, we have no test overall on the game loop of guess input/output, despite all the layers below being pretty well covered. So I’d want tests that call the play_game function with scripted inputs and log the outputs, taking advantage of the dependency injection we set up to make proper UI-oriented tests. These would reveal, for instance, that when launching the game, there is nothing greeting us, no prompt for a guess, which is a usability issue.

In our case, that’s an exercise left for the reader.
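
As a starting point for that exercise, here is a rough sketch of what scripted inputs and logged outputs could look like. Everything in it is hypothetical: scripted_guesses is an illustrative helper, and stand_in_game merely imitates the shape of play_game (a guess_fetcher and a response_logger) so that the snippet runs on its own without the literate_wordle package.

```python
from typing import Callable, Iterable, List


def scripted_guesses(guesses: Iterable[str]) -> Callable[[], str]:
    """Build a guess_fetcher that replays the given guesses in order"""
    iterator = iter(guesses)
    # Raises StopIteration if the game asks for more guesses than scripted
    return lambda: next(iterator)


def stand_in_game(
    guess_fetcher: Callable[[], str], response_logger: Callable[[str], None]
) -> None:
    """Stand-in with the same injection points as play_game (NOT the real loop)"""
    while True:
        guess = guess_fetcher()
        if guess == "orbit":
            response_logger("Correct!")
            return
        response_logger("Try again!")


# Collect every message in a list instead of printing to stdout
messages: List[str] = []
stand_in_game(scripted_guesses(["kebab", "orbit"]), messages.append)
```

With the real package, the same two callables could presumably be passed as guess_fetcher and response_logger to play_game, with a WordleGame built around a known answer, and the assertions made against the collected messages.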

Remember that testing’s primary goal is to increase our trust in the system we build.

In that vein, because we’ve got feature acceptance tests covered for every layer, the biggest source of uncertainty in the system is the implementation itself: we’re just not shaking out the code very much, beyond what a normal usage would look like. This calls for exploring the edge cases that code may have, regardless of intended features. Every string parameter should be tried with empty string, uppercase vs lowercase, different encoding, etc.

Conclusion

We just walked through building a simple wordle program from scratch, using literate programming to weave a novel’s worth of explanations and reasoning, with code blocks that export to the proper project code locations.

The project uses modern Python tooling (poetry, pytest) and uses formatters (black, isort), linters (flake8 with plugins), type checkers (mypy), and the project generates its own general documentation (including this page, if you’re reading it in a browser) and API reference (Sphinx with myst_parser for Markdown support), enforcing compliance of every tool via make and pre-commit.

The code was written in a Test-driven (TDD) way: the tests always came before the feature itself, guiding what the implementation looks like, all the way to having 100% test coverage (whatever that means).

More importantly in my eyes, we only built what was strictly necessary, by using Behaviour-driven development (BDD, also called acceptance-test-driven development) to guide what subfeature to build next based on our needs. These specifications were encoded as Gherkin Features, available in a dedicated features/ folder, and thanks to the magic of Sphinx documentation, each of those is collected into a list of requirements in a dedicated Requirements page of the docs.

Since all of the feature files have associated acceptance tests that match the phrasing of the Gherkin features, future automation work could look at linking the requirements in Sphinx to the associated test file, so as to finally get full traceability from requirements, through specifications, to implementation and finally acceptance tests that pass.

This project was my first foray into literate programming at this scale, an attempt to bring together all the good ideas of TDD, modern Python development, Gherkin usage for requirements traceability purposes (without overly zealous extremes of Cucumber automation). All these ideas were until now scattered, implemented each without the others in different places, and this project fuses them into something I hope is more valuable than the sum of its parts.

If you like what you see here, have a look at my other writings, available on my blog: https://jiby.tech.

Post-script: scoring bug

A few weeks after the initial release of the project, reader @gpiancastelli helpfully reported a major bug relating to guess scoring via GitHub. In this post-script note, I want to walk through the process of investigating the bug, show how dissecting the issue made the fix emerge, and reflect on how such a bug could sneak in despite our careful approach.

I’m painfully aware of the ironic (and embarrassing) aspect of writing a whole novel about “programming using best practices” only to get such a crucial point very, very wrong. It would be easy to hide this bug, retroactively change the narrative above, and pretend we got it right the first time. Instead, I believe there’s a lesson worth learning and sharing in there.

The bug report

The original bug report states (slightly abridged):

There’s a bug in your score_guess function. If the guess contains two copies of a letter, and that letter is present only once in the answer, and the second copy in the guess matches that letter in the answer, the first copy will be marked as WRONG_PLACE, while the second copy will be marked as NO.

[…] Let’s say we have A__A_ as our guess and ___A_ as the answer. Your score_guess function will return 🟨__⬜_ instead of ⬜__🟩_.

An incorrect scoring function sounds very serious indeed, so the first step is confirming the issue with a good test case. Can we find words that match the rule? Let’s search the dictionaries:

# Pick an answer word ending with "n"
zgrep -iE "n\b" ./src/literate_wordle/assets/wordle_answers_dict.txt.gz
# Pick a guess-word ending with "n", and with another "n"
zgrep -iE "n.*n\b" ./src/literate_wordle/assets/wordle_accepted_words_dict.txt.gz

From the many results (those regular expressions are fairly vague), I manually chose the answer train and the guess xenon.

We want to show that score_guess is wrong, which is best done by adding a case to test_generic_score:

@pytest.mark.parametrize(
    "answer,our_guess,expected_score",
    [
        pytest.param("adage", "adobe", "🟩🟩⬜⬜🟩", id="normal_guess1"),
        pytest.param("serif", "quiet", "⬜⬜🟨🟨⬜", id="normal_guess2"),
        pytest.param("raise", "radix", "🟩🟩⬜🟨⬜", id="normal_guess3"),
        pytest.param("abbey", "kebab", "⬜🟨🟩🟨🟨", id="multi_occur1"),
        pytest.param("abbey", "babes", "🟨🟨🟩🟩⬜", id="multi_occur2"),
        pytest.param("abbey", "abyss", "🟩🟩🟨⬜⬜", id="multi_occur3"),
        pytest.param("abbey", "algae", "🟩⬜⬜⬜🟨", id="multi_occur4"),
        pytest.param("abbey", "keeps", "⬜🟨⬜⬜⬜", id="multi_occur5"),
        pytest.param("abbey", "abate", "🟩🟩⬜⬜🟨", id="multi_occur6"),
        pytest.param("train", "xenon", "⬜⬜⬜⬜🟩", id="multi_occur_issue1"),
    ],
)

Let’s run the tests to see the result:

make test
poetry run pytest
============ test session starts =============
platform linux -- Python 3.9.5, pytest-7.1.2, pluggy-1.0.0 -- /home/jiby/dev/ws/short/literate_wordle/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jiby/dev/ws/short/literate_wordle, configfile: pyproject.toml
plugins: cov-3.0.0, clarity-1.0.1
collected 23 items

tests/test_checking_guess_valid_word.py::test_reject_long_words PASSED [  4%]
tests/test_checking_guess_valid_word.py::test_reject_overly_short_words PASSED [  8%]
tests/test_checking_guess_valid_word.py::test_reject_nondict_words PASSED [ 13%]
tests/test_checking_guess_valid_word.py::test_accept_dict_words PASSED [ 17%]
tests/test_pick_word.py::test_pick_word_ok_length PASSED [ 21%]
tests/test_scoring_guess.py::test_perfect_guess PASSED [ 26%]
tests/test_scoring_guess.py::test_no_common_character PASSED [ 30%]
tests/test_scoring_guess.py::test_wrong_place PASSED [ 34%]
tests/test_scoring_guess.py::test_generic_score[normal_guess1] PASSED [ 39%]
tests/test_scoring_guess.py::test_generic_score[normal_guess2] PASSED [ 43%]
tests/test_scoring_guess.py::test_generic_score[normal_guess3] PASSED [ 47%]
tests/test_scoring_guess.py::test_generic_score[multi_occur1] PASSED [ 52%]
tests/test_scoring_guess.py::test_generic_score[multi_occur2] PASSED [ 56%]
tests/test_scoring_guess.py::test_generic_score[multi_occur3] PASSED [ 60%]
tests/test_scoring_guess.py::test_generic_score[multi_occur4] PASSED [ 65%]
tests/test_scoring_guess.py::test_generic_score[multi_occur5] PASSED [ 69%]
tests/test_scoring_guess.py::test_generic_score[multi_occur6] PASSED [ 73%]
tests/test_scoring_guess.py::test_generic_score[multi_occur_issue1] FAILED [ 78%]
tests/test_track_guess_number.py::test_first_guess_allowed PASSED [ 82%]
tests/test_track_guess_number.py::test_sixth_guess_allowed PASSED [ 86%]
tests/test_track_guess_number.py::test_seventh_guess_fails_game PASSED [ 91%]
tests/test_track_guess_number.py::test_winning_guess_wins PASSED [ 95%]
tests/test_track_guess_number.py::test_invalid_guess_not_counted PASSED [100%]

================== FAILURES ==================
___ test_generic_score[multi_occur_issue1] ___

answer = 'train', our_guess = 'xenon'
expected_score = '⬜⬜⬜⬜🟩'

    @pytest.mark.parametrize(
        "answer,our_guess,expected_score",
        [
            pytest.param("adage", "adobe", "🟩🟩⬜⬜🟩", id="normal_guess1"),
            pytest.param("serif", "quiet", "⬜⬜🟨🟨⬜", id="normal_guess2"),
            pytest.param("raise", "radix", "🟩🟩⬜🟨⬜", id="normal_guess3"),
            pytest.param("abbey", "kebab", "⬜🟨🟩🟨🟨", id="multi_occur1"),
            pytest.param("abbey", "babes", "🟨🟨🟩🟩⬜", id="multi_occur2"),
            pytest.param("abbey", "abyss", "🟩🟩🟨⬜⬜", id="multi_occur3"),
            pytest.param("abbey", "algae", "🟩⬜⬜⬜🟨", id="multi_occur4"),
            pytest.param("abbey", "keeps", "⬜🟨⬜⬜⬜", id="multi_occur5"),
            pytest.param("abbey", "abate", "🟩🟩⬜⬜🟨", id="multi_occur6"),
            pytest.param("train", "xenon", "⬜⬜⬜⬜🟩", id="multi_occur_issue1"),
        ],
    )
    def test_generic_score(answer, our_guess, expected_score):
        """Scenario Outline: Scoring guesses"""
        # Given a wordle <answer>
        # When scoring <guess>
        score = score_guess(our_guess, answer)
        # Then score should be <score>
>       assert score == expected_score
E       assert == failed. [pytest-clarity diff shown]
E
E         LHS vs RHS shown below
E
E         ⬜⬜🟨⬜⬜
E         ⬜⬜⬜⬜🟩
E

tests/test_scoring_guess.py:68: AssertionError
- generated xml file: /home/jiby/dev/ws/short/literate_wordle/test_results/results.xml -

----------- coverage: platform linux, python 3.9.5-final-0 -----------
Name                                     Stmts   Miss  Cover
------------------------------------------------------------
src/literate_wordle/__init__.py              0      0   100%
src/literate_wordle/assets/__init__.py       0      0   100%
src/literate_wordle/game.py                 38      0   100%
src/literate_wordle/guess.py                25      0   100%
src/literate_wordle/words.py                32      0   100%
------------------------------------------------------------
TOTAL                                       95      0   100%
Coverage HTML written to dir test_results/coverage.html
Coverage XML written to file test_results/coverage.xml

========== short test summary info ===========
FAILED tests/test_scoring_guess.py::test_generic_score[multi_occur_issue1]
======== 1 failed, 22 passed in 0.15s ========
make: *** [Makefile:16: test] Error 1

Bug confirmed! Whoops.

Why are we scoring badly?

If necessary, we can step through the example code to figure out what’s wrong, and I did. But overall, it seems that our approach to scoring by looking at character in a single pass is at fault.

The approach falls down with the example we were given, because we don’t first detect the second n of xenon as matching the last n of train (which would score it OK, 🟩), and only then, in another pass, mark the remaining first n as non-matching (⬜). Instead, we run over the characters in order, find an n in the wrong place, score it as wrong-place (🟨), and decrease the occurrence counter, so the next n is counted as non-matching (⬜), hence the bad score.
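To make the failure concrete, here is a small self-contained reconstruction of a single-pass scorer. This is an assumption-laden sketch, not the original score_guess: the names single_pass_score and remaining are made up here, and the logic was written only to reproduce the wrong score reported by the failing test above.

```python
from collections import Counter

OK, WRONG_PLACE, NO = "🟩", "🟨", "⬜"


def single_pass_score(guess: str, answer: str) -> str:
    """Single-pass scorer (illustrative reconstruction, not the original code)."""
    remaining = Counter(answer)  # occurrences of each answer letter still unused
    response = ""
    for guess_char, answer_char in zip(guess, answer):
        if remaining[guess_char] == 0:
            # Letter absent from the answer, or all its occurrences already used
            response += NO
            continue
        # "Use up" one occurrence, whether the position matches or not
        remaining[guess_char] -= 1
        response += OK if guess_char == answer_char else WRONG_PLACE
    return response


buggy_score = single_pass_score("xenon", "train")
```

With the guess xenon and the answer train, the first n consumes the only available occurrence, so by the time the loop reaches the final n (which sits in exactly the right place) the counter is already at zero.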

Thinking it through, it means that the single-pass scoring approach just cannot work, as we need to “look ahead”, knowing already the OK-ness of all guess characters before scoring the wrong-place-ness. Interesting!

So we will re-write this algorithm to work in two passes: First, detect exact matches of guess/answer character pairs, recording those as perfect score. Then, a second pairwise check looks for wrong-place score, defaulting to the mismatch “zero” score.

Fixing the issue

In order to score “out of order” (in two passes), the response needs to change from the original empty string being built, to some random-access structure: a list.

In designing the fix, we realise that a zero score, aka all-mismatch (⬜⬜⬜⬜⬜) is the “default” case of scoring. That is we “start” from that score, and score “up” by marking individual characters as matching.

We reflect that in the list initialisation, starting with the worst score as it means we avoid having to “detect” it anymore. That’s a tiny optimization of the code. But more importantly, this list is now randomly accessible, as we can now “peek ahead” when we couldn’t before.

def score_guess(guess: str, answer: str) -> str:
    """Score an individual guess with Counter"""
    # Counter("abbey") = Counter({'b': 2, 'a': 1, 'e': 1, 'y': 1})
    answer_chars = Counter(answer)
    # NO is the default score, no need to detect it explicitly
    response: list[str] = [CharacterScore.NO] * len(answer)
    # First pass to detect perfect scores
    for char_index, (guess_char, answer_char) in enumerate(zip(guess, answer)):
        if answer_char == guess_char:
            response[char_index] = CharacterScore.OK
            answer_chars[guess_char] -= 1
    # Second pass for the yellows
    for char_num, (guess_char, existing_score) in enumerate(zip(guess, response)):
        if existing_score == CharacterScore.OK:
            continue  # It's already green: skip
        if answer_chars[guess_char] > 0:
            response[char_num] = CharacterScore.WRONG_PLACE
            # Reduce occurrence counter since we "used" this occurrence
            answer_chars[guess_char] -= 1
    return "".join(response)

Note another minor change: we removed the check for guess_char in answer_chars. It was previously there to catch the case where the answer_chars dictionary didn’t have an entry for this guess_char, which we assumed would raise a KeyError on access, so we protected against that.

But as @gpiancastelli also pointed out, a collections.Counter isn’t a regular dictionary, the documentation says: “Counter objects have a dictionary interface except that they return a zero count for missing items.”. This helpful divergence from regular dictionaries protects us already from that missing key issue, so the code can flow just a little more smoothly.

Had this been a raw dict, not a Counter, we could have used the get method to supply a default value on a missing key, in the form answer_chars.get(guess_char, 0). We’d be trading off some clarity for brevity. Not as elegant as what Counter allows!
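The difference between the two lookups fits in a few lines. This is only an illustration using the answer train; plain_dict and the other variable names are made up for the comparison and do not appear in the project code.

```python
from collections import Counter

answer_chars = Counter("train")
plain_dict = dict(answer_chars)  # hypothetical plain-dict copy, for comparison

# Counter: a missing letter simply counts as zero, and no key gets added
counter_lookup = answer_chars["x"]
counter_still_clean = "x" not in answer_chars

# Plain dict: .get() needs an explicit default, direct indexing raises
dict_lookup = plain_dict.get("x", 0)
try:
    plain_dict["x"]
    dict_raised = False
except KeyError:
    dict_raised = True
```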

Still, the bug is fixed, as attested by tests going green again. We also check linters are happy and coverage is good (they are, it is). All is well!

Tangling again

We just re-defined (overwrote) a few code blocks from previous sections, so we need to re-weave them together into a real file.

If we just “fixed” the tangling blocks of above, the story wouldn’t be in order, wouldn’t make sense.

So we redefine a few files here:

Examples: Reported bug: multiple occurrence of same character in guess
  | answer | guess	| score		|
  | train  | xenon	| ⬜⬜⬜⬜🟩	|
"""Validates the Gherkin file features/scoring_guess.feature:

<<scoring-feature>>
"""

<<scoring-test-import-pytest>>

<<scoring-test-import>>


<<scoring-test1>>


<<scoring-test2>>


<<scoring-test3>>


<<scoring-multi-parameters2>>
<<scoring-multi-skeleton>>
<<scoring-guessmod-header>>


<<scoring-guessfunc-import>>
<<scoring-guess-enum-import>>


<<scoring-guess-enum>>

    <<track-guess-perfectscore>>


<<scoring-guessfunc-impl3>>

Why didn’t you catch this earlier? Is it TDD/BDD’s fault or are you just a bad dev?

We just found a bug, and fixed it. But why didn’t we catch it earlier!? Are TDD and BDD at fault? Can we just go back to coding without tests!?

I like to think that the process didn’t fail as much as my imagination did.

First, note how the Gherkin features, requirements gathering and so on did their job: we adequately planned for features, defined scenarios that did make sense, and implemented those correctly. So the BDD side delivered its value!

Purely TDD-wise, all the tests we defined were valid, and covered reasonable aspects of the features to help design the new functions’ shapes. Nothing to say there either.

The failing was in the (lack of) diversity of scores used as examples: we didn’t cover a broad enough set of score samples to find issues like this one.

But finding this bug isn’t obvious: unless you already knew about it (from reading the source code and spotting a really non-obvious flaw), finding it would require playing this implementation at random until a bad score shows up (which could take minutes or hours, given the randomness involved), then reproducing the example and reporting it. This is likely what the bug reporter did: played around and found a bad case.

As a developer, I didn’t have any particular reason to suspect this specific scoring issue, so I didn’t develop a test case with it.

But I like to think that I was so close!

As you can see in the sections above, I was worried about scoring for multiple letters, as shown in the scoring example table. I remember this being a concern, because any naive implementation of Wordle could miss the nuance of “the real Wordle”. I even broke out screenshots from the real Wordle website to make up some references, because I couldn’t explain to myself how the scoring should happen.

Unfortunately, my attention was on multiple identical characters in the answer, not in the guess.

So, again, I was close enough to look for similar bugs, but didn’t quite find a diverse enough set of sample scores to unearth this particular issue.
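In hindsight, one way to widen the set of samples without knowing the bug in advance would be to check an invariant over many word pairs rather than hand-picking scores. The sketch below is only a suggestion, under assumptions: find_green_violations is a hypothetical helper, score_fn stands for any scorer taking (guess, answer) and returning one emoji per letter, and the only rule checked is "a letter in exactly the right place must be green".

```python
from itertools import permutations
from typing import Callable, Iterable

GREEN = "🟩"


def find_green_violations(
    score_fn: Callable[[str, str], str],
    words: Iterable[str],
) -> list[tuple[str, str]]:
    """Return (guess, answer) pairs where a same-position match isn't green."""
    violations = []
    for guess, answer in permutations(list(words), 2):
        score = score_fn(guess, answer)
        for guess_char, answer_char, mark in zip(guess, answer, score):
            if guess_char == answer_char and mark != GREEN:
                violations.append((guess, answer))
                break  # one violation per pair is enough to report it
    return violations
```

Fed a single-pass scorer and a word list containing train and xenon, such a sweep would be expected to report the (xenon, train) pair. Running it over a whole dictionary grows quadratically with the number of words, so sampling a subset may be more practical.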

Why should this be a failure story?

Before we go, I want to flip the narrative around this bug:

The way I see it, I built a fun implementation of Wordle to play with Python, TDD and BDD. I spent a reasonable amount of time on “due diligence research” around edge cases (seen in above section) to feel good about the solution.

Isolating the bug (by adding a single line to the tests), and fixing it (a few paragraphs, one function) was a minuscule amount of additional effort, thanks to our strong test harness.

Avoiding the bug in the first place would have cost a lot more time, doing research into 100% compatibility with existing Wordle implementations, likely having to connect someone else’s Wordle code to ours to compare (with all the associated issues to deal with), for comparatively minor benefits.

This isn’t NASA, who has a single chance to send rockets, and (comparatively) infinite engineering time to plan it. In our case, the cost of making the system robust can be prohibitive. The discipline of Engineering is about balancing acceptable risks against the costs of reliability.

So, despite having to issue a rectification to this narrative, I still believe the amount of pre-production research was sufficient: We did nothing wrong here.

This bugfix also showcases the iterative nature of software development: Earlier sections demonstrated feature addition as incremental changes, but we see here that refining the solution when it’s subtly wrong is an iterative process too!

So, yeah, building code to be correct the first time is hard. Or maybe almost impossible. Or even not the best course of action for you!

The best way to build code is to “make it work, make it right, then make it fast” in that order.