Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a coverage bot example #75

Closed
wants to merge 1 commit into from
Closed

Conversation

brandonwillard
Copy link
Member

@brandonwillard brandonwillard commented Apr 29, 2023

This PR introduces a toy prompting loop that requests code completions for full-coverage unit tests of a given Python function. It uses coverage to find uncovered lines and provide that as more-explicit feedback in the prompt loop.

It is currently just a very simple, potentially multi-stage code completion task with goal verification (i.e. a conveniently accessible, numeric, and highly specific—in the NL prompt interface—metric for the goal), and perhaps a somewhat trivial RL problem. It serves as a fun example and focal point for many of our general interests, though. For instance, it seems like a task that mixes well with our existing fine-tuning interests.

The following is an example dialog
ATTEMPT: 0

PROMPT:
The Python module code is as follows:

def some_function(x, y):
    if x < 0:
        z = y - x
    else:
        z = y + x

    if z > y:
        return True

    return False


Print only the code for a completed version of the following Python function named `test_some_function` that attains full coverage over the Python module code above:

def test_some_function():

RESPONSE:
def test_some_function():
    assert some_function(1, 2) == True
    assert some_function(-1, 2) == True
    assert some_function(0, 2) == False
    assert some_function(-2, 2) == False
    assert some_function(-2, 4) == True
    assert some_function(2, 4) == True
    assert some_function(0, -1) == True
    assert some_function(1, 0) == False
    assert some_function(-1, -2) == False


ATTEMPT: 0

PROMPT:
The Python module code is as follows:

def some_other_function(x: int, y: int) -> bool:
    z = 0
    for i in range(x):
        if x < 3:
            z = y - x
        else:
            z = y + x

    if z > y:
        return True

    return False


Print only the code for a completed version of the following Python function named `test_some_function` that attains full coverage over the Python module code above:

def test_some_function():

RESPONSE:
import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False

if __name__ == "__main__":
    pytest.main()


ATTEMPT: 1

PROMPT:
The Python module code is as follows:

def some_other_function(x: int, y: int) -> bool:
    z = 0
    for i in range(x):
        if x < 3:
            z = y - x
        else:
            z = y + x

    if z > y:
        return True

    return False


Print only the code for a completed version of the following Python function named `test_some_function` that attains coverage for lines 8 and 11 in the Python module code above:

import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False

if __name__ == "__main__":
    pytest.main()

RESPONSE:
import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False
    
    # test for line 8
    assert some_other_function(2, 1) == True
    
    # test for line 11
    assert some_other_function(9, 5) == False
    
if __name__ == "__main__":
    pytest.main()


ATTEMPT: 2

PROMPT:
The Python module code is as follows:

def some_other_function(x: int, y: int) -> bool:
    z = 0
    for i in range(x):
        if x < 3:
            z = y - x
        else:
            z = y + x

    if z > y:
        return True

    return False


Print only the code for a completed version of the following Python function named `test_some_function` that attains coverage for lines 8 and 11 in the Python module code above:

import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False
    
    # test for line 8
    assert some_other_function(2, 1) == True
    
    # test for line 11
    assert some_other_function(9, 5) == False
    
if __name__ == "__main__":
    pytest.main()

RESPONSE:
import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False
    
    # test for line 8
    assert some_other_function(2, 1) == True
    assert some_other_function(3, 0) == True
    
    # test for line 11
    assert some_other_function(9, 5) == False
    assert some_other_function(4, 2) == False
    
if __name__ == "__main__":
    pytest.main()


ATTEMPT: 3

PROMPT:
The Python module code is as follows:

def some_other_function(x: int, y: int) -> bool:
    z = 0
    for i in range(x):
        if x < 3:
            z = y - x
        else:
            z = y + x

    if z > y:
        return True

    return False


Print only the code for a completed version of the following Python function named `test_some_function` that attains coverage for lines 8 and 11 in the Python module code above:

import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False
    
    # test for line 8
    assert some_other_function(2, 1) == True
    assert some_other_function(3, 0) == True
    
    # test for line 11
    assert some_other_function(9, 5) == False
    assert some_other_function(4, 2) == False
    
if __name__ == "__main__":
    pytest.main()

RESPONSE:
import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False
    
    # test for line 8
    assert some_other_function(2, 1) == True
    assert some_other_function(3, 0) == True
    
    # test for line 11
    assert some_other_function(9, 5) == False
    assert some_other_function(4, 2) == False
    
    # test for line 8 and 11
    assert some_other_function(1, 10) == True


ATTEMPT: 4

PROMPT:
The Python module code is as follows:

def some_other_function(x: int, y: int) -> bool:
    z = 0
    for i in range(x):
        if x < 3:
            z = y - x
        else:
            z = y + x

    if z > y:
        return True

    return False


Print only the code for a completed version of the following Python function named `test_some_function` that attains coverage for lines 8 and 11 in the Python module code above:

import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False
    
    # test for line 8
    assert some_other_function(2, 1) == True
    assert some_other_function(3, 0) == True
    
    # test for line 11
    assert some_other_function(9, 5) == False
    assert some_other_function(4, 2) == False
    
    # test for line 8 and 11
    assert some_other_function(1, 10) == True

RESPONSE:
import pytest

def test_some_function():
    assert some_other_function(2, 5) == True
    assert some_other_function(1, 3) == False
    assert some_other_function(4, 6) == True
    assert some_other_function(0, 8) == False
    assert some_other_function(3, 7) == True
    assert some_other_function(5, 1) == False
    assert some_other_function(6, 10) == True
    assert some_other_function(2, 2) == False
    
    # test for line 8
    assert some_other_function(2, 1) == True
    assert some_other_function(3, 0) == True
    
    # test for line 11
    assert some_other_function(9, 5) == False
    assert some_other_function(4, 2) == False
    
    # test for line 8 and 11
    assert some_other_function(1, 10) == True

Extensions

  • The code needs to be refactored to make more/better use of outlines,
    We want this example—to the best of its ability—to highlight the reasons why one should use outlines. In that sense, it can also serve as a means of generating ideas for outlines itself.
  • This code needs tests for itself!
    For instance, we don't examples that exhibit the following responses:
    • a partially completed test response that needs more iterations and converges,
    • one that doesn't converge, and
    • nonsense/invalid code.
  • Ultimately, we need to see if we can get this code to produces tests for itself. If it can't, we can at least try to get some inspiration for good, self-contained unit tests from the results and inch our way toward something more interesting.
  • Adjust the prompts to add constraints on the code produced. For instance, we don't want redundant tests—wrt. coverage, at least.
    Some of this could be performed outside of the LM, and we'll need to consider what should and shouldn't be.
    • Likewise, we should experiment with the amount of contextual information and formalism the LM can constructively use (e.g. type checking with mypy).
  • Use the pytest framework to automate the application of the process.
  • The ability to extract and define testable elements of a target codebase.
    Seems like we should perform an initial coverage pass over the target code with no tests and use those results to determine testable elements. We could also restrict ourselves to module-level functions for now.
  • Handling of incomplete/non-convergent responses produced by recursion (i.e. prompting loop/iteration) limits. For instance, we could pytest.mark.skip unfinished tests.
  • Perhaps some human-in-the-loop element that allows one to aid in the prompting process and arrival of a valid solution.
  • Consider some interesting integrations and comparisons with existing test-constructing/automating tools (e.g. Hypothesis, Pynguin).

@rlouf
Copy link
Member

rlouf commented Apr 29, 2023

I updated the code to use Outline's prompting DSL, and define the target functions as python functions instead of strings. I think that we could now compute coverage by evaluating the response using e.g. ast.eval (not safe for production use) instead of writing a file and loading it.


completer = models.text_completion.openai("gpt-3.5-turbo")

# Naive way to prevent ourselves from spamming the service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we're not making calls in parallel the response time is long enough that you won't be spamming the service.

run_statements = "\n".join([create_call(fname) for fname in test_function_names])

c1 = f"""
{inspect.getsource(target_code)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since target_code is now a Python function is there a way we could run coverage by evaluating the string that represents the test function and calling pytest between cov.start() and cov.stop()? You'd also get quicker feedback on whether the test is valid Python code.

"""


def query(target_code, test_code, lines):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use instead

query = outlines.text.function(
    completor,
    construct_prompt,
    collect_test_function
)

@rlouf rlouf force-pushed the main branch 2 times, most recently from 7bd0f4d to a2e0b68 Compare May 4, 2023 12:39
@brandonwillard
Copy link
Member Author

I think we need to revert it so that the input code is strictly string-based again, because it's more difficult to construct examples—especially interactively—and I don't know how well it will work when automated (e.g. to run on an entire codebase).

@brandonwillard brandonwillard force-pushed the cover-bot-example branch 2 times, most recently from 086c16a to fcc62ac Compare May 18, 2023 21:07
@dpsimpson
Copy link

This is a whacky thought and far out of the scope of this PR, but it would be pretty cool if this would link with a linter and use the linter output to generate prompts (eg "there's no package X in module Y" can be converted into a prompt)

@brandonwillard
Copy link
Member Author

This is a whacky thought and far out of the scope of this PR, but it would be pretty cool if this would link with a linter and use the linter output to generate prompts (eg "there's no package X in module Y" can be converted into a prompt)

Yes, definitely! One of the checkboxes in the PR description mentions this and references MyPy as an example, but the same applies to all other linter-provided static analysis information.

@rlouf
Copy link
Member

rlouf commented Nov 8, 2023

Should we merge this?

@brandonwillard
Copy link
Member Author

Should we merge this?

It needs some real reworking, so we should probably just close it for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants