Add a coverage bot example #75
Conversation
Force-pushed from 36f84a9 to b6ea2c7
I updated the code to use Outlines' prompting DSL, and to define the target functions as Python functions instead of strings. I think that we could now compute coverage by evaluating the response using e.g.
Force-pushed from ddf8c0c to d4126c3
examples/cover_bot.py (outdated):

    completer = models.text_completion.openai("gpt-3.5-turbo")

    # Naive way to prevent ourselves from spamming the service
As long as we're not making calls in parallel, the response time is long enough that you won't be spamming the service.
examples/cover_bot.py (outdated):

    run_statements = "\n".join([create_call(fname) for fname in test_function_names])

    c1 = f"""
    {inspect.getsource(target_code)}
Since `target_code` is now a Python function, is there a way we could run coverage by evaluating the string that represents the test function and calling pytest between `cov.start()` and `cov.stop()`? You'd also get quicker feedback on whether the test is valid Python code.
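A minimal sketch of that idea, not part of this PR: `llm_test_code`, `test_generated.py`, and a `target.py` module defining `add` are all placeholder names here, standing in for the model's completion and the code under test.

```python
import coverage
import pytest

# Hypothetical completion string; in the PR this would come from the model.
llm_test_code = (
    "from target import add\n"
    "\n"
    "def test_add():\n"
    "    assert add(1, 2) == 3\n"
)

# Write the completion to a real file so pytest can collect it; an invalid
# completion fails fast at collection time, before coverage is consulted.
with open("test_generated.py", "w") as f:
    f.write(llm_test_code)

cov = coverage.Coverage()
cov.start()
pytest.main(["-q", "test_generated.py"])
cov.stop()
cov.save()

# analysis2 returns (filename, statements, excluded, missing, missing_str);
# the missing line numbers can be fed back into the next prompt.
_, statements, _, missing, _ = cov.analysis2("target.py")
print(f"Uncovered lines in target.py: {missing}")
```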
examples/cover_bot.py (outdated):

    """

    def query(target_code, test_code, lines):
Could use this instead:

    query = outlines.text.function(
        completer,
        construct_prompt,
        collect_test_function
    )
Force-pushed from 7bd0f4d to a2e0b68
I think we need to revert it so that the input code is strictly string-based again, because function inputs make it more difficult to construct examples (especially interactively), and I don't know how well they would work when automated (e.g. run on an entire codebase).
Force-pushed from 086c16a to fcc62ac
This is a wacky thought and far out of the scope of this PR, but it would be pretty cool if this linked with a linter and used the linter output to generate prompts (e.g. "there's no package X in module Y" can be converted into a prompt).
Yes, definitely! One of the checkboxes in the PR description mentions this and references MyPy as an example, but the same applies to all other linter-provided static analysis information.
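A rough sketch of that idea, not part of this PR: run `mypy` on the generated file and fold its diagnostics into the next prompt. The function name and the prompt wording below are hypothetical.

```python
import subprocess
from typing import Optional


def lint_feedback_prompt(path: str) -> Optional[str]:
    """Convert mypy diagnostics for `path` into a follow-up prompt."""
    result = subprocess.run(["mypy", path], capture_output=True, text=True)
    if result.returncode == 0:
        return None  # no diagnostics, nothing to feed back

    # mypy's plain-text output lines look like: path:line: error: message
    issues = "\n".join(
        line for line in result.stdout.splitlines() if ": error:" in line
    )
    return (
        "The previous completion produced these type errors:\n"
        f"{issues}\n"
        "Rewrite the test so that these errors no longer occur."
    )
```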
Force-pushed from fcc62ac to 8f09fa5
Force-pushed from 8f09fa5 to d5d2cb1
Force-pushed from d5d2cb1 to 4feef0e
Force-pushed from 4feef0e to c540fd8
Should we merge this?

It needs some real reworking, so we should probably just close it for now.
This PR introduces a toy prompting loop that requests code completions for full-coverage unit tests of a given Python function. It uses `coverage` to find uncovered lines and provide them as more explicit feedback in the prompt loop.

It is currently just a very simple, potentially multi-stage code completion task with goal verification (i.e. a conveniently accessible, numeric, and—in the NL prompt interface—highly specific metric for the goal), and perhaps a somewhat trivial RL problem. It serves as a fun example and focal point for many of our general interests, though. For instance, it seems like a task that mixes well with our existing fine-tuning interests.
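A condensed sketch of the loop, with the collaborators passed in as callables; the names below are placeholders rather than the actual code in examples/cover_bot.py.

```python
from typing import Callable, List


def cover_bot(
    target_source: str,
    completer: Callable[[str], str],             # LM completion call
    build_prompt: Callable[[str, str, List[int]], str],
    run_coverage: Callable[[str], List[int]],    # returns uncovered line numbers
    max_rounds: int = 5,
) -> str:
    """Iteratively request tests until the target is fully covered."""
    test_source, missing = "", []
    for _ in range(max_rounds):
        # Fold the current test and the uncovered lines into the next prompt.
        prompt = build_prompt(target_source, test_source, missing)
        test_source = completer(prompt)
        missing = run_coverage(test_source)
        if not missing:  # full coverage reached
            break
    return test_source
```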
The following is an example dialog:
Extensions

We want this example—to the best of its ability—to highlight the reasons why one should use `outlines`. In that sense, it can also serve as a means of generating ideas for `outlines` itself. For instance, we don't want examples that exhibit the following responses:
Some of this could be performed outside of the LM, and we'll need to consider what should and shouldn't be.
- [ ] Use linter-provided static analysis information (e.g. `mypy`).
- [ ] Use the `pytest` framework to automate the application of the process. Seems like we should perform an initial coverage pass over the target code with no tests and use those results to determine testable elements (see the sketch after this list); we could also restrict ourselves to module-level functions for now.
- [ ] Use `pytest.mark.skip` on unfinished tests.
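A sketch of that initial pass, assuming the module under test is importable as `target` (a placeholder name): a bare import only executes top-level statements, so the lines `coverage` reports as missing afterwards are exactly the function bodies that tests still need to reach.

```python
import importlib
import inspect

import coverage

cov = coverage.Coverage()
cov.start()
target = importlib.import_module("target")  # placeholder module name
cov.stop()
cov.save()

# Lines not executed by the bare import: the bodies tests must cover.
_, _, _, missing, _ = cov.analysis2(inspect.getsourcefile(target))

# Restrict attention to module-level functions defined in the module itself.
functions = [
    obj for _, obj in inspect.getmembers(target, inspect.isfunction)
    if obj.__module__ == target.__name__
]
```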