Add a coverage bot example #75
Conversation
Force-pushed from 36f84a9 to b6ea2c7
I updated the code to use Outlines' prompting DSL, and to define the target functions as Python functions instead of strings. I think that we could now compute coverage by evaluating the response using e.g.
Force-pushed from ddf8c0c to d4126c3
examples/cover_bot.py (outdated):

    completer = models.text_completion.openai("gpt-3.5-turbo")

    # Naive way to prevent ourselves from spamming the service
As long as we're not making calls in parallel, the response time is long enough that you won't be spamming the service.
examples/cover_bot.py (outdated):

    run_statements = "\n".join([create_call(fname) for fname in test_function_names])

    c1 = f"""
    {inspect.getsource(target_code)}
Since `target_code` is now a Python function, is there a way we could run coverage by evaluating the string that represents the test function and calling pytest between `cov.start()` and `cov.stop()`? You'd also get quicker feedback on whether the test is valid Python code.
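A minimal sketch of that idea, not part of this PR: `llm_test_code`, `test_generated.py`, and a `target.py` module defining `add` are all placeholder names here, standing in for the model's completion and the code under test.

```python
import coverage
import pytest

# Hypothetical completion string; in the PR this would come from the model.
llm_test_code = (
    "from target import add\n"
    "\n"
    "def test_add():\n"
    "    assert add(1, 2) == 3\n"
)

# Write the completion to a real file so pytest can collect it; an invalid
# completion fails fast at collection time, before coverage is consulted.
with open("test_generated.py", "w") as f:
    f.write(llm_test_code)

cov = coverage.Coverage()
cov.start()
pytest.main(["-q", "test_generated.py"])
cov.stop()
cov.save()

# analysis2 returns (filename, statements, excluded, missing, missing_str);
# the missing line numbers can be fed back into the next prompt.
_, statements, _, missing, _ = cov.analysis2("target.py")
print(f"Uncovered lines in target.py: {missing}")
```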
examples/cover_bot.py (outdated):

    """

    def query(target_code, test_code, lines):
Could use this instead:

    query = outlines.text.function(
        completer,
        construct_prompt,
        collect_test_function
    )
Force-pushed from 7bd0f4d to a2e0b68
I think we need to revert it so that the input code is strictly string-based again, because function inputs make it more difficult to construct examples (especially interactively), and I don't know how well they would work when automated (e.g. run on an entire codebase).
Force-pushed from 086c16a to fcc62ac
This is a wacky thought and far out of the scope of this PR, but it would be pretty cool if this linked with a linter and used the linter output to generate prompts (e.g. "there's no package X in module Y" can be converted into a prompt).
Yes, definitely! One of the checkboxes in the PR description mentions this and references MyPy as an example, but the same applies to all other linter-provided static analysis information.
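A rough sketch of that idea, not part of this PR: run `mypy` on the generated file and fold its diagnostics into the next prompt. The function name and the prompt wording below are hypothetical.

```python
import subprocess
from typing import Optional


def lint_feedback_prompt(path: str) -> Optional[str]:
    """Convert mypy diagnostics for `path` into a follow-up prompt."""
    result = subprocess.run(["mypy", path], capture_output=True, text=True)
    if result.returncode == 0:
        return None  # no diagnostics, nothing to feed back

    # mypy's plain-text output lines look like: path:line: error: message
    issues = "\n".join(
        line for line in result.stdout.splitlines() if ": error:" in line
    )
    return (
        "The previous completion produced these type errors:\n"
        f"{issues}\n"
        "Rewrite the test so that these errors no longer occur."
    )
```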
Force-pushed from fcc62ac to 8f09fa5
Force-pushed from 8f09fa5 to d5d2cb1
Force-pushed from d5d2cb1 to 4feef0e
Force-pushed from 4feef0e to c540fd8
Should we merge this?

It needs some real reworking, so we should probably just close it for now.
This PR introduces a toy prompting loop that requests code completions for full-coverage unit tests of a given Python function. It uses `coverage` to find uncovered lines and provide them as more explicit feedback in the prompt loop.

It is currently just a very simple, potentially multi-stage code completion task with goal verification (i.e. a conveniently accessible, numeric, and—in the NL prompt interface—highly specific metric for the goal), and perhaps a somewhat trivial RL problem. It serves as a fun example and focal point for many of our general interests, though. For instance, it seems like a task that mixes well with our existing fine-tuning interests.
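A condensed sketch of the loop, with the collaborators passed in as callables; the names below are placeholders rather than the actual code in examples/cover_bot.py.

```python
from typing import Callable, List


def cover_bot(
    target_source: str,
    completer: Callable[[str], str],             # LM completion call
    build_prompt: Callable[[str, str, List[int]], str],
    run_coverage: Callable[[str], List[int]],    # returns uncovered line numbers
    max_rounds: int = 5,
) -> str:
    """Iteratively request tests until the target is fully covered."""
    test_source, missing = "", []
    for _ in range(max_rounds):
        # Fold the current test and the uncovered lines into the next prompt.
        prompt = build_prompt(target_source, test_source, missing)
        test_source = completer(prompt)
        missing = run_coverage(test_source)
        if not missing:  # full coverage reached
            break
    return test_source
```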
The following is an example dialog:
Extensions

We want this example—to the best of its ability—to highlight the reasons why one should use `outlines`. In that sense, it can also serve as a means of generating ideas for `outlines` itself. For instance, we don't want examples that exhibit the following responses:
Some of this could be performed outside of the LM, and we'll need to consider what should and shouldn't be.
- [ ] Use linter-provided static analysis information (e.g. `mypy`).
- [ ] Use the `pytest` framework to automate the application of the process. Seems like we should perform an initial coverage pass over the target code with no tests and use those results to determine testable elements (see the sketch after this list); we could also restrict ourselves to module-level functions for now.
- [ ] Use `pytest.mark.skip` on unfinished tests.
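A sketch of that initial pass, assuming the module under test is importable as `target` (a placeholder name): a bare import only executes top-level statements, so the lines `coverage` reports as missing afterwards are exactly the function bodies that tests still need to reach.

```python
import importlib
import inspect

import coverage

cov = coverage.Coverage()
cov.start()
target = importlib.import_module("target")  # placeholder module name
cov.stop()
cov.save()

# Lines not executed by the bare import: the bodies tests must cover.
_, _, _, missing, _ = cov.analysis2(inspect.getsourcefile(target))

# Restrict attention to module-level functions defined in the module itself.
functions = [
    obj for _, obj in inspect.getmembers(target, inspect.isfunction)
    if obj.__module__ == target.__name__
]
```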