Custom Prompts via YAML #37

Merged
merged 44 commits into from
Jul 9, 2024
Commits
0f09bc5
Adds ManuscriptPromptConfig, which performs filename->prompt resoluti…
falquaddoomi Jan 10, 2024
f2f96a7
Integrates prompt->filename resolution into ManuscriptEditor.revise_m…
falquaddoomi Jan 10, 2024
f85aa87
Adds a few tests for ManuscriptPromptConfig
falquaddoomi Jan 10, 2024
d513eef
get_prompt_for_filename() now returns "__IGNORE_FILE__" for the promp…
falquaddoomi Jan 31, 2024
5704998
Adds testing utility mock_unify_open(), which allows us to merge manu…
falquaddoomi Jan 31, 2024
68ac05e
Adds tests for four scenarios mentioned in issue 31. Adds ignore test…
falquaddoomi Jan 31, 2024
de5b3c9
add pre-commit config
miltondp Feb 2, 2024
49ab292
fix python code style
miltondp Feb 2, 2024
21f3c3a
remove old test file
miltondp Feb 2, 2024
335aace
setup.py: update version
miltondp Feb 14, 2024
9b93c08
add DebuggingManuscriptRevisionModel
miltondp Feb 14, 2024
cc3adb0
update DebuggingManuscriptRevisionModel
miltondp Feb 14, 2024
40432f0
update DebuggingManuscriptRevisionModel to pretty print json
miltondp Feb 14, 2024
4dac807
Fixes issue where default_prompt in ai_revision-config.yaml was treat…
falquaddoomi Mar 13, 2024
f3e7a91
Applied typing suggestions from pylance
falquaddoomi Mar 13, 2024
acb1d3d
Strips prompts of leading and trailing spaces
falquaddoomi Mar 13, 2024
bbd9b08
Adds warning about both files.matchings and prompts_files being speci…
falquaddoomi Mar 13, 2024
2e7d897
Updates phenoplier prompt config tests to treat the config prompt ref…
falquaddoomi Mar 13, 2024
e8dfb23
Fixes issue where resolved_prompt's value wasn't being propogated dow…
falquaddoomi Mar 27, 2024
55972b4
Applies placeholder replacements to ai_revision-derived prompts, as i…
falquaddoomi Mar 27, 2024
73870ae
Adds a bit more debugging output to DebuggingManuscriptRevisionModel,…
falquaddoomi Mar 27, 2024
264d150
Changed 'default_prompt' in the tests to refer to a prompt key and no…
falquaddoomi Apr 4, 2024
88ba10e
Changed invalid field references to items included in the format() ca…
falquaddoomi Apr 4, 2024
e27a996
Adds an end-to-end test that applies custom prompts via the Debugging…
falquaddoomi Apr 4, 2024
8b92c45
Adds e2e test that uses the GPT3 model to revise each paragraph to in…
falquaddoomi Apr 5, 2024
8c39e8f
Removes unnecessary printing of params in DebuggingManuscriptRevision…
falquaddoomi Apr 10, 2024
4377d80
Per MP's idea, adds single-paragraph version of phenoplier, used in t…
falquaddoomi Apr 10, 2024
7084475
models: change model_engine to gpt-3.5-turbo
miltondp Apr 17, 2024
d672346
Updated model_basics tests to check for "gpt-3.5-turbo", not "text-da…
falquaddoomi Apr 17, 2024
fa28c19
DebuggingManuscriptRevisionModel() no longer supplies default title o…
falquaddoomi Apr 18, 2024
1fb08b8
Removes hardcoded front-matter ignore in favor of using other ignore …
falquaddoomi Apr 18, 2024
6c08714
Adds test for correct title, keywords in resulting prompts.
falquaddoomi Apr 18, 2024
98b174d
Updates a few DebuggingManuscriptRevisionModel() uses that didn't sup…
falquaddoomi Apr 18, 2024
8cf6ca8
Finishes updating test for title inclusion in test_prompts_in_final_r…
falquaddoomi Apr 18, 2024
f9f89d1
Updates test that previously discounted front-matter to include it, s…
falquaddoomi Apr 18, 2024
8ea6d51
Factors out replacements/placeholders dicts into a shared placeholder…
falquaddoomi Apr 18, 2024
3ae73de
DebuggingManuscriptRevisionModel provides a default title and keyword…
falquaddoomi May 1, 2024
1766a78
Adds default prompt from MP that overrides section-specific canned pr…
falquaddoomi Jul 4, 2024
c68394a
Adds info to the root README about configuring prompts. Adds a separa…
falquaddoomi Jul 4, 2024
de8bb1f
Added skips for section-specific prompt tests when DEFAULT_PROMPT_OVE…
falquaddoomi Jul 4, 2024
6a5dacb
Update docs/custom-prompts.md
falquaddoomi Jul 5, 2024
8867468
Fleshed out section_name description, removed link to personal repo
falquaddoomi Jul 6, 2024
31e8363
Mock-patches DEFAULT_PROMPT_OVERRIDE to None for tests that need it b…
falquaddoomi Jul 8, 2024
310e9c4
Removes DEFAULT_PROMPT_OVERRIDE, reverts tests
falquaddoomi Jul 8, 2024
29 changes: 29 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,29 @@
default_language_version:
  python: python3.10
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      # Check for files that contain merge conflict strings.
      - id: check-merge-conflict
      # Check for debugger imports and py37+ `breakpoint()` calls in python source.
      - id: debug-statements
      # Replaces or checks mixed line ending
      - id: mixed-line-ending
      # Check for files that would conflict in case-insensitive filesystems
      - id: check-case-conflict
      # This hook checks toml files for parseable syntax.
      - id: check-toml
      # This hook checks yaml files for parseable syntax.
      - id: check-yaml
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.2.0
    hooks:
      - id: ruff
        args:
          - --fix
  - repo: https://github.com/python/black
    rev: 24.1.1
    hooks:
      - id: black
        language_version: python3
15 changes: 15 additions & 0 deletions README.md
@@ -12,6 +12,21 @@ The Manubot AI Editor can be used from the GitHub repository of a Manubot-based
You first need to follow the steps to [set up a Manubot-based manuscript](https://github.com/manubot/rootstock).
Then, follow [these instructions](https://github.com/manubot/rootstock/blob/main/USAGE.md#ai-assisted-authoring) to set up a workflow in GitHub Actions that lets you quickly trigger a job to revise your manuscript.

### Configuring Prompts

In order to revise your manuscript, prompts must be provided to the AI model. There are two ways to do this:
- **Default prompts**: you can use the default prompts provided by the tool, in which case you don't need to do anything.
- **Custom prompts**: you can define your own prompts to apply to specific files using YAML configuration files that you include with your manuscript.

The default prompt, which should work for most manuscripts, is the following:

```
Proofread the following paragraph that is part of a scientific manuscript.
Keep all Markdown formatting, citations to other articles, mathematical expressions, and equations.
```

If you wish to customize the prompts on a per-file basis, see [docs/custom-prompts.md](docs/custom-prompts.md) for more information.

### Command line

To use the tool from the command line, you first need to install Manubot in a Python environment:
131 changes: 131 additions & 0 deletions docs/custom-prompts.md
@@ -0,0 +1,131 @@
# Custom Prompts

Rather than using the default prompt, you can specify custom prompts for each file in your manuscript.
This can be useful when you want specific sections of your manuscript to be revised in specific ways, or not revised at all.

There are two ways to use the custom prompts system:
1. Define your prompts and how they map to your manuscript files in a single file, `ai-revision_prompts.yaml`.
2. Create `ai-revision_prompts.yaml` with only prompts and their identifiers, which makes it suitable for sharing with others whose manuscript files are named differently.
   Then add a second file, `ai-revision_config.yaml`, that maps the prompt identifiers to the actual files in your manuscript.

These files should be placed in the `content` directory alongside your manuscript markdown files.

See [Functionality Notes](#functionality-notes) later in this document for more information on how to write regular expressions and use placeholders in your prompts.

## Approach 1: Single file

With this approach, you can define your prompts and how they map to your manuscript files in a single file.
The single file should be named `ai-revision_prompts.yaml` and placed in the `content` folder.

The file would look something like the following:

```yaml
prompts_files:
  # filenames are specified as regular expressions
  # in this case, we match a file named exactly 'filename.md'
  ^filename\.md$: "Prompt text here"

  # you can use YAML's multi-line string syntax to write longer prompts
  # you can also use {placeholders} to include metadata from your manuscript
  # (each key must be unique, so this entry uses a different filename)
  ^another_filename\.md$: |
    Revise the following paragraph from a manuscript titled {title}
    so that it sounds like an academic paper.

  # specifying the special value 'null' will skip revising any files that
  # match this regular expression
  ^ignore_this_file\.md$: null
```

Note that, for each file, the first matching regular expression will determine its prompt or whether the file is skipped.
Even if a file matches multiple regexes, only the first one will be used.
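
The first-match rule can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the `PROMPTS_FILES` dict, the `SKIP` sentinel, and the `first_matching_prompt()` helper are hypothetical names, and the use of `re.search` is an assumption based on the matching behavior described in this document.

```python
import re

# Hypothetical stand-in for a parsed `prompts_files` mapping. A dict keeps
# insertion order, so the rules are tried from top to bottom.
PROMPTS_FILES = {
    r"^ignore_this_file\.md$": None,  # YAML null: skip the file
    r"^02\..*\.md$": "Revise this section so it sounds like an academic paper.",
    r"\.md$": "Proofread the following paragraph.",
}

SKIP = object()  # hypothetical sentinel meaning "do not revise this file"


def first_matching_prompt(filename, prompts_files=PROMPTS_FILES):
    """Return the prompt of the first regex that matches `filename`.

    Returns SKIP if that first match maps to null, or None if nothing matches.
    """
    for pattern, prompt in prompts_files.items():
        if re.search(pattern, filename):
            return SKIP if prompt is None else prompt
    return None
```

Here `02.methods.md` matches both the second and third expressions, but only the second one is used because it comes first.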


## Approach 2: Prompt file plus configuration file

In this case, we specify two files, `ai-revision_prompts.yaml` and `ai-revision_config.yaml`.

The `ai-revision_prompts.yaml` file contains only the prompts and their identifiers.
The top-level element is `prompts` in this case rather than `prompts_files`, as it defines a set of reusable prompts and not prompt-file mappings.

Here's an example of what the `ai-revision_prompts.yaml` file might look like:
```yaml
prompts:
  intro_prompt: "Prompt text here"
  content_prompts: |
    Revise the following paragraph from a manuscript titled {title}
    so that it sounds like an academic paper.

  my_default: "Revise this paragraph so it sounds nicer."
```

The `ai-revision_config.yaml` file maps the prompt identifiers to the actual files in your manuscript.

An example of the `ai-revision_config.yaml` file:
```yaml
files:
  matchings:
    - files:
        - ^introduction\.md$
      prompt: intro_prompt
    - files:
        - ^methods\.md$
        - ^abstract\.md$
      prompt: content_prompts

  # the special value default_prompt is used when no other regex matches
  # it also uses a prompt identifier taken from ai-revision_prompts.yaml
  default_prompt: my_default

  # any file you want to be skipped can be specified in this list
  ignores:
    - ^ignore_this_file\.md$
```

Multiple regexes can be specified in a list under `files` to match multiple files to a single prompt.

In this case, the `default_prompt` is used when no other regex matches, and it uses a prompt identifier taken from `ai-revision_prompts.yaml`.

The `ignores` list specifies files that should be skipped entirely during the revision process; they won't have the default prompt applied to them.
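
As a rough illustration of how the two files combine, the following Python sketch resolves a filename to a prompt. It is not the tool's actual code: `PROMPTS` and `CONFIG` are hypothetical stand-ins for the parsed YAML files, `SKIP` and `resolve_from_config()` are made-up names, and both the nesting of `default_prompt` and `ignores` under `files` and the choice to check `ignores` before `matchings` are assumptions.

```python
import re

SKIP = "__SKIP__"  # hypothetical sentinel for files listed under `ignores`

# Hypothetical parsed contents of ai-revision_prompts.yaml
PROMPTS = {
    "intro_prompt": "Prompt text here",
    "content_prompts": (
        "Revise the following paragraph from a manuscript titled {title}\n"
        "so that it sounds like an academic paper.\n"
    ),
    "my_default": "Revise this paragraph so it sounds nicer.",
}

# Hypothetical parsed contents of ai-revision_config.yaml
CONFIG = {
    "files": {
        "matchings": [
            {"files": [r"^introduction\.md$"], "prompt": "intro_prompt"},
            {"files": [r"^methods\.md$", r"^abstract\.md$"], "prompt": "content_prompts"},
        ],
        "default_prompt": "my_default",
        "ignores": [r"^ignore_this_file\.md$"],
    }
}


def resolve_from_config(filename, prompts=PROMPTS, config=CONFIG):
    """Map a filename to prompt text, SKIP, or None (no prompt available)."""
    files_cfg = config.get("files", {})

    # Ignored files are skipped entirely; the default prompt is not applied.
    if any(re.search(rx, filename) for rx in files_cfg.get("ignores", [])):
        return SKIP

    # The first matching entry decides which prompt identifier is used.
    for matching in files_cfg.get("matchings", []):
        if any(re.search(rx, filename) for rx in matching["files"]):
            return prompts[matching["prompt"]]

    # Fall back to the default prompt identifier, if one is configured.
    default_key = files_cfg.get("default_prompt")
    return prompts[default_key] if default_key is not None else None
```

With this configuration, `results.md` matches no entry under `matchings` and therefore receives the `my_default` prompt.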


## Functionality Notes

### Filenames as Regular Expressions

Filenames in either approach are specified as regular expressions (aka "regexes").
This allows you to flexibly match multiple files to a prompt with a single expression.

A simple example: to specify an exact match for, say, `myfile.md`, you'd supply the regular expression `^myfile\.md$`, where:
- `^` matches the beginning of the filename
- `\.` matches a literal period -- otherwise, `.` means "any character"
- `$` matches the end of the filename

To illustrate why that syntax is important: if you were to write it as `myfile.md`, the `.` would match any character, so it would match `myfileAmd`, `myfile2md`, etc.
Without the `^` and `$`, it would also match filenames like `asdf_myfile.md`, `myfile.md_asdf`, and `asdf_myfile.md.txt`.

The benefit of using regexes becomes more apparent when you have multiple files.
For example, say you had three files, `02.prior-work.md`, `02.methods.md`, and `02.results.md`. To match all of these, you could use the expression `^02\..*\.md$`.
This would match any file beginning with `02.` and ending with `.md`.
Here, `.` again indicates "any character" and the `*` means "zero or more of the preceding character"; together, they match any sequence of characters.

You can find more information on how to write regular expressions in [Python's `re` module documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).
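
To see these rules in action, you can try the expressions with Python's `re` module. The snippet below is a small sketch using made-up filenames; it assumes the expression may match anywhere in the filename (`re.search`), which is consistent with the unanchored examples above.

```python
import re

names = ["myfile.md", "myfileAmd", "asdf_myfile.md", "myfile.md_asdf"]

strict = re.compile(r"^myfile\.md$")  # anchored, with an escaped period
loose = re.compile(r"myfile.md")      # unanchored, '.' matches any character

strict_hits = [n for n in names if strict.search(n)]
loose_hits = [n for n in names if loose.search(n)]

section_files = ["02.prior-work.md", "02.methods.md", "02.results.md", "03.discussion.md"]
chapter_two = re.compile(r"^02\..*\.md$")
chapter_two_hits = [n for n in section_files if chapter_two.search(n)]

print(strict_hits)       # ['myfile.md']
print(loose_hits)        # all four names
print(chapter_two_hits)  # the three files beginning with '02.'
```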


### Placeholders

The prompt text can include metadata from your manuscript, specified in `content/metadata.yaml` in Manubot. Writing
`{placeholder}` into your prompt text will cause it to be replaced with the corresponding value, drawn either
from the manuscript metadata or from the current file/paragraph being revised.

The following placeholders are available:
- `{title}`: the title of the manuscript, as defined in the metadata
- `{keywords}`: comma-delimited keywords from the manuscript metadata
- `{paragraph_text}`: the text from the current paragraph
- `{section_name}`: the name of the section, derived from the filename; by default it is one of "abstract", "introduction", "results", "discussion", "conclusions", "methods", or "supplementary material".
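
As a minimal sketch of how placeholder substitution behaves, the example below fills a prompt template with Python's `str.format()`. The template text and the metadata values are invented for illustration, and the use of `str.format()` is an assumption about the mechanism rather than a description of the tool's internals.

```python
# Hypothetical prompt text, as it might appear in ai-revision_prompts.yaml
prompt_template = (
    "Revise the following paragraph from the {section_name} section of a "
    "manuscript titled '{title}' (keywords: {keywords})."
)

# Hypothetical values; in practice these come from the manuscript metadata
# and from the file/paragraph currently being revised.
placeholders = {
    "title": "An Example Manuscript",
    "keywords": ", ".join(["genomics", "machine learning"]),
    "section_name": "methods",
    "paragraph_text": "Text of the paragraph being revised.",
}

# Placeholders that the template does not mention are simply left unused.
prompt = prompt_template.format(**placeholders)
print(prompt)
```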

The `section_name` placeholder is resolved as follows:
- If the environment variable `AI_EDITOR_FILENAME_SECTION_MAPPING` is set, it is interpreted as a dictionary mapping filenames to section names.
  If a key of the dictionary is contained in the filename, the corresponding value is used as the section name.
  The keys and values can be any strings, not just the section names mentioned before.
- If that variable is unset, or the filename doesn't match any of its keys, the filename is checked against the following values: "introduction", "methods", "results", "discussion", "conclusions", or "supplementary".
  If one of these values is contained in the filename, it becomes the section name; "supplementary" is replaced with "supplementary material", while the others are used as is.
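
The lookup order described above can be sketched as follows. This is an illustrative approximation only: `infer_section_name()` and `BUILTIN_SECTIONS` are hypothetical names, and parsing the environment variable as JSON is an assumption, since this document only says that it is interpreted as a dictionary.

```python
import json
import os

# Values the filename is checked against when no custom mapping applies
BUILTIN_SECTIONS = [
    "introduction",
    "methods",
    "results",
    "discussion",
    "conclusions",
    "supplementary",
]


def infer_section_name(filename, environ=os.environ):
    """Return a section name for `filename`, or None if none can be inferred."""
    raw = environ.get("AI_EDITOR_FILENAME_SECTION_MAPPING", "").strip()
    if raw:
        mapping = json.loads(raw)  # assumption: the value is JSON-encoded
        for key, section in mapping.items():
            if key in filename:
                return section

    for section in BUILTIN_SECTIONS:
        if section in filename:
            # "supplementary" is expanded; the other values are used as is
            return "supplementary material" if section == "supplementary" else section
    return None
```

For example, setting the variable to `{"prior-work": "related work"}` would, under these assumptions, give `05.prior-work.md` the section name "related work".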
45 changes: 35 additions & 10 deletions libs/manubot_ai_editor/editor.py
@@ -3,6 +3,7 @@
 from pathlib import Path
 
 from manubot_ai_editor import env_vars
+from manubot_ai_editor.prompt_config import ManuscriptPromptConfig, IGNORE_FILE
 from manubot_ai_editor.models import ManuscriptRevisionModel
 from manubot_ai_editor.utils import (
     get_yaml_field,
@@ -27,6 +28,10 @@ def __init__(self, content_dir: str | Path):
         self.title = get_yaml_field(metadata_file, "title")
         self.keywords = get_yaml_field(metadata_file, "keywords")
 
+        self.prompt_config = ManuscriptPromptConfig(
+            content_dir=content_dir, title=self.title, keywords=self.keywords
+        )
+
     @staticmethod
     def prepare_paragraph(paragraph: list[str]) -> str:
         """
@@ -81,6 +86,7 @@ def revise_and_write_paragraph(
         paragraph: list[str],
         revision_model: ManuscriptRevisionModel,
         section_name: str = None,
+        resolved_prompt: str = None,
         outfile=None,
     ) -> None | tuple[str, str]:
         """
@@ -89,6 +95,7 @@ def revise_and_write_paragraph(
         Arguments:
             paragraph: list of lines of the paragraph.
             section_name: name of the section the paragraph belongs to.
+            resolved_prompt: a prompt resolved via the ai_revision prompt config; None if unavailable
             revision_model: model to use for revision.
             outfile: file object to write the revised paragraph to.
 
@@ -114,8 +121,7 @@ def revise_and_write_paragraph(
         error_message = None
         try:
             paragraph_revised = revision_model.revise_paragraph(
-                paragraph_text,
-                section_name,
+                paragraph_text, section_name, resolved_prompt=resolved_prompt
             )
 
             if paragraph_revised.strip() == "":
@@ -248,6 +254,7 @@ def revise_file(
         output_dir: Path | str,
         revision_model: ManuscriptRevisionModel,
         section_name: str = None,
+        resolved_prompt: str = None,
     ):
         """
         It revises an entire Markdown file and writes the revised file to the output directory.
@@ -258,6 +265,7 @@ def revise_file(
             output_dir (Path | str): path to the directory where the revised file will be written.
             revision_model (ManuscriptRevisionModel): model to use for revision.
             section_name (str, optional): Defaults to None. If so, it will be inferred from the filename.
+            resolved_prompt (str, optional): A prompt resolved via ai_revision prompt config files, which overrides any custom or section-derived prompts; None if unavailable.
         """
         input_filepath = self.content_dir / input_filename
         assert input_filepath.exists(), f"Input file {input_filepath} does not exist"
@@ -376,7 +384,11 @@ def revise_file(
 
                     # revise and write paragraph to output file
                     self.revise_and_write_paragraph(
-                        paragraph, revision_model, section_name, outfile
+                        paragraph,
+                        revision_model,
+                        section_name,
+                        resolved_prompt=resolved_prompt,
+                        outfile=outfile,
                     )
 
                     # clear the paragraph list
@@ -418,7 +430,11 @@ def revise_file(
             # output file
             if paragraph:
                 self.revise_and_write_paragraph(
-                    paragraph, revision_model, section_name, outfile
+                    paragraph,
+                    revision_model,
+                    section_name,
+                    resolved_prompt=resolved_prompt,
+                    outfile=outfile,
                 )
 
     def revise_manuscript(
@@ -446,14 +462,22 @@ def revise_manuscript(
             filenames_to_revise = None
 
         for filename in sorted(self.content_dir.glob("*.md")):
-            # ignore front-matter file
-            if "front-matter" in filename.name:
-                continue
-
             filename_section = self.get_section_from_filename(filename.name)
 
-            # we do not process the file if it has no section and there is no custom prompt
-            if filename_section is None and (
+            # use the ai_revision prompt config to attempt to resolve a prompt
+            resolved_prompt, _ = self.prompt_config.get_prompt_for_filename(
+                filename.name
+            )
+
+            # ignore the file if the ai-revision_* config files told us to
+            if resolved_prompt == IGNORE_FILE:
+                continue
+
+            # we do not process the file if all hold:
+            # 1. it has no section *or* resolved prompt
+            # 2. we're unable to resolve it via ai_revision prompt configuration
+            # 3. there is no custom prompt
+            if (filename_section is None and resolved_prompt is None) and (
                 env_vars.CUSTOM_PROMPT not in os.environ
                 or os.environ[env_vars.CUSTOM_PROMPT].strip() == ""
             ):
@@ -472,4 +496,5 @@ def revise_manuscript(
                 output_dir,
                 revision_model,
                 section_name=filename_section,
+                resolved_prompt=resolved_prompt,
             )