Merge pull request #37 from falquaddoomi/issue-31-customprompts-yaml
Custom Prompts via YAML
falquaddoomi authored Jul 9, 2024
2 parents 4aa8d0b + 310e9c4 commit 7a33085
Showing 62 changed files with 22,176 additions and 1,180 deletions.
29 changes: 29 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,29 @@
default_language_version:
  python: python3.10
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      # Check for files that contain merge conflict strings.
      - id: check-merge-conflict
      # Check for debugger imports and py37+ `breakpoint()` calls in python source.
      - id: debug-statements
      # Replaces or checks mixed line ending
      - id: mixed-line-ending
      # Check for files that would conflict in case-insensitive filesystems
      - id: check-case-conflict
      # This hook checks toml files for parseable syntax.
      - id: check-toml
      # This hook checks yaml files for parseable syntax.
      - id: check-yaml
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.2.0
    hooks:
      - id: ruff
        args:
          - --fix
  - repo: https://github.com/python/black
    rev: 24.1.1
    hooks:
      - id: black
        language_version: python3
15 changes: 15 additions & 0 deletions README.md
@@ -12,6 +12,21 @@ The Manubot AI Editor can be used from the GitHub repository of a Manubot-based
You first need to follow the steps to [set up a Manubot-based manuscript](https://github.com/manubot/rootstock).
Then, follow [these instructions](https://github.com/manubot/rootstock/blob/main/USAGE.md#ai-assisted-authoring) to set up a workflow in GitHub Actions that will allow you to quickly trigger a job to revise your manuscript.

### Configuring Prompts

In order to revise your manuscript, prompts must be provided to the AI model. There are two ways to do this:
- **Default prompts**: you can use the default prompts provided by the tool, in which case you don't need to do anything.
- **Custom prompts**: you can define your own prompts to apply to specific files using YAML configuration files that you include with your manuscript.

The default prompt, which should work for most manuscripts, is the following:

```
Proofread the following paragraph that is part of a scientific manuscript.
Keep all Markdown formatting, citations to other articles, mathematical expressions, and equations.
```

If you wish to customize the prompts on a per-file basis, see [docs/custom-prompts.md](docs/custom-prompts.md) for more information.

### Command line

To use the tool from the command line, you first need to install Manubot in a Python environment:
131 changes: 131 additions & 0 deletions docs/custom-prompts.md
@@ -0,0 +1,131 @@
# Custom Prompts

Rather than using the default prompt, you can specify custom prompts for each file in your manuscript.
This can be useful when you want specific sections of your manuscript to be revised in specific ways, or not revised at all.

There are two ways that you can use the custom prompts system:
1. You can define your prompts and how they map to your manuscript files in a single file, `ai-revision_prompts.yaml`.
2. You can create `ai-revision_prompts.yaml` with only prompts and their identifiers, which makes it suitable for sharing with others whose manuscript files have different names.
You would then add a second file, `ai-revision_config.yaml`, that maps the prompt identifiers to the actual files in your manuscript.

These files should be placed in the `content` directory alongside your manuscript markdown files.

See [Functionality Notes](#functionality-notes) later in this document for more information on how to write regular expressions and use placeholders in your prompts.

## Approach 1: Single file

With this approach, you can define your prompts and how they map to your manuscript files in a single file.
The single file should be named `ai-revision_prompts.yaml` and placed in the `content` folder.

The file would look something like the following:

```yaml
prompts_files:
  # filenames are specified as regular expressions
  # in this case, we match a file named exactly 'filename.md'
  ^filename\.md$: "Prompt text here"

  # you can use YAML's multi-line string syntax to write longer prompts
  # you can also use {placeholders} to include metadata from your manuscript
  ^filename\.md$: |
    Revise the following paragraph from a manuscript titled {title}
    so that it sounds like an academic paper.
  # specifying the special value 'null' will skip revising any files that
  # match this regular expression
  ^ignore_this_file\.md$: null
```

Note that, for each file, the first matching regular expression will determine its prompt or whether the file is skipped.
Even if a file matches multiple regexes, only the first one will be used.
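
As a rough illustration of the first-match rule, the following Python sketch walks a mapping in file order and stops at the first regex that matches. The function name `first_matching_prompt`, the sample prompts, and the use of `re.search` are assumptions made for this example; they are not taken from the tool's implementation.

```python
import re

# Hypothetical in-memory form of a `prompts_files` mapping.
# None stands in for YAML's `null`, i.e. "skip this file".
prompts_files = {
    r"^abstract\.md$": "Revise this abstract.",
    r"^ignore_this_file\.md$": None,
    r"^.*\.md$": "Proofread this paragraph.",
}


def first_matching_prompt(filename, mapping):
    """Return (matched, prompt) for the first regex that matches, in mapping order."""
    for pattern, prompt in mapping.items():
        if re.search(pattern, filename):
            return True, prompt
    return False, None


print(first_matching_prompt("abstract.md", prompts_files))
print(first_matching_prompt("ignore_this_file.md", prompts_files))
print(first_matching_prompt("notes.txt", prompts_files))
```

Because the catch-all `^.*\.md$` entry comes last, the more specific entries above it take precedence; moving it to the top would make it win for every Markdown file.
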

## Approach 2: Prompt file plus configuration file

In this case, we specify two files, `ai-revision_prompts.yaml` and `ai-revision_config.yaml`.

The `ai-revision_prompts.yaml` file contains only the prompts and their identifiers.
The top-level element is `prompts` in this case rather than `prompts_files`, as it defines a set of reusable prompts and not prompt-file mappings.

Here's an example of what the `ai-revision_prompts.yaml` file might look like:
```yaml
prompts:
  intro_prompt: "Prompt text here"
  content_prompts: |
    Revise the following paragraph from a manuscript titled {title}
    so that it sounds like an academic paper.
  my_default: "Revise this paragraph so it sounds nicer."
```

The `ai-revision_config.yaml` file maps the prompt identifiers to the actual files in your manuscript.

An example of the `ai-revision_config.yaml` file:
```yaml
files:
  matchings:
    - files:
        - ^introduction\.md$
      prompt: intro_prompt
    - files:
        - ^methods\.md$
        - ^abstract\.md$
      prompt: content_prompts
  # the special value default_prompt is used when no other regex matches
  # it also uses a prompt identifier taken from ai-revision_prompts.yaml
  default_prompt: my_default
  # any file you want to be skipped can be specified in this list
  ignores:
    - ^ignore_this_file\.md$
```

Multiple regexes can be specified in a list under `files` to match multiple files to a single prompt.

In this case, the `default_prompt` is used when no other regex matches, and it uses a prompt identifier taken from `ai-revision_prompts.yaml`.

The `ignores` list specifies files that should be skipped entirely during the revision process; they won't have the default prompt applied to them.
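
To make the lookup order concrete, here is a minimal Python sketch of how a filename could be resolved against the two files. It assumes that `matchings`, `default_prompt`, and `ignores` all sit under the top-level `files` key, and that ignores are checked before anything else. The names `resolve_prompt` and `IGNORE` are hypothetical and chosen only for this example.

```python
import re

IGNORE = object()  # hypothetical sentinel meaning "skip this file"

# Parsed contents of ai-revision_prompts.yaml (assumed shape).
prompts = {
    "intro_prompt": "Prompt text here",
    "content_prompts": "Revise the following paragraph so that it sounds like an academic paper.",
    "my_default": "Revise this paragraph so it sounds nicer.",
}

# Parsed contents of ai-revision_config.yaml (assumed shape).
config = {
    "files": {
        "matchings": [
            {"files": [r"^introduction\.md$"], "prompt": "intro_prompt"},
            {"files": [r"^methods\.md$", r"^abstract\.md$"], "prompt": "content_prompts"},
        ],
        "default_prompt": "my_default",
        "ignores": [r"^ignore_this_file\.md$"],
    }
}


def resolve_prompt(filename, prompts, config):
    """Return the prompt text for a filename, IGNORE, or None if nothing applies."""
    files_cfg = config.get("files", {})
    # Ignored files are skipped outright and never receive the default prompt.
    if any(re.search(p, filename) for p in files_cfg.get("ignores", [])):
        return IGNORE
    # Each matching maps a list of regexes to one prompt identifier.
    for matching in files_cfg.get("matchings", []):
        if any(re.search(p, filename) for p in matching["files"]):
            return prompts[matching["prompt"]]
    # Fall back to the default prompt identifier, if one is configured.
    default_id = files_cfg.get("default_prompt")
    return prompts.get(default_id) if default_id else None


print(resolve_prompt("abstract.md", prompts, config))
print(resolve_prompt("results.md", prompts, config))
print(resolve_prompt("ignore_this_file.md", prompts, config) is IGNORE)
```
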


## Functionality Notes

### Filenames as Regular Expressions

Filenames in either approach are specified as regular expressions (aka "regexes").
This allows you to flexibly match multiple files to a prompt with a single expression.

A simple example: to specify an exact match for, say, `myfile.md`, you'd supply the regular expression `^myfile\.md$`, where:
- `^` matches the beginning of the filename
- `\.` matches a literal period -- otherwise, `.` means "any character"
- `$` matches the end of the filename

To illustrate why that syntax is important: if you were to write it as `myfile.md`, the `.` would match any character, so it would match `myfileAmd`, `myfile2md`, etc.
Without the `^` and `$`, it would also match filenames like `asdf_myfile.md`, `myfile.md_asdf`, and `asdf_myfile.md.txt`.

The benefit of using regexes becomes more apparent when you have multiple files.
For example, say you had three files, `02.prior-work.md`, `02.methods.md`, and `02.results.md`. To match all of these, you could use the expression `^02\..*\.md$`.
This would match any file beginning with `02.` and ending with `.md`.
Here, `.` again indicates "any character" and the `*` means "zero or more of the preceding character"; together, they match any sequence of characters.

You can find more information on how to write regular expressions in [Python's `re` module documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).
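
The difference between the anchored and unanchored forms can be checked directly with Python's `re` module. This is a small sketch; it assumes patterns are applied with `re.search`, which is consistent with the unanchored behaviour described above but is not confirmed here as the tool's exact call.

```python
import re

names = ["myfile.md", "myfileAmd", "asdf_myfile.md", "myfile.md_asdf"]

strict = re.compile(r"^myfile\.md$")  # anchored, with an escaped period
loose = re.compile(r"myfile.md")      # unanchored, `.` matches any character

for name in names:
    print(f"{name}: strict={bool(strict.search(name))} loose={bool(loose.search(name))}")

# One expression covering every file that starts with "02." and ends with ".md".
section_two = re.compile(r"^02\..*\.md$")
candidates = ["02.prior-work.md", "02.methods.md", "02.results.md", "03.discussion.md"]
print([name for name in candidates if section_two.search(name)])
```

Only `myfile.md` satisfies the strict pattern, while the loose pattern accepts all four names.
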


### Placeholders

The prompt text can include metadata from your manuscript, specified in `content/metadata.yaml` in Manubot. Writing
`{placeholder}` into your prompt text will cause it to be replaced with the corresponding value, drawn either
from the manuscript metadata or from the current file/paragraph being revised.

The following placeholders are available:
- `{title}`: the title of the manuscript, as defined in the metadata
- `{keywords}`: comma-delimited keywords from the manuscript metadata
- `{paragraph_text}`: the text from the current paragraph
- `{section_name}`: the name of the section (which is one of the following values "abstract", "introduction", "results", "discussion", "conclusions", "methods" or "supplementary material"), derived from the filename.
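
As a sketch of what placeholder substitution amounts to, the example below fills a prompt template using Python's `str.format`. This is a stand-in chosen for illustration: the function `fill_placeholders` is hypothetical, and the tool's own substitution mechanism may differ.

```python
def fill_placeholders(prompt, *, title, keywords, paragraph_text, section_name):
    """Substitute the documented placeholders into a prompt template."""
    return prompt.format(
        title=title,
        keywords=", ".join(keywords),  # keywords are rendered comma-delimited
        paragraph_text=paragraph_text,
        section_name=section_name,
    )


template = (
    "Revise the following paragraph from the {section_name} section "
    "of a manuscript titled '{title}' (keywords: {keywords}):\n{paragraph_text}"
)

print(
    fill_placeholders(
        template,
        title="An example manuscript",
        keywords=["manubot", "editing"],
        paragraph_text="This are a sentence.",
        section_name="introduction",
    )
)
```

One caveat of this particular stand-in: with `str.format`, literal braces in a prompt would have to be doubled (`{{` and `}}`).
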

The `section_name` placeholder works like so:
- If the environment variable `AI_EDITOR_FILENAME_SECTION_MAPPING` is set, it is interpreted as a dictionary mapping filenames to section names.
If a key of the dictionary is contained in the filename, its value is used as the section name.
The keys and values can be any strings, not just the section names listed above.
- If that dictionary is unset or none of its keys match the filename, the filename is matched against the following values: "introduction", "methods", "results", "discussion", "conclusions" or "supplementary".
If one of these values is contained in the filename, it is used as the section name. "supplementary" is replaced with "supplementary material", but the others are used as is.
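
The two-step lookup described above can be sketched as follows. This example assumes the environment variable holds a JSON-encoded dictionary and that matching is a plain substring test; the function name `section_from_filename` and its `mapping_json` parameter are hypothetical.

```python
import json

# Default section names checked when no custom mapping applies.
DEFAULT_SECTIONS = [
    "introduction",
    "methods",
    "results",
    "discussion",
    "conclusions",
    "supplementary",
]


def section_from_filename(filename, mapping_json=None):
    """Infer a section name from a filename, or return None if nothing matches."""
    # Step 1: a user-supplied mapping (assumed to be JSON) takes precedence.
    if mapping_json:
        for key, value in json.loads(mapping_json).items():
            if key in filename:
                return value
    # Step 2: fall back to the built-in section names.
    for section in DEFAULT_SECTIONS:
        if section in filename:
            return "supplementary material" if section == "supplementary" else section
    return None


print(section_from_filename("02.introduction.md"))
print(section_from_filename("10.supplementary.md"))
print(section_from_filename("01.abstract.md"))
print(section_from_filename("01.abstract.md", '{"abstract": "abstract"}'))
```

Note that "abstract" is not in the fallback list above, so in this sketch a file such as `01.abstract.md` only receives a section name when the custom mapping provides one.
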
45 changes: 35 additions & 10 deletions libs/manubot_ai_editor/editor.py
@@ -3,6 +3,7 @@
 from pathlib import Path

 from manubot_ai_editor import env_vars
+from manubot_ai_editor.prompt_config import ManuscriptPromptConfig, IGNORE_FILE
 from manubot_ai_editor.models import ManuscriptRevisionModel
 from manubot_ai_editor.utils import (
     get_yaml_field,
@@ -27,6 +28,10 @@ def __init__(self, content_dir: str | Path):
         self.title = get_yaml_field(metadata_file, "title")
         self.keywords = get_yaml_field(metadata_file, "keywords")

+        self.prompt_config = ManuscriptPromptConfig(
+            content_dir=content_dir, title=self.title, keywords=self.keywords
+        )
+
     @staticmethod
     def prepare_paragraph(paragraph: list[str]) -> str:
         """
@@ -81,6 +86,7 @@ def revise_and_write_paragraph(
         paragraph: list[str],
         revision_model: ManuscriptRevisionModel,
         section_name: str = None,
+        resolved_prompt: str = None,
         outfile=None,
     ) -> None | tuple[str, str]:
         """
@@ -89,6 +95,7 @@ def revise_and_write_paragraph(
         Arguments:
             paragraph: list of lines of the paragraph.
             section_name: name of the section the paragraph belongs to.
+            resolved_prompt: a prompt resolved via the ai_revision prompt config; None if unavailable
             revision_model: model to use for revision.
             outfile: file object to write the revised paragraph to.
@@ -114,8 +121,7 @@ def revise_and_write_paragraph(
         error_message = None
         try:
             paragraph_revised = revision_model.revise_paragraph(
-                paragraph_text,
-                section_name,
+                paragraph_text, section_name, resolved_prompt=resolved_prompt
             )

             if paragraph_revised.strip() == "":
@@ -248,6 +254,7 @@ def revise_file(
         output_dir: Path | str,
         revision_model: ManuscriptRevisionModel,
         section_name: str = None,
+        resolved_prompt: str = None,
     ):
         """
         It revises an entire Markdown file and writes the revised file to the output directory.
@@ -258,6 +265,7 @@ def revise_file(
             output_dir (Path | str): path to the directory where the revised file will be written.
             revision_model (ManuscriptRevisionModel): model to use for revision.
             section_name (str, optional): Defaults to None. If so, it will be inferred from the filename.
+            resolved_prompt (str, optional): A prompt resolved via ai_revision prompt config files, which overrides any custom or section-derived prompts; None if unavailable.
         """
         input_filepath = self.content_dir / input_filename
         assert input_filepath.exists(), f"Input file {input_filepath} does not exist"
@@ -376,7 +384,11 @@ def revise_file(

                     # revise and write paragraph to output file
                     self.revise_and_write_paragraph(
-                        paragraph, revision_model, section_name, outfile
+                        paragraph,
+                        revision_model,
+                        section_name,
+                        resolved_prompt=resolved_prompt,
+                        outfile=outfile,
                     )

                     # clear the paragraph list
@@ -418,7 +430,11 @@ def revise_file(
             # output file
             if paragraph:
                 self.revise_and_write_paragraph(
-                    paragraph, revision_model, section_name, outfile
+                    paragraph,
+                    revision_model,
+                    section_name,
+                    resolved_prompt=resolved_prompt,
+                    outfile=outfile,
                 )

     def revise_manuscript(
@@ -446,14 +462,22 @@ def revise_manuscript(
             filenames_to_revise = None

         for filename in sorted(self.content_dir.glob("*.md")):
-            # ignore front-matter file
-            if "front-matter" in filename.name:
-                continue
-
             filename_section = self.get_section_from_filename(filename.name)

-            # we do not process the file if it has no section and there is no custom prompt
-            if filename_section is None and (
+            # use the ai_revision prompt config to attempt to resolve a prompt
+            resolved_prompt, _ = self.prompt_config.get_prompt_for_filename(
+                filename.name
+            )
+
+            # ignore the file if the ai-revision_* config files told us to
+            if resolved_prompt == IGNORE_FILE:
+                continue
+
+            # we do not process the file if all hold:
+            # 1. it has no section
+            # 2. we're unable to resolve it via ai_revision prompt configuration
+            # 3. there is no custom prompt
+            if (filename_section is None and resolved_prompt is None) and (
                 env_vars.CUSTOM_PROMPT not in os.environ
                 or os.environ[env_vars.CUSTOM_PROMPT].strip() == ""
             ):
@@ -472,4 +496,5 @@ def revise_manuscript(
                 output_dir,
                 revision_model,
                 section_name=filename_section,
+                resolved_prompt=resolved_prompt,
             )
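
The per-file skip decision in the `revise_manuscript` hunk can be summarized as a small predicate. This is a simplified sketch rather than the code from the commit: `should_revise` is a hypothetical helper, the sentinel value assigned to `IGNORE_FILE` here is a placeholder (the real constant is imported from `prompt_config`), and the custom prompt is passed in as an argument instead of being read from `os.environ`.

```python
IGNORE_FILE = "__IGNORE_FILE__"  # placeholder value; the real constant lives in prompt_config


def should_revise(filename_section, resolved_prompt, custom_prompt):
    """Return True if a file should be revised under the rules in the hunk above."""
    # Files explicitly ignored by the ai-revision_* config are always skipped.
    if resolved_prompt == IGNORE_FILE:
        return False
    has_custom = custom_prompt is not None and custom_prompt.strip() != ""
    # Skip only when there is no section, no resolved prompt, and no custom prompt.
    if filename_section is None and resolved_prompt is None and not has_custom:
        return False
    return True


print(should_revise("introduction", None, None))
print(should_revise(None, "Revise this paragraph.", None))
print(should_revise(None, None, "   "))
print(should_revise("methods", IGNORE_FILE, "Custom prompt"))
```
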