Merge pull request #37 from falquaddoomi/issue-31-customprompts-yaml

Custom Prompts via YAML
manubot · Jul 9, 2024 · 7a33085
2 parents 4aa8d0b + 310e9c4
Showing 62 changed files with 22,176 additions and 1,180 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,29 @@
+default_language_version:
+  python: python3.10
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.5.0
+    hooks:
+      # Check for files that contain merge conflict strings.
+      - id: check-merge-conflict
+      # Check for debugger imports and py37+ `breakpoint()` calls in python source.
+      - id: debug-statements
+      # Replaces or checks mixed line ending
+      - id: mixed-line-ending
+      # Check for files that would conflict in case-insensitive filesystems
+      - id: check-case-conflict
+      # This hook checks toml files for parseable syntax.
+      - id: check-toml
+      # This hook checks yaml files for parseable syntax.
+      - id: check-yaml
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    rev: v0.2.0
+    hooks:
+      - id: ruff
+        args:
+        - --fix
+  - repo: https://github.com/python/black
+    rev: 24.1.1
+    hooks:
+      - id: black
+        language_version: python3
diff --git a/README.md b/README.md
@@ -12,6 +12,21 @@ The Manubot AI Editor can be used from the GitHub repository of a Manubot-based
 You first need to follow the steps to [set up a Manubot-based manuscript](https://github.com/manubot/rootstock).
 Then, follow [these instructions](https://github.com/manubot/rootstock/blob/main/USAGE.md#ai-assisted-authoring) to set up a workflow in GitHub Actions that will allow you to quickly trigger a job to revise your manuscript.
 
+### Configuring Prompts
+
+In order to revise your manuscript, prompts must be provided to the AI model. There are two ways to do this:
+- **Default prompts**: you can use the default prompts provided by the tool, in which case you don't need to do anything.
+- **Custom prompts**: you can define your own prompts to apply to specific files using YAML configuration files that you include with your manuscript.
+
+The default prompt, which should work for most manuscripts, is the following:
+
+```
+Proofread the following paragraph that is part of a scientific manuscript.
+Keep all Markdown formatting, citations to other articles, mathematical expressions, and equations.
+```
+
+If you wish to customize the prompts on a per-file basis, see [docs/custom-prompts.md](docs/custom-prompts.md) for more information.
+
 ### Command line
 
 To use the tool from the command line, you first need to install Manubot in a Python environment:

diff --git a/docs/custom-prompts.md b/docs/custom-prompts.md
@@ -0,0 +1,131 @@
+# Custom Prompts
+
+Rather than using the default prompt, you can specify custom prompts for each file in your manuscript.
+This can be useful when you want specific sections of your manuscript to be revised in specific ways, or not revised at all.
+
+There are two ways that you can use the custom prompts system:
+1. You can define your prompts and how they map to your manuscript files in a single file, `ai-revision_prompts.yaml`.
+2. You can create the `ai-revision_prompts.yaml`, but only specify prompts and identifiers, which makes it suitable for sharing with others who have different names for their manuscripts' files.
+You would then specify a second file, `ai-revision_config.yaml`, that maps the prompt identifiers to the actual files in your manuscript.
+
+These files should be placed in the `content` directory alongside your manuscript markdown files.
+
+See [Functionality Notes](#functionality-notes) later in this document for more information on how to write regular expressions and use placeholders in your prompts.
+
+## Approach 1: Single file
+
+With this approach, you can define your prompts and how they map to your manuscript files in a single file.
+The single file should be named `ai-revision_prompts.yaml` and placed in the `content` folder.
+
+The file would look something like the following:
+
+```yaml
+prompts_files:
+  # filenames are specified as regular expressions
+  # in this case, we match a file named exactly 'filename.md'
+  ^filename\.md$: "Prompt text here"
+
+  # you can use YAML's multi-line string syntax to write longer prompts
+  # you can also use {placeholders} to include metadata from your manuscript
+  ^other_filename\.md$: |
+    Revise the following paragraph from a manuscript titled {title}
+    so that it sounds like an academic paper.
+
+  # specifying the special value 'null' will skip revising any files that
+  # match this regular expression
+  ^ignore_this_file\.md$: null
+```
+
+Note that, for each file, the first matching regular expression will determine its prompt or whether the file is skipped.
+Even if a file matches multiple regexes, only the first one will be used.
+
+
+## Approach 2: Prompt file plus configuration file
+
+In this case, we specify two files, `ai-revision_prompts.yaml` and `ai-revision_config.yaml`.
+
+The `ai-revision_prompts.yaml` file contains only the prompts and their identifiers.
+The top-level element is `prompts` in this case rather than `prompts_files`, as it defines a set of reusable prompts and not prompt-file mappings.
+
+Here's an example of what the `ai-revision_prompts.yaml` file might look like:
+```yaml
+prompts:
+  intro_prompt: "Prompt text here"
+  content_prompts: |
+    Revise the following paragraph from a manuscript titled {title}
+    so that it sounds like an academic paper.
+
+  my_default: "Revise this paragraph so it sounds nicer."
+```
+
+The `ai-revision_config.yaml` file maps the prompt identifiers to the actual files in your manuscript.
+
+An example of the `ai-revision_config.yaml` file:
+```yaml
+files:
+  matchings:
+    - files:
+        - ^introduction\.md$
+      prompt: intro_prompt
+    - files:
+        - ^methods\.md$
+        - ^abstract\.md$
+      prompt: content_prompts
+
+  # the special value default_prompt is used when no other regex matches
+  # it also uses a prompt identifier taken from ai-revision_prompts.yaml
+  default_prompt: my_default
+
+  # any file you want to be skipped can be specified in this list
+  ignores:
+    - ^ignore_this_file\.md$
+```
+
+Multiple regexes can be specified in a list under `files` to match multiple files to a single prompt.
+
+In this case, the `default_prompt` is used when no other regex matches, and it uses a prompt identifier taken from `ai-revision_prompts.yaml`.
+
+The `ignores` list specifies files that should be skipped entirely during the revision process; they won't have the default prompt applied to them.
+
+
+## Functionality Notes
+
+### Filenames as Regular Expressions
+
+Filenames in either approach are specified as regular expressions (aka "regexes").
+This allows you to flexibly match multiple files to a prompt with a single expression.
+
+A simple example: to specify an exact match for, say, `myfile.md`, you'd supply the regular expression `^myfile\.md$`, where:
+- `^` matches the beginning of the filename
+- `\.` matches a literal period -- otherwise, `.` means "any character"
+- `$` matches the end of the filename
+
+To illustrate why that syntax is important: if you were to write it as `myfile.md`, the `.` would match any character, so it would match `myfileAmd`, `myfile2md`, etc.
+Without the `^` and `$`, it would also match filenames like `asdf_myfile.md`, `myfile.md_asdf`, and `asdf_myfile.md.txt`.
+
+The benefit of using regexes becomes more apparent when you have multiple files.
+For example, say you had three files, `02.prior-work.md`, `02.methods.md`, and `02.results.md`. To match all of these, you could use the expression `^02\..*\.md$`.
+This would match any file beginning with `02.` and ending with `.md`.
+Here, `.` again indicates "any character" and the `*` means "zero or more of the preceding character"; together, they match any sequence of characters.
+
+You can find more information on how to write regular expressions in [Python's `re` module documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).
+
+
+### Placeholders
+
+The prompt text can include metadata from your manuscript, specified in `content/metadata.yaml` in Manubot. Writing
+`{placeholder}` into your prompt text will cause it to be replaced with the corresponding value, drawn either
+from the manuscript metadata or from the current file/paragraph being revised.
+
+The following placeholders are available:
+- `{title}`: the title of the manuscript, as defined in the metadata
+- `{keywords}`: comma-delimited keywords from the manuscript metadata
+- `{paragraph_text}`: the text from the current paragraph
+- `{section_name}`: the name of the section (which is one of the following values: "abstract", "introduction", "results", "discussion", "conclusions", "methods" or "supplementary material"), derived from the filename.
+
+The `section_name` placeholder works like so:
+- If the environment variable `AI_EDITOR_FILENAME_SECTION_MAPPING` is set, it is interpreted as a dictionary mapping filenames to section names.
+If a key of the dictionary is contained in the filename, its value is used as the section name.
+The keys and values can be any strings, not just the section names listed above.
+- If the dictionary mentioned above is unset or the filename doesn't match any of its keys, the filename will be matched against the following values: "introduction", "methods", "results", "discussion", "conclusions" or "supplementary".
+If one of these values is contained in the filename, it is used as the section name. "supplementary" is replaced with "supplementary material", but the others are used as is.
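
Note on `docs/custom-prompts.md`: the first-match rule and the `{placeholder}` substitution described in that file can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `prompt_config.py`, which this excerpt does not show. The names `PROMPTS_FILES` and `resolve_prompt` are hypothetical, and the use of `re.search` and `str.format` is assumed from the documented behavior.

```python
import re

# Hypothetical stand-in for a parsed `prompts_files` mapping.
# Order matters: the first matching regex decides. None marks a file to skip.
PROMPTS_FILES = {
    r"^02\..*\.md$": "Revise this paragraph from a manuscript titled {title}.",
    r"^ignore_this_file\.md$": None,
    r"\.md$": "Proofread the following paragraph.",
}


def resolve_prompt(filename, prompts_files, metadata):
    """Return (matched, prompt) for a filename.

    matched is True if any regex matched; prompt is None when the file
    should be skipped or when nothing matched.
    """
    for pattern, prompt in prompts_files.items():
        # Assumption: search semantics, so unanchored patterns match anywhere.
        if re.search(pattern, filename):
            if prompt is None:
                return True, None
            # Assumption: placeholders are filled with str.format.
            return True, prompt.format(**metadata)
    return False, None
```

With this mapping, `02.methods.md` resolves to the first prompt even though it also matches the last pattern, which is why more specific patterns should be listed first.
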
diff --git a/libs/manubot_ai_editor/editor.py b/libs/manubot_ai_editor/editor.py
@@ -3,6 +3,7 @@
 from pathlib import Path
 
 from manubot_ai_editor import env_vars
+from manubot_ai_editor.prompt_config import ManuscriptPromptConfig, IGNORE_FILE
 from manubot_ai_editor.models import ManuscriptRevisionModel
 from manubot_ai_editor.utils import (
     get_yaml_field,
@@ -27,6 +28,10 @@ def __init__(self, content_dir: str | Path):
         self.title = get_yaml_field(metadata_file, "title")
         self.keywords = get_yaml_field(metadata_file, "keywords")
 
+        self.prompt_config = ManuscriptPromptConfig(
+            content_dir=content_dir, title=self.title, keywords=self.keywords
+        )
+
     @staticmethod
     def prepare_paragraph(paragraph: list[str]) -> str:
         """
@@ -81,6 +86,7 @@ def revise_and_write_paragraph(
         paragraph: list[str],
         revision_model: ManuscriptRevisionModel,
         section_name: str = None,
+        resolved_prompt: str = None,
         outfile=None,
     ) -> None | tuple[str, str]:
         """
@@ -89,6 +95,7 @@ def revise_and_write_paragraph(
         Arguments:
             paragraph: list of lines of the paragraph.
             section_name: name of the section the paragraph belongs to.
+            resolved_prompt: a prompt resolved via the ai_revision prompt config; None if unavailable
             revision_model: model to use for revision.
             outfile: file object to write the revised paragraph to.
 
@@ -114,8 +121,7 @@ def revise_and_write_paragraph(
         error_message = None
         try:
             paragraph_revised = revision_model.revise_paragraph(
-                paragraph_text,
-                section_name,
+                paragraph_text, section_name, resolved_prompt=resolved_prompt
             )
 
             if paragraph_revised.strip() == "":
@@ -248,6 +254,7 @@ def revise_file(
         output_dir: Path | str,
         revision_model: ManuscriptRevisionModel,
         section_name: str = None,
+        resolved_prompt: str = None,
     ):
         """
         It revises an entire Markdown file and writes the revised file to the output directory.
@@ -258,6 +265,7 @@ def revise_file(
             output_dir (Path | str): path to the directory where the revised file will be written.
             revision_model (ManuscriptRevisionModel): model to use for revision.
             section_name (str, optional): Defaults to None. If so, it will be inferred from the filename.
+            resolved_prompt (str, optional): A prompt resolved via ai_revision prompt config files, which overrides any custom or section-derived prompts; None if unavailable.
         """
         input_filepath = self.content_dir / input_filename
         assert input_filepath.exists(), f"Input file {input_filepath} does not exist"
@@ -376,7 +384,11 @@ def revise_file(
 
                     # revise and write paragraph to output file
                     self.revise_and_write_paragraph(
-                        paragraph, revision_model, section_name, outfile
+                        paragraph,
+                        revision_model,
+                        section_name,
+                        resolved_prompt=resolved_prompt,
+                        outfile=outfile,
                     )
 
                     # clear the paragraph list
@@ -418,7 +430,11 @@ def revise_file(
             # output file
             if paragraph:
                 self.revise_and_write_paragraph(
-                    paragraph, revision_model, section_name, outfile
+                    paragraph,
+                    revision_model,
+                    section_name,
+                    resolved_prompt=resolved_prompt,
+                    outfile=outfile,
                 )
 
     def revise_manuscript(
@@ -446,14 +462,22 @@ def revise_manuscript(
                 filenames_to_revise = None
 
         for filename in sorted(self.content_dir.glob("*.md")):
-            # ignore front-matter file
-            if "front-matter" in filename.name:
-                continue
-
             filename_section = self.get_section_from_filename(filename.name)
 
-            # we do not process the file if it has no section and there is no custom prompt
-            if filename_section is None and (
+            # use the ai_revision prompt config to attempt to resolve a prompt
+            resolved_prompt, _ = self.prompt_config.get_prompt_for_filename(
+                filename.name
+            )
+
+            # ignore the file if the ai-revision_* config files told us to
+            if resolved_prompt == IGNORE_FILE:
+                continue
+
+            # we do not process the file if all hold:
+            # 1. it has no section
+            # 2. we're unable to resolve a prompt via ai_revision prompt configuration
+            # 3. there is no custom prompt
+            if (filename_section is None and resolved_prompt is None) and (
                 env_vars.CUSTOM_PROMPT not in os.environ
                 or os.environ[env_vars.CUSTOM_PROMPT].strip() == ""
             ):
@@ -472,4 +496,5 @@ def revise_manuscript(
                 output_dir,
                 revision_model,
                 section_name=filename_section,
+                resolved_prompt=resolved_prompt,
             )
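
Note on `revise_manuscript`: the per-file decision introduced in the last hunks can be summarized as a small predicate. This is a sketch under assumptions, not the method itself. The function `should_revise` and the sentinel value are hypothetical, the real `IGNORE_FILE` constant is defined in `prompt_config.py` and not shown here, and the body of the final `if` (assumed to skip the file) falls outside the displayed hunk. Other checks in the method, such as `filenames_to_revise`, are not modeled.

```python
# Hypothetical sentinel; the real value of IGNORE_FILE is not shown in this diff.
IGNORE_FILE = "__IGNORE_FILE__"


def should_revise(filename_section, resolved_prompt, custom_prompt):
    """Mirror the checks visible in the diff for a single file.

    custom_prompt stands in for the value of the custom-prompt
    environment variable, or None if it is not set.
    """
    # The ai-revision_* config files asked for this file to be skipped.
    if resolved_prompt == IGNORE_FILE:
        return False

    has_custom_prompt = custom_prompt is not None and custom_prompt.strip() != ""

    # Assumed: skip when there is no section, no resolved prompt,
    # and no non-empty custom prompt.
    if filename_section is None and resolved_prompt is None and not has_custom_prompt:
        return False

    return True
```

One consequence of this ordering is that an ignore rule wins even when a custom prompt is set in the environment.
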