Preliminary api client and parsl apps to export data and start training task #25

Draft: wants to merge 32 commits into base: develop

Conversation

@rlskoeser (Contributor) commented Nov 5, 2024

in relation to #24

@rlskoeser marked this pull request as draft November 5, 2024 23:05
creation_time: datetime.datetime,
) -> str:
"""Generate URL for an export file generated by :meth:`document_export`"""
# is there any way to get current user id from api based purely on token?
Collaborator

# is there any way to get current user id from api based purely on token?

Yes, the following API URL returns the current user only: https://test-htr.lib.princeton.edu/api/users/current/
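
For reference, a minimal standalone sketch of calling that endpoint with an API token (the Token auth header style and the response field names are assumptions for illustration, not taken from eScriptorium documentation):

    import requests

    BASE_URL = "https://test-htr.lib.princeton.edu/api"
    API_TOKEN = "..."  # hypothetical placeholder; use a real eScriptorium API token

    resp = requests.get(
        f"{BASE_URL}/users/current/",
        headers={"Authorization": f"Token {API_TOKEN}"},
    )
    resp.raise_for_status()
    current_user = resp.json()
    # field names here are assumptions about the serialized user record
    print(current_user.get("pk"), current_user.get("username"))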

@rlskoeser (Contributor Author) commented Nov 6, 2024

Ah, thank you! How did you know that? Is there other documentation / code I should be looking at to understand the API better?

(I see the method in the eScriptorium view code now. I guess I'll have to remember to keep referencing the API views.)

Collaborator

Yes, I was viewing https://test-htr.lib.princeton.edu/api/users/ in the browser (mostly to check whether I was right to think that regular users would only ever see themselves, while admins would see everyone) when I noticed "Current" under the Extra Actions menu. So unfortunately there's no extra documentation I can point you to, but there are some helpful methods we can keep an eye out for!

Contributor Author

Ooh! I hadn't noticed those extra actions yet! That is good to know about.

I see now that there's a document tasks API endpoint; maybe it would be better to use that one for getting the export status...

# WORKFLOW_STATE_ERROR = 2
# WORKFLOW_STATE_DONE = 3
# WORKFLOW_STATE_CANCELED = 4
# add a property?
Collaborator

Potentially. Something like WORKFLOW_STATE_CLUSTER to indicate that a task has been sent to the cluster? I'm assuming we will not be monitoring and passing data back and forth to such an extent as to track when a slurm task is still queued vs when it is running.
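
Whatever states we settle on, a minimal sketch of what such a property could look like on the task dataclass (the QUEUED/STARTED values are assumptions inferred from the numbering of the commented-out constants above, which only show 2-4):

    import datetime
    from dataclasses import dataclass
    from enum import IntEnum


    class WorkflowState(IntEnum):
        # 2-4 come from the commented-out constants above; 0 and 1 are assumed
        QUEUED = 0
        STARTED = 1
        ERROR = 2
        DONE = 3
        CANCELED = 4


    @dataclass
    class Task:
        # trimmed to the fields relevant here
        pk: int
        workflow_state: int
        done_at: datetime.datetime | None = None

        @property
        def state(self) -> WorkflowState:
            """Expose the raw integer as a readable enum member."""
            return WorkflowState(self.workflow_state)

        @property
        def is_finished(self) -> bool:
            """True once the task has stopped, whether it succeeded, failed, or was canceled."""
            return self.state in (
                WorkflowState.ERROR,
                WorkflowState.DONE,
                WorkflowState.CANCELED,
            )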

resp = self._make_request(api_url)
return ResultsList(result_type="model", **resp.json())

def update_model(self, model_id: int, model_file: pathlib.Path):
@cmroughan (Collaborator) commented Nov 11, 2024

This function is overwriting an existing model file in eScriptorium with the newly trained model. I think that should work for when the user clicks the "Overwrite existing model" checkbox. But otherwise, we will want to use a POST method to get a new and separate model into the user's model list.

When POSTing a new model, the required data for it to upload is the model file, its name, and its job (Segment or Recognize). To display properly in eScriptorium we also want its accuracy_percent.

The latter is in the mlmodel file, under kraken_meta and metrics. It will be the last value in the accuracy list (since the mlmodel file tracks the accuracies of all prior epochs as well). I know you can access it via Python using something like the following:

from kraken.lib import vgsl

m = vgsl.TorchVGSLModel.load_model(PATH_TO_MODEL)
# 'accuracy' is a list of (step, accuracy) pairs; take the accuracy from the last entry
m.user_metadata['accuracy'][-1][-1] * 100

or:

import json

import coremltools

m = coremltools.models.MLModel(PATH_TO_MODEL)
# kraken stores its training metadata as JSON in the model spec's userDefined metadata
meta = json.loads(m.get_spec().description.metadata.userDefined['kraken_meta'])
meta['accuracy'][-1][-1] * 100

Unfortunately, when I was testing API model imports and trying to make them display the accuracy, it wasn't working; I think that's because the "accuracy_percent" field is set to read_only in the eScriptorium API...

Collaborator

I think for POSTing a new model we'd want something like this; they're easy enough changes to make. The problem just remains that I can't get accuracy_percent to update. (The upload goes through; it just means there's no recorded accuracy in the model list, which is... not ideal.)

    def upload_model(self, model_file: pathlib.Path, model_name: str, model_job: str):
        """Upload a newly trained model file to the user's list of models."""
        api_url = "models/"
        with open(model_file, "rb") as mfile:
            files = {"file": mfile}
            data = {
                "name": model_name,  # required; get from user
                "job": model_job,  # required; either "Segment" or "Recognize", get from user
                "accuracy_percent": 0.0,  # 0.0 by default; the real value is in the mlmodel file
                # file_size (int): not required when POSTing via the API, filled in automatically
                # versions: does not seem to be used
            }
            resp = self._make_request(api_url, method="POST", files=files, data=data)
        # on successful upload, returns the new model object
        return to_namedtuple("model", resp.json())

"file_format": "alto",
"include_images": True,
"region_types": block_types,
}
Collaborator

Easy enough to add
"parts": parts_list,

And add parts_list as input to the function -- if we are training on (and therefore exporting) the full doc we don't need it, but if we are only doing a selection of images then this will let us handle that.
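
A small sketch of how that could look once parts_list is threaded through (the function and parameter names here are illustrative, not the PR's actual signature):

    def build_export_payload(block_types: list[str], parts_list: list[int] | None = None) -> dict:
        """Assemble the document export payload, optionally limited to selected parts."""
        data = {
            "file_format": "alto",
            "include_images": True,
            "region_types": block_types,
        }
        if parts_list:
            # only send "parts" when exporting a selection of images rather than the whole document
            data["parts"] = parts_list
        return data


    # e.g. exporting only parts 12, 15, and 20
    payload = build_export_payload(["MainZone"], parts_list=[12, 15, 20])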

"""details for a single task"""
api_url = f"tasks/{task_id}/"
resp = self._make_request(api_url)
return to_namedtuple("task", resp.json())
Collaborator

We will also want functions to locate and download model files, if a user desires to train on top of a given model. For transcription, a user can either train on top of a given transcription model or from scratch. For segmentation, a user can either train on top of a given segmentation model or "from scratch", EXCEPT here "from scratch" actually means "on top of the base blla.mlmodel that comes as the default segmenter in kraken". So we need both a way of downloading a selected model and a way of passing the blla.mlmodel along with our data.

Fortunately, given a MODEL_PK, we can simply use BASE_URL/api/models/MODEL_PK/ and then get the URL to download the model in the "file" field. requests.get() should then work just fine.

For blla.mlmodel, we can just grab it from the kraken install, or store a copy somewhere.
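
A rough sketch of the download side, assuming the model detail response exposes the download URL in its "file" field as described and that the file URL accepts the same token header (names are illustrative):

    import pathlib

    import requests


    def download_model(base_url: str, api_token: str, model_pk: int, out_dir: pathlib.Path) -> pathlib.Path:
        """Look up a model record via the API and download its model file."""
        headers = {"Authorization": f"Token {api_token}"}
        resp = requests.get(f"{base_url}/api/models/{model_pk}/", headers=headers)
        resp.raise_for_status()
        # per the comment above, the model record includes the download URL in its "file" field
        file_url = resp.json()["file"]

        out_path = out_dir / file_url.rsplit("/", 1)[-1]
        with requests.get(file_url, headers=headers, stream=True) as download:
            download.raise_for_status()
            with out_path.open("wb") as outfile:
                for chunk in download.iter_content(chunk_size=8192):
                    outfile.write(chunk)
        return out_path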

src/htr2hpc/train.py (outdated comment; resolved)
args.document_id,
args.transcription,
)

Collaborator

Probably add logic like: if args.model exists, then run a prep_model() function taking that MODEL_PK as input and saving the model as model_data or something. That would handle both cases of segmentation and transcription training where the user chooses to finetune on an existing model.

To handle segmentation training "from scratch" (which is actually on top of the default blla.mlmodel), we would want logic like: if args.model does not exist and args.job == 'segmentation', then load in blla.mlmodel instead.
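
Roughly, the branching could look like this (a sketch; args.model, args.job, and prep_model are hypothetical names, with prep_model sketched in a later comment):

    import pathlib
    from argparse import Namespace


    def starting_model(args: Namespace, blla_path: pathlib.Path) -> pathlib.Path | None:
        """Decide which model file, if any, training should start from."""
        if getattr(args, "model", None):
            # user chose to fine-tune a model selected in eScriptorium
            return prep_model(args.model)  # hypothetical helper, sketched below
        if args.job == "segmentation":
            # segmentation "from scratch" really starts from kraken's default blla.mlmodel
            return blla_path
        # transcription training from scratch needs no starting model
        return None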


# return training directory as a parsl file
return File(data_dir)

Collaborator

Should probably also have a prep_model() function, taking as input es_base_url, es_api_token, model_id.
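
A possible shape for it as a parsl python_app (a sketch; the eScriptoriumAPIClient class, its import path, and its method names are hypothetical stand-ins for whatever the api client ends up exposing):

    import pathlib

    from parsl import python_app


    @python_app
    def prep_model(es_base_url: str, es_api_token: str, model_id: int, output_dir: str = "."):
        """Download the selected starting model so it can be passed into a training app."""
        # imports inside the app body so the function serializes cleanly for parsl workers
        from parsl.data_provider.files import File

        from htr2hpc.api_client import eScriptoriumAPIClient  # hypothetical import path

        api = eScriptoriumAPIClient(es_base_url, api_token=es_api_token)
        model_info = api.model_details(model_id)  # hypothetical: returns the model record
        model_path = api.download_file(model_info.file, pathlib.Path(output_dir))  # hypothetical helper
        # return the downloaded model as a parsl File, like the training data prep does
        return File(str(model_path))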



@bash_app
def segtrain(inputs=[], outputs=[]):
Collaborator

We could put both transcription and segmentation training in one function, but we probably want a separate train function for transcription training. I'm laying out the usual eScriptorium train/segtrain commands as they would be formatted in command-line kraken below (a parsl sketch of the corresponding bash apps follows the list).

  • Transcription training from scratch:
    ketos train --epochs 50 -o model --workers HOW_MANY_CPUS -d cuda:0 -f alto {input_data_dir}/*.xml
    (For workers, use the number of CPUs requested in the slurm job.)
  • Transcription refining on another model:
    ketos train --epochs 50 --resize union -i {path_to_starting_model} -o model --workers HOW_MANY_CPUS -d cuda:0 -f alto {input_data_dir}/*.xml
  • Segmentation training "from scratch":
    ketos segtrain --epochs 50 --resize both -i {path_to_blla_mlmodel} -o model --workers HOW_MANY_CPUS -d cuda:0 -f xml {input_data_dir}/*.xml
  • Segmentation refining on another model:
    ketos segtrain --epochs 50 --resize both -i {path_to_starting_model} -o model --workers HOW_MANY_CPUS -d cuda:0 -f xml {input_data_dir}/*.xml
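
Translating those into two parsl bash apps might look roughly like this (a sketch; argument names are illustrative and the flag values are copied from the commands above):

    from parsl import bash_app


    @bash_app
    def train(input_data_dir, workers, model_file=None, inputs=[], outputs=[]):
        """Transcription training: fine-tune model_file when given, otherwise train from scratch."""
        resume = f"--resize union -i {model_file} " if model_file else ""
        return (
            f"ketos train --epochs 50 {resume}-o model "
            f"--workers {workers} -d cuda:0 -f alto {input_data_dir}/*.xml"
        )


    @bash_app
    def segtrain(input_data_dir, workers, model_file, inputs=[], outputs=[]):
        """Segmentation training: model_file is either the user's chosen model or kraken's blla.mlmodel."""
        return (
            f"ketos segtrain --epochs 50 --resize both -i {model_file} "
            f"-o model --workers {workers} -d cuda:0 -f xml {input_data_dir}/*.xml"
        )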

api_url = "users/current/"
return to_namedtuple("user", self._make_request(api_url).json())

def models(self):
Contributor Author

@laurejt I've been thinking it would be better to differentiate related methods more clearly, e.g. list_models, get_model, etc. Would appreciate your thoughts.

Reply:

Yes, that sounds like a good idea. Giving methods names that are more descriptive of their function should hopefully make it easier to use.

Reply:

Maybe get_model_list?

Comment on lines 22 to 23
data_dir = pathlib.Path("training_data")
data_dir.mkdir(exist_ok=True)

Seems reasonable enough, but just wanted to note that if there's any chance a file named training_data already exists, this will raise a FileExistsError.
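
For illustration, a defensive variant of the quoted lines (a sketch, not the PR's code):

    import pathlib

    data_dir = pathlib.Path("training_data")
    if data_dir.exists() and not data_dir.is_dir():
        # a stray *file* named training_data would make mkdir(exist_ok=True) raise FileExistsError
        raise SystemExit(f"{data_dir} exists but is not a directory")
    data_dir.mkdir(exist_ok=True)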

src/htr2hpc/train_apps.py (outdated comment; resolved)
}


def to_namedtuple(name: str, data: any):

Perhaps move this to the top of the file, since ResultsList also depends on to_namedtuple.

api_url = "users/current/"
return to_namedtuple("user", self._make_request(api_url).json())

def models(self):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds like a good idea. Giving methods names that are more descriptive of their function should hopefully make it easier to use.

from collections import namedtuple
from dataclasses import dataclass
from time import sleep
import pathlib

Perhaps only import Path.

Suggested change:
- import pathlib
+ from pathlib import Path

# this will match the completion time of the task...
base_filename = "export_doc%d_%s_%s_%s" % (
document_id,
slugify(document_name).replace("-", "_")[:32],

Can you explain this in more detail? Using slugify seems like overkill; at the very least it's worth confirming which characters are being removed and/or replaced.
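
For reference, assuming Django's slugify (eScriptorium itself is Django code; a standalone client might use the similar python-slugify instead), the transformation looks like this:

    from django.utils.text import slugify

    # slugify lowercases, strips characters that aren't alphanumerics/underscores/hyphens/whitespace,
    # and collapses runs of whitespace and hyphens into single hyphens
    slugify("Naima's Notebook, vol. 2")                          # -> "naimas-notebook-vol-2"
    slugify("Naima's Notebook, vol. 2").replace("-", "_")[:32]   # -> "naimas_notebook_vol_2"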

# import.export.BaseExporter logic
# NOTE2: escriptorium code uses datetime.now() so there's no guarantee
# this will match the completion time of the task...
base_filename = "export_doc%d_%s_%s_%s" % (

Consider replacing with an f-string

Comment on lines +286 to +287
# TODO: confirm correct task id based on method (export) and document name
export_task = task_list.results[0]

Do you only need the first item from the results list?

Comment on lines +40 to +50
pk: int
document: int
document_part: int
workflow_state: int
label: str
messages: str
queued_at: datetime.datetime
started_at: datetime.datetime
done_at: datetime.datetime
method: str
user: int

Perhaps worth some comments, especially for the cases where these are identifiers rather than contents or filenames.

Reply:

It seems like it might be worth renaming document to document_id

Comment on lines 45 to 47
parser.add_argument(
"-p", "--parts", help="Optional list of parts to export (if not entire document)"
)

I'm guessing this component hasn't been worked out yet, but it may use nargs="+".
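
A quick self-contained sketch of that suggestion, assuming parts are passed as integer ids:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-p", "--parts",
        nargs="+",      # accept one or more values after the flag
        type=int,
        help="Optional list of parts to export (if not entire document)",
    )

    args = parser.parse_args(["--parts", "12", "15", "20"])
    print(args.parts)  # [12, 15, 20]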
