Preliminary api client and parsl apps to export data and start training task #25

Draft: wants to merge 32 commits into base: develop

Conversation

@rlskoeser (Contributor) commented Nov 5, 2024

in relation to #24

@rlskoeser marked this pull request as draft November 5, 2024 23:05
creation_time: datetime.datetime,
) -> str:
"""Generate URL for an export file generated by :meth:`document_export`"""
# is there any way to get current user id from api based purely on token?
Collaborator

# is there any way to get current user id from api based purely on token?

Yes, the following API URL returns the current user only: https://test-htr.lib.princeton.edu/api/users/current/
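
For reference, a minimal standalone sketch of calling that endpoint with an API token (the Token auth header style and the response field names are assumptions for illustration, not taken from eScriptorium documentation):

    import requests

    BASE_URL = "https://test-htr.lib.princeton.edu/api"
    API_TOKEN = "..."  # hypothetical placeholder; use a real eScriptorium API token

    resp = requests.get(
        f"{BASE_URL}/users/current/",
        headers={"Authorization": f"Token {API_TOKEN}"},
    )
    resp.raise_for_status()
    current_user = resp.json()
    # field names here are assumptions about the serialized user record
    print(current_user.get("pk"), current_user.get("username"))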

@rlskoeser (Contributor Author) commented Nov 6, 2024

Ah, thank you! How did you know that? Is there other documentation / code I should be looking at to understand the API better?

(I see the method in the eScriptorium view code now. I guess I'll have to remember to keep referencing the API views.)

Collaborator

Yes, I was viewing https://test-htr.lib.princeton.edu/api/users/ in the browser (mostly to check whether I was right to think that regular users would only ever see themselves, while admins would see everyone) when I noticed "Current" under the Extra Actions menu. So unfortunately there's no extra documentation I can point you to, but there are some helpful methods we can keep an eye out for!

Contributor Author

Ooh! I hadn't noticed those extra actions yet! That is good to know about.

I see now that there's a document tasks API endpoint; maybe it would be better to use that one for getting the export status...

# WORKFLOW_STATE_ERROR = 2
# WORKFLOW_STATE_DONE = 3
# WORKFLOW_STATE_CANCELED = 4
# add a property?
Collaborator

Potentially. Something like WORKFLOW_STATE_CLUSTER to indicate that a task has been sent to the cluster? I'm assuming we will not be monitoring and passing data back and forth to such an extent as to track when a slurm task is still queued vs when it is running.
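
Whatever states we settle on, a minimal sketch of what such a property could look like on the task dataclass (the QUEUED/STARTED values are assumptions inferred from the numbering of the commented-out constants above, which only show 2-4):

    import datetime
    from dataclasses import dataclass
    from enum import IntEnum


    class WorkflowState(IntEnum):
        # 2-4 come from the commented-out constants above; 0 and 1 are assumed
        QUEUED = 0
        STARTED = 1
        ERROR = 2
        DONE = 3
        CANCELED = 4


    @dataclass
    class Task:
        # trimmed to the fields relevant here
        pk: int
        workflow_state: int
        done_at: datetime.datetime | None = None

        @property
        def state(self) -> WorkflowState:
            """Expose the raw integer as a readable enum member."""
            return WorkflowState(self.workflow_state)

        @property
        def is_finished(self) -> bool:
            """True once the task has stopped, whether it succeeded, failed, or was canceled."""
            return self.state in (
                WorkflowState.ERROR,
                WorkflowState.DONE,
                WorkflowState.CANCELED,
            )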

resp = self._make_request(api_url)
return ResultsList(result_type="model", **resp.json())

def update_model(self, model_id: int, model_file: pathlib.Path):
@cmroughan (Collaborator) commented Nov 11, 2024

This function is overwriting an existing model file in eScriptorium with the newly trained model. I think that should work for when the user clicks the "Overwrite existing model" checkbox. But otherwise, we will want to use a POST method to get a new and separate model into the user's model list.

When POSTing a new model, the required data for it to upload is the model file, its name, and its job (Segment or Recognize). To display properly in eScriptorium we also want its accuracy_percent.

The latter is in the mlmodel file, under kraken_meta and metrics. It will be the last value in the accuracy list (since the mlmodel file tracks the accuracies of all prior epochs as well). I know you can access it via Python using something like the following:

from kraken.lib import vgsl

m = vgsl.TorchVGSLModel.load_model(PATH_TO_MODEL)
# 'accuracy' is a list of (step, accuracy) pairs; take the accuracy from the last entry
m.user_metadata['accuracy'][-1][-1] * 100

or:

import json

import coremltools

m = coremltools.models.MLModel(PATH_TO_MODEL)
# kraken stores its training metadata as JSON in the model spec's userDefined metadata
meta = json.loads(m.get_spec().description.metadata.userDefined['kraken_meta'])
meta['accuracy'][-1][-1] * 100

Unfortunately, when I was testing API model imports and trying to make them display the accuracy, it wasn't working; I think that's because the "accuracy_percent" field is set to read_only in the eScriptorium API...

Collaborator

I think for POSTing a new model we'd want something like this; they're easy enough changes to make. The problem just remains that I can't get accuracy_percent to update. (The upload goes through; it just means there's no recorded accuracy in the model list, which is... not ideal.)

    def upload_model(self, model_file: pathlib.Path, model_name: str, model_job: str):
        """Upload a newly trained model file to the user's list of models."""
        api_url = "models/"
        with open(model_file, "rb") as mfile:
            files = {"file": mfile}
            data = {
                "name": model_name,  # required; get from user
                "job": model_job,  # required; either "Segment" or "Recognize", get from user
                "accuracy_percent": 0.0,  # 0.0 by default; the real value is in the mlmodel file
                # file_size (int): not required when POSTing via the API, filled in automatically
                # versions: does not seem to be used
            }
            resp = self._make_request(api_url, method="POST", files=files, data=data)
        # on successful upload, returns the new model object
        return to_namedtuple("model", resp.json())

"file_format": "alto",
"include_images": True,
"region_types": block_types,
}
Collaborator

Easy enough to add
"parts": parts_list,

And add parts_list as input to the function -- if we are training on (and therefore exporting) the full doc we don't need it, but if we are only doing a selection of images then this will let us handle that.
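
A small sketch of how that could look once parts_list is threaded through (the function and parameter names here are illustrative, not the PR's actual signature):

    def build_export_payload(block_types: list[str], parts_list: list[int] | None = None) -> dict:
        """Assemble the document export payload, optionally limited to selected parts."""
        data = {
            "file_format": "alto",
            "include_images": True,
            "region_types": block_types,
        }
        if parts_list:
            # only send "parts" when exporting a selection of images rather than the whole document
            data["parts"] = parts_list
        return data


    # e.g. exporting only parts 12, 15, and 20
    payload = build_export_payload(["MainZone"], parts_list=[12, 15, 20])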

"""details for a single task"""
api_url = f"tasks/{task_id}/"
resp = self._make_request(api_url)
return to_namedtuple("task", resp.json())
Collaborator

We will also want functions to locate and download model files, if a user desires to train on top of a given model. For transcription, a user can either train on top of a given transcription model or from scratch. For segmentation, a user can either train on top of a given segmentation model or "from scratch", EXCEPT here "from scratch" actually means "on top of the base blla.mlmodel that comes as the default segmenter in kraken". So we need both a way of downloading a selected model and a way of passing the blla.mlmodel along with our data.

Fortunately, given a MODEL_PK, we can simply use BASE_URL/api/models/MODEL_PK/ and then get the URL to download the model in the "file" field. requests.get() should then work just fine.

For blla.mlmodel, we can just grab it from the kraken install, or store a copy somewhere.
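
A rough sketch of the download side, assuming the model detail response exposes the download URL in its "file" field as described and that the file URL accepts the same token header (names are illustrative):

    import pathlib

    import requests


    def download_model(base_url: str, api_token: str, model_pk: int, out_dir: pathlib.Path) -> pathlib.Path:
        """Look up a model record via the API and download its model file."""
        headers = {"Authorization": f"Token {api_token}"}
        resp = requests.get(f"{base_url}/api/models/{model_pk}/", headers=headers)
        resp.raise_for_status()
        # per the comment above, the model record includes the download URL in its "file" field
        file_url = resp.json()["file"]

        out_path = out_dir / file_url.rsplit("/", 1)[-1]
        with requests.get(file_url, headers=headers, stream=True) as download:
            download.raise_for_status()
            with out_path.open("wb") as outfile:
                for chunk in download.iter_content(chunk_size=8192):
                    outfile.write(chunk)
        return out_path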

src/htr2hpc/train.py (outdated comment; resolved)
args.document_id,
args.transcription,
)

Collaborator

Probably add logic like: if args.model exists, then run a prep_model() function taking that MODEL_PK as input and saving the model as model_data or something. That would handle both cases of segmentation and transcription training where the user chooses to finetune on an existing model.

To handle segmentation training "from scratch" (which is actually on top of the default blla.mlmodel), we would want logic like: if args.model does not exist and args.job == 'segmentation', then load in blla.mlmodel instead.
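
Roughly, the branching could look like this (a sketch; args.model, args.job, and prep_model are hypothetical names, with prep_model sketched in a later comment):

    import pathlib
    from argparse import Namespace


    def starting_model(args: Namespace, blla_path: pathlib.Path) -> pathlib.Path | None:
        """Decide which model file, if any, training should start from."""
        if getattr(args, "model", None):
            # user chose to fine-tune a model selected in eScriptorium
            return prep_model(args.model)  # hypothetical helper, sketched below
        if args.job == "segmentation":
            # segmentation "from scratch" really starts from kraken's default blla.mlmodel
            return blla_path
        # transcription training from scratch needs no starting model
        return None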


# return training directory as a parsl file
return File(data_dir)

Collaborator

Should probably also have a prep_model() function, taking as input es_base_url, es_api_token, model_id.
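
A possible shape for it as a parsl python_app (a sketch; the eScriptoriumAPIClient class, its import path, and its method names are hypothetical stand-ins for whatever the api client ends up exposing):

    import pathlib

    from parsl import python_app


    @python_app
    def prep_model(es_base_url: str, es_api_token: str, model_id: int, output_dir: str = "."):
        """Download the selected starting model so it can be passed into a training app."""
        # imports inside the app body so the function serializes cleanly for parsl workers
        from parsl.data_provider.files import File

        from htr2hpc.api_client import eScriptoriumAPIClient  # hypothetical import path

        api = eScriptoriumAPIClient(es_base_url, api_token=es_api_token)
        model_info = api.model_details(model_id)  # hypothetical: returns the model record
        model_path = api.download_file(model_info.file, pathlib.Path(output_dir))  # hypothetical helper
        # return the downloaded model as a parsl File, like the training data prep does
        return File(str(model_path))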



@bash_app
def segtrain(inputs=[], outputs=[]):
Collaborator

We could put both transcription and segmentation training in one function, but we probably want a separate train function for transcription training. I'm laying out the usual eScriptorium train/segtrain commands as they would be formatted in command-line kraken below (a parsl sketch of the corresponding bash apps follows the list).

  • Transcription training from scratch:
    ketos train --epochs 50 -o model --workers HOW_MANY_CPUS -d cuda:0 -f alto {input_data_dir}/*.xml
    (For workers, use the number of CPUs requested in the slurm job.)
  • Transcription refining on another model:
    ketos train --epochs 50 --resize union -i {path_to_starting_model} -o model --workers HOW_MANY_CPUS -d cuda:0 -f alto {input_data_dir}/*.xml
  • Segmentation training "from scratch":
    ketos segtrain --epochs 50 --resize both -i {path_to_blla_mlmodel} -o model --workers HOW_MANY_CPUS -d cuda:0 -f xml {input_data_dir}/*.xml
  • Segmentation refining on another model:
    ketos segtrain --epochs 50 --resize both -i {path_to_starting_model} -o model --workers HOW_MANY_CPUS -d cuda:0 -f xml {input_data_dir}/*.xml
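
Translating those into two parsl bash apps might look roughly like this (a sketch; argument names are illustrative and the flag values are copied from the commands above):

    from parsl import bash_app


    @bash_app
    def train(input_data_dir, workers, model_file=None, inputs=[], outputs=[]):
        """Transcription training: fine-tune model_file when given, otherwise train from scratch."""
        resume = f"--resize union -i {model_file} " if model_file else ""
        return (
            f"ketos train --epochs 50 {resume}-o model "
            f"--workers {workers} -d cuda:0 -f alto {input_data_dir}/*.xml"
        )


    @bash_app
    def segtrain(input_data_dir, workers, model_file, inputs=[], outputs=[]):
        """Segmentation training: model_file is either the user's chosen model or kraken's blla.mlmodel."""
        return (
            f"ketos segtrain --epochs 50 --resize both -i {model_file} "
            f"-o model --workers {workers} -d cuda:0 -f xml {input_data_dir}/*.xml"
        )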

api_url = "users/current/"
return to_namedtuple("user", self._make_request(api_url).json())

def models(self):
Contributor Author

@laurejt I've been thinking it would be better to differentiate related methods more clearly, e.g. list_models, get_model, etc. Would appreciate your thoughts.

Reply:

Yes, that sounds like a good idea. Giving methods names that are more descriptive of their function should hopefully make it easier to use.

Reply:

Maybe get_model_list?

Comment on lines 22 to 23
data_dir = pathlib.Path("training_data")
data_dir.mkdir(exist_ok=True)

Seems reasonable enough, but just wanted to note that if there's any chance a file named training_data already exists, this will raise a FileExistsError.
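
For illustration, a defensive variant of the quoted lines (a sketch, not the PR's code):

    import pathlib

    data_dir = pathlib.Path("training_data")
    if data_dir.exists() and not data_dir.is_dir():
        # a stray *file* named training_data would make mkdir(exist_ok=True) raise FileExistsError
        raise SystemExit(f"{data_dir} exists but is not a directory")
    data_dir.mkdir(exist_ok=True)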

src/htr2hpc/train_apps.py (outdated comment; resolved)
}


def to_namedtuple(name: str, data: any):

Perhaps move this to the top of the file, since ResultsList also depends on to_namedtuple.

api_url = "users/current/"
return to_namedtuple("user", self._make_request(api_url).json())

def models(self):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds like a good idea. Giving methods names that are more descriptive of their function should hopefully make it easier to use.

from collections import namedtuple
from dataclasses import dataclass
from time import sleep
import pathlib

Perhaps only import Path.

Suggested change:
- import pathlib
+ from pathlib import Path

# this will match the completion time of the task...
base_filename = "export_doc%d_%s_%s_%s" % (
document_id,
slugify(document_name).replace("-", "_")[:32],

Can you explain this in more detail? Using slugify seems like overkill; at the very least it's worth confirming which characters are being removed and/or replaced.
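
For reference, assuming Django's slugify (eScriptorium itself is Django code; a standalone client might use the similar python-slugify instead), the transformation looks like this:

    from django.utils.text import slugify

    # slugify lowercases, strips characters that aren't alphanumerics/underscores/hyphens/whitespace,
    # and collapses runs of whitespace and hyphens into single hyphens
    slugify("Naima's Notebook, vol. 2")                          # -> "naimas-notebook-vol-2"
    slugify("Naima's Notebook, vol. 2").replace("-", "_")[:32]   # -> "naimas_notebook_vol_2"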

# import.export.BaseExporter logic
# NOTE2: escriptorium code uses datetime.now() so there's no guarantee
# this will match the completion time of the task...
base_filename = "export_doc%d_%s_%s_%s" % (

Consider replacing with an f-string

Comment on lines +286 to +287
# TODO: confirm correct task id based on method (export) and document name
export_task = task_list.results[0]

Do you only need the first item from the results list?

Comment on lines +40 to +50
pk: int
document: int
document_part: int
workflow_state: int
label: str
messages: str
queued_at: datetime.datetime
started_at: datetime.datetime
done_at: datetime.datetime
method: str
user: int

Perhaps worth some comments, especially for the cases where these are identifiers rather than contents or filenames.

Reply:

It seems like it might be worth renaming document to document_id

Comment on lines 45 to 47
parser.add_argument(
"-p", "--parts", help="Optional list of parts to export (if not entire document)"
)

I'm guessing this component hasn't been worked out yet, but it may use nargs="+".
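
A quick self-contained sketch of that suggestion, assuming parts are passed as integer ids:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-p", "--parts",
        nargs="+",      # accept one or more values after the flag
        type=int,
        help="Optional list of parts to export (if not entire document)",
    )

    args = parser.parse_args(["--parts", "12", "15", "20"])
    print(args.parts)  # [12, 15, 20]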
