Workflow-specific performance metrics #75

Open · LuiggiTenorioK opened this issue May 6, 2024 · 13 comments

@LuiggiTenorioK (Member)

In GitLab by @mcastril on May 6, 2024, 10:55

There are some performance metrics that cannot be easily calculated, in a homogeneous way across different workflows, from the parameters that Autosubmit currently stores in the DDBB. Some examples are the Coupling Cost or the Complexity. In the past, we agreed to enable a mechanism by which the workflow would be responsible for generating this data, which Autosubmit/Autosubmit-API would then expose as an endpoint in the API, in whatever way we decide.

For instance, the workflow can provide a YAML file written in a predetermined path (we could make it fixed or have a parameter in the configuration to allow the users to change this path) that the Autosubmit API would check when the endpoint is reached.
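For illustration, a workflow job could do something like the sketch below at the end of its run; the file path and the metric names here are placeholders, not an agreed format:

# Hypothetical sketch: a workflow job dumps its workflow-specific metrics to a
# predetermined path that the Autosubmit API could later read. The path and the
# metric names below are assumptions, not an agreed format.
import yaml

metrics = {
    "coupling_cost": 1.15,  # value computed by the workflow itself
    "complexity": 3,        # value computed by the workflow itself
}

with open("/path/to/expid/metrics.yml", "w") as metrics_file:  # predetermined path (placeholder)
    yaml.safe_dump(metrics, metrics_file)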

As we are moving to a different DDBB backend and trying to make Autosubmit less dependent on shared files in the filesystem, I understand this can be an issue. We could decide whether it's worth having a field in the DDBB where Autosubmit stores these additional metrics in bulk (then it would be Autosubmit, and not the API, that consumes the workflow file).

This development has not moved forward in the last few years due to the lack of a real necessity, but it's a requirement from DestinE in Phase 2 (August).

Performance metrics: https://docs.google.com/document/d/12yWDwXsohf4G4MPeP6e3Eil4ZL-YeIN71dBcoWRliEg/edit

Previous decisions about how to implement this:

https://earth.bsc.es/gitlab/es/autosubmit/-/issues/674#note_160254
https://earth.bsc.es/gitlab/es/autosubmit/-/issues/524#note_90901

CC @kinow @dbeltrankyl

@LuiggiTenorioK (Member Author)

In GitLab by @kinow on May 6, 2024, 11:00

@mcastril just to check on this part

This development has not moved forward in the last few years due to the lack of a real necessity, but it's a requirement from DestinE in Phase 2 (August).

August, 2024, right?

Thanks for the details in the issue description!

@LuiggiTenorioK (Member Author)

In the past, we agreed to enable a mechanism by which the workflow would be responsible for generating this data, which Autosubmit/Autosubmit-API would then expose as an endpoint in the API, in whatever way we decide.

For instance, the workflow can provide a YAML file written in a predetermined path (we could make it fixed or have a parameter in the configuration to allow the users to change this path) that the Autosubmit API would check when the endpoint is reached.

+1. I think this feature could be generalized further. @kinow once mentioned something about having user-defined metadata fields in the workflows that can be shown in the API/GUI as well. This metadata could include fields that are values or references (paths) to the content of another file.

We could decide whether it's worth having a field in the DDBB where Autosubmit stores these additional metrics in bulk (then it would be Autosubmit, and not the API, that consumes the workflow file).

I think we must have the metrics, metadata, or source data to calculate them stored in the DDBB. This will allow us to have a historical trace of the metrics per run.

@LuiggiTenorioK (Member Author)

Issue about the metadata feature: https://earth.bsc.es/gitlab/es/autosubmit-gui/-/issues/99

@LuiggiTenorioK (Member Author)

In GitLab by @mcastril on May 6, 2024, 11:31

Thank you for the positive feedback, Luiggi.

This metadata could include fields that are values or references (paths) to the content of another file.

Concerning this, if the fields are references to actual data or metadata, maybe the field itself should be treated as data (an ordinary parameter) instead of metadata.

LuiggiTenorioK self-assigned this Nov 12, 2024
@mcastril commented Dec 3, 2024

We need to pick this issue up again.

Dev has an example of a new metric to implement.

https://earth.bsc.es/gitlab/digital-twins/de_340-2/workflow/-/issues/745#note_325880

For this kind of case, we have different options:

  • Implement a function (code) in the workflow and make the API run this code to generate the information at any time. This may be easier when the metric only depends on the configuration of the experiment, as in the example. However, when it depends on data, it may be difficult if the API has no access to this data, as in most cases.

  • Let the workflow function generate a JSON file and the API consume this JSON file.

We could also admit both approaches if they are well documented.

We also need a way to indicate that this information is a Performance Metric, and to implement the logic in the GUI so that it also displays this info.

@LuiggiTenorioK (Member Author)

  • Implement a function (code) in the workflow and make the API run this code to generate the information at any time. This may be easier when the metric only depends on the configuration of the experiment, as in the example. However, when it depends on data, it may be difficult if the API has no access to this data, as in most cases.

I think this might be outside the scope of the API. If the execution is done in the same context as the API, it will open the door to many vulnerabilities, such as code injection.

  • Let the workflow function generate a JSON file and the API consume this JSON file.

We also need a way to indicate that this information is a Performance Metric, and to implement the logic in the GUI so that it also displays this info.

I think this is the safer option. Other workflow managers, like Galaxy, usually define their outputs at the workflow or job level. Maybe we can define them in the configuration YAMLs like:

EXPERIMENT:
  OUTPUTS:
    - PATH: %ROOTDIR%/outputs/output.json
      NAME: metrics_file
      LABELS:
        type: performance
        model: NEMO

Then, what is defined there can be tracked and shown in the GUI in a "file explorer"-like view. Also, as a safety measure, I'd suggest only allowing files below the <expid>/outputs/ directory or similar.

@LuiggiTenorioK (Member Author)

With the current state of the API, I think it is possible to get that data if it is printed to the stdout of one of the jobs:

(screenshot of the API response containing the job output log)

The only condition is that the data should be within the latest 150 lines of the output log (unfortunately hardcoded). Then, it should be obtainable by requesting that endpoint and using a regex.
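A rough sketch of that idea, assuming a hypothetical endpoint URL, response shape, and log line format (e.g. a job printing COUPLING_COST=<value>); none of these are the real API routes:

# Hypothetical sketch: fetch the tail of a job's output log through the API and
# extract a metric with a regex. The base URL, route, response shape, and the
# expected log line ("COUPLING_COST=<value>") are all assumptions.
import re
import requests

API_URL = "https://autosubmit-api.example/v4"  # placeholder base URL
response = requests.get(f"{API_URL}/experiments/a000/jobs/a000_SIM/output")  # hypothetical route
log_lines = response.json().get("content", [])  # assumed response shape

pattern = re.compile(r"COUPLING_COST=(\d+(?:\.\d+)?)")
for line in log_lines[-150:]:  # only the latest 150 lines are exposed
    match = pattern.search(line)
    if match:
        print("Coupling Cost:", float(match.group(1)))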

@kinow (Member) commented Dec 4, 2024

I think this is the safer option. Other workflow managers, like Galaxy, usually define their outputs at the workflow or job level. Maybe we can define them in the configuration YAMLs like:

Having outputs defined like this would be useful for RO-Crate too (we define outputs and inputs there, but under ROCRATE as these are not explicit in Autosubmit's config).

The only condition is that the data should be within the latest 150 lines of the output log (unfortunately hardcoded). Then, it should be obtainable by requesting that endpoint and using a regex.

I think this is the fastest, although probably a bit brittle. If we need it asap, then +1 for trying this, as long as we warn it's a temporary solution.

Otherwise, I'd say some brainstorming sessions around the EXPERIMENT.OUTPUTS or something else based on what @mcastril suggested would be the way forward 👍

@mcastril commented Dec 4, 2024

I agree with the approach, but I am more in favour of asking for a precise format for the performance metrics (one self-contained file with the information in JSON, for example, and nothing else), instead of parsing it from the .out file, since the .out may contain other information coming from the scheduler, the parallel environment, or the script itself.

@LuiggiTenorioK (Member Author)

Following the newly proposed feature, the user flow might be like this:

  1. The user specifies an experiment-level (or should it be job-level?) output file:

EXPERIMENT:
  OUTPUTS:
    - PATH: output.json
      NAME: metrics_file
      CONTENT:
        TYPE: JSON
      LABELS:
        type: performance
        model: NEMO

The idea is that outputs should be written to the %ROOTDIR%/<expid>/outputs/ directory, so a check should be implemented to prevent access to restricted resources (see the sketch after this list).

  2. Then, it is the workflow developer's responsibility that at least one job writes to these defined paths. In the above example, the file %ROOTDIR%/<expid>/outputs/output.json should be generated. Note: it is expected that these files will not be very big.
  3. The API will list the outputs at the GET /v4/experiments/<expid>/outputs endpoint, and content/metadata will be retrieved using the GET /v4/experiments/<expid>/outputs/<output_name> endpoint. The metadata should infer the content type if not specified by the user.
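A minimal sketch of the check mentioned in step 1, assuming the <expid>/outputs/ layout above (the function name and error handling are made up for illustration):

# Hypothetical sketch of the safety check: only accept declared output paths that
# resolve to somewhere below <expid>/outputs/. Function name and layout are
# assumptions for illustration only.
from pathlib import Path

def resolve_output_path(rootdir: str, expid: str, declared_path: str) -> Path:
    outputs_dir = (Path(rootdir) / expid / "outputs").resolve()
    candidate = (outputs_dir / declared_path).resolve()
    # Rejects absolute paths and ".." components that escape the outputs directory
    # (Path.is_relative_to requires Python 3.9+).
    if not candidate.is_relative_to(outputs_dir):
        raise ValueError(f"Output path escapes {outputs_dir}: {declared_path}")
    return candidate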

Now, some points to discuss are:

  • Is it preferred that these outputs be defined at the experiment level or the job level?
  • Should the outputs be versioned by run? Maybe we can have %ROOTDIR%/<expid>/outputs/<run_id>/ directories for that instead. However, this could be complex, since the job should always know the run_id and be responsible for writing to that exact path. Alternatively, it will be complex on our side if we have to move the file to the versioned directory, since we will need to know when the file is ready to be moved.
  • This feature is aimed at giving file-like outputs to the users, which will allow them to have images or reports as outputs. Do you think this might be overkill? Maybe we can constrain it to just string outputs that can be directly retrieved from TXT or JSON files and keep it extensible for the future.

@mcastril

The user specifies an experiment-level (or should it be job-level?) output file:
Is it preferred that these outputs be defined at the experiment level or the job level?

Good point. May we have both options?

Should the outputs be versioned by run? Maybe we can have %ROOTDIR%/<expid>/outputs/<run_id>/ directories for that instead. However, this could be complex, since the job should always know the run_id and be responsible for writing to that exact path. Alternatively, it will be complex on our side if we have to move the file to the versioned directory, since we will need to know when the file is ready to be moved.

We can start with a simple, non-versioned implementation and only implement versioning if it's needed.

This feature is aimed at giving file-like outputs to the users, which will allow them to have images or reports as outputs. Do you think this might be overkill? Maybe we can constrain it to just string outputs that can be directly retrieved from TXT or JSON files and keep it extensible for the future.

I would only add TXT and JSON capability for now. We should keep it simple for this first version, although all these are very good points.

Importantly, I understand that the API will need all this information locally. Then, Autosubmit should know that it has to retrieve these files, as it does with the logs. Regarding the formats and sizes, Autosubmit should have checks for those, to ensure it does not copy large files from the remote.
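A minimal sketch of such a check, run before copying a file from the remote; the extension whitelist and the size cap are made-up values, not agreed limits:

# Hypothetical sketch: decide whether an output file is worth retrieving from the
# remote, based on its extension and size. The allowed extensions and the size
# limit are assumptions, not agreed values.
import os

ALLOWED_EXTENSIONS = {".txt", ".json"}
MAX_SIZE_BYTES = 1 * 1024 * 1024  # e.g. a 1 MiB cap to avoid copying large files

def should_retrieve(remote_path: str, remote_size_bytes: int) -> bool:
    _, extension = os.path.splitext(remote_path)
    if extension.lower() not in ALLOWED_EXTENSIONS:
        return False
    return remote_size_bytes <= MAX_SIZE_BYTES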

@LuiggiTenorioK (Member Author)

Good point. May we have both options?

Maybe we can start at the experiment level, since the number of jobs can be potentially high and this could lead us to unwanted issues depending on the final implementation.

We can start with a simple, non-versioned implementation and only implement versioning if it's needed.

I would only add TXT and JSON capability for now. We should keep it simple for this first version, although all these are very good points.

+1

Importantly, I understand that the API will need all this information locally. Then, Autosubmit should know that it has to retrieve these files, as it does with the logs. Regarding the formats and sizes, Autosubmit should have checks for those, to ensure it does not copy large files from the remote.

This is another good point. Should Autosubmit be responsible for moving these files from remote to local? Then, we will need to implement this in Autosubmit too.

@mcastril

Maybe we can start at the experiment level, since the number of jobs can be potentially high and this could lead us to unwanted issues depending on the final implementation.

We can start with the experiment level, but I think we'll need the job level soon. You can see that CSC is already proposing a metric that makes more sense at that aggregation level, since it's not static and the information can vary between jobs.

This is another good point. Should Autosubmit be responsible for moving these files from remote to local? Then, we will need to implement this in Autosubmit too.

I think so. It should be part of the log retrieval mechanism.

If the number of files is an issue, it should be an issue for Autosubmit. If we are concerned about the number of files that the API would have to check, then we should only work with files at the Autosubmit level. Then Autosubmit could insert this information into the DDBB for the API.

I guess that we will need a new column (in which we can store JSON outputs as the value, using an experiment/job hierarchy, since the number of metrics can vary), unless you want new tables for this.
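A minimal sketch of that idea, assuming a SQLite backend and a dedicated table (the table and column names here are made up):

# Hypothetical sketch: a table that stores metrics as JSON values keyed by
# experiment, job, and run. Table and column names are assumptions.
import json
import sqlite3

conn = sqlite3.connect("metrics.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS experiment_metrics (
        expid   TEXT NOT NULL,
        job     TEXT,              -- NULL for experiment-level metrics
        run_id  INTEGER,
        metrics TEXT NOT NULL      -- JSON payload, since the set of metrics can vary
    )
    """
)
conn.execute(
    "INSERT INTO experiment_metrics (expid, job, run_id, metrics) VALUES (?, ?, ?, ?)",
    ("a000", "a000_SIM", 1, json.dumps({"coupling_cost": 1.15})),
)
conn.commit()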
