Workflow-specific performance metrics #75
Comments
In GitLab by @kinow on May 6, 2024, 11:00

@mcastril just to check on this part: August, 2024, right? Thanks for the details in the issue description!
+1. I think this feature could be more generalized. @kinow once mentioned something about having user-defined metadata fields in the workflows that can be shown in the API/GUI as well. This metadata could include fields that are values or references (paths) to the content of another file.
I think we must have the metrics, metadata, or the source data to calculate them stored in the DDBB. This will allow us to have a historical trace of the metrics per run.

Issue about the metadata feature: https://earth.bsc.es/gitlab/es/autosubmit-gui/-/issues/99
In GitLab by @mcastril on May 6, 2024, 11:31

Thank you for the positive feedback, Luiggi.

Concerning this, if the fields are references to actual data or metadata, maybe the field itself should be treated as data (an ordinary parameter) instead of metadata.
We have to recover this issue. Dev has an example of a new metric to implement: https://earth.bsc.es/gitlab/digital-twins/de_340-2/workflow/-/issues/745#note_325880 For these kinds of cases we have different options: …

We could also admit both approaches if they are well documented. We also need a way to indicate that this information is a Performance Metric, and to implement the logic in the GUI so that it also displays this info.
I think this might be outside the scope of the API. If the execution is done in the same context as the API, it will be open to many vulnerabilities, including code injection.
I think this is the safer option. Other workflow managers, such as Galaxy, usually define their outputs at the workflow or job level. Maybe we can define them in the configuration YAMLs like:

```yaml
EXPERIMENT:
  OUTPUTS:
    - PATH: %ROOTDIR%/outputs/output.json
      NAME: metrics_file
      LABELS:
        type: performance
        model: NEMO
```

Then, what is defined there can be tracked and shown in the GUI in a "file explorer"-like view. Also, as a safety measure, I'd suggest only allowing files below the …
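A minimal sketch of that safety measure, assuming a hypothetical `is_below_rootdir` helper and an invented experiment path; none of this is existing Autosubmit code:

```python
import os

def is_below_rootdir(rootdir: str, candidate: str) -> bool:
    """Return True only if `candidate` resolves to a path inside `rootdir`,
    guarding against traversal like `../../etc/passwd` or escaping symlinks."""
    root = os.path.realpath(rootdir)
    target = os.path.realpath(os.path.join(root, candidate))
    return os.path.commonpath([root, target]) == root

# Hypothetical experiment directory, for illustration only.
assert is_below_rootdir("/appl/autosubmit/a000", "outputs/output.json")
assert not is_below_rootdir("/appl/autosubmit/a000", "../a001/conf/secrets.yml")
```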
With the current state of the API, I think it is possible to get that data if it is printed in the stdout of a job. The only condition is that the data must appear in the last 150 lines of the output log (unfortunately hardcoded). Then it can be obtained by requesting that endpoint and using a regex.
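For illustration, a rough sketch of that scraping approach; the metric line format is made up, and this is not the API's actual code:

```python
import re

# Hypothetical metric line printed by the job script, e.g. "SYPD: 1.23".
METRIC_RE = re.compile(r"^\s*SYPD\s*[:=]\s*([0-9.]+)", re.MULTILINE)

def scrape_metric(log_text: str, tail_lines: int = 150) -> float | None:
    """Search the last `tail_lines` lines of a job's stdout, mirroring the
    API's (unfortunately hardcoded) 150-line window."""
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    match = METRIC_RE.search(tail)
    return float(match.group(1)) if match else None
```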
Having outputs defined like this would be useful for RO-Crate too (we define outputs and inputs there, but under …
I think this is the fastest, although probably a bit brittle. If we need it ASAP, then +1 for trying this, as long as we warn that it's a temporary solution. Otherwise, I'd say some brainstorming sessions around the …
I agree with the approach, but I am more in favour of asking for a precise format for the performance metrics (one self-contained file with the information in JSON, for example, and nothing else) instead of parsing it from the .out file, since the .out may contain other information coming from the scheduler, the parallel environment, or the script itself.
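As a sketch of that idea (the field names are invented, not an agreed format), the file would have to parse as pure JSON or be rejected outright:

```python
import json

# Invented example of a self-contained metrics file.
example = '{"metrics": {"coupling_cost": 0.12, "complexity": 34}}'

def load_metrics(text: str) -> dict:
    """Accept a file that contains JSON and nothing else: scheduler banners
    or script chatter mixed into the file make json.loads raise."""
    data = json.loads(text)
    if not isinstance(data.get("metrics"), dict):
        raise ValueError("expected a top-level 'metrics' object")
    return data["metrics"]

print(load_metrics(example))  # {'coupling_cost': 0.12, 'complexity': 34}
```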
Following the newly proposed feature, the user flow might be like this:

```yaml
EXPERIMENT:
  OUTPUTS:
    - PATH: output.json
      NAME: metrics_file
      CONTENT:
        TYPE: JSON
      LABELS:
        type: performance
        model: NEMO
```

The idea is that the output should be written at …
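A sketch of how a consumer might read that proposed section, assuming PyYAML and the key names shown above; the validation rules are placeholders, not a decided design:

```python
import yaml  # PyYAML

def parse_outputs(experiment_yaml: str) -> list:
    """Extract the proposed EXPERIMENT.OUTPUTS entries from a config file."""
    config = yaml.safe_load(experiment_yaml) or {}
    outputs = config.get("EXPERIMENT", {}).get("OUTPUTS", [])
    for out in outputs:
        # A first version might only accept JSON (and perhaps plain TXT).
        content_type = out.get("CONTENT", {}).get("TYPE", "TXT")
        if content_type not in ("JSON", "TXT"):
            raise ValueError(f"unsupported output type: {content_type}")
    return outputs
```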
Now, some points to discuss are: …
We can start with a simple, non-versioned implementation and only implement versioning if it's needed.
I would only add TXT and JSON capability for now. We should keep it simple for this first version, although all of these are very good points. Importantly, I understand that the API will need all this information locally, so Autosubmit should know that it has to retrieve these files, as it does with the logs. Regarding formats and sizes, Autosubmit should have checkers for those, to ensure it does not copy large files from the remote.
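A sketch of those checkers, with limits chosen purely for illustration:

```python
from pathlib import Path

MAX_OUTPUT_BYTES = 1 * 1024 * 1024   # 1 MiB cap -- an arbitrary placeholder
ALLOWED_SUFFIXES = {".json", ".txt"}  # TXT and JSON only, per the comment above

def check_output_file(path: Path) -> None:
    """Refuse to retrieve an output that is too large or in an unexpected
    format, before copying it from the remote platform."""
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported format: {path.suffix}")
    if path.stat().st_size > MAX_OUTPUT_BYTES:
        raise ValueError(f"output exceeds {MAX_OUTPUT_BYTES} bytes: {path}")
```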
Maybe we can start at the experiment level. Since the number of jobs can be potentially high, the job level could lead to unwanted issues depending on the final implementation.
+1
This is another good point. Should Autosubmit be responsible for moving these files from remote to local? If so, we will need to make the implementation in Autosubmit too.
We can start with the experiment level, but I think we'll need the job level soon. You can see that CSC is already proposing a metric that makes more sense at that aggregation level, since it's not static and the information can vary between jobs.
I think so. I think it should be part of the log retrieval mechanism. If the number of files is an issue, it should be an issue for Autosubmit. If we are concerned about the number of files that the API would have to check, then we should only work with files at the Autosubmit level. Then Autosubmit could insert this information into the DDBB for the API. I guess we will need a new column (in which we can store JSON outputs as the value, using an experiment/job hierarchy, since the number of metrics can vary), unless you want new tables for this.
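To make the column idea concrete, a sketch with sqlite3; the table and column names are invented, not the real Autosubmit schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real DDBB backend
conn.execute(
    """CREATE TABLE IF NOT EXISTS workflow_metrics (
           expid    TEXT NOT NULL,
           run_id   INTEGER NOT NULL,
           job_name TEXT,              -- NULL for experiment-level metrics
           metrics  TEXT NOT NULL      -- JSON blob, since metric sets vary
       )"""
)
conn.execute(
    "INSERT INTO workflow_metrics VALUES (?, ?, ?, ?)",
    ("a000", 1, None, json.dumps({"coupling_cost": 0.12})),
)
```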
In GitLab by @mcastril on May 6, 2024, 10:55
There are some performance metrics that cannot be easily calculated from the current parameters that Autosubmit stores in the DDBB in a homogeneous way across different workflows. Some examples are the Coupling Cost or the Complexity. In the past, we agreed to enable a mechanism by which the workflow would be responsible for generating this data, and Autosubmit/Autosubmit-API would make it available, in whatever way we decide, as an endpoint in the API.
For instance, the workflow can provide a YAML file written to a predetermined path (we could make it fixed, or have a parameter in the configuration to allow users to change this path) that the Autosubmit API would check when the endpoint is reached.
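A minimal sketch of that endpoint-side check; the filename and the PyYAML usage are assumptions for illustration:

```python
from pathlib import Path

import yaml  # PyYAML

def read_workflow_metrics(rootdir: str) -> dict:
    """Read the workflow-generated metrics file when the endpoint is hit.
    The filename is a placeholder; it could be fixed or made configurable."""
    metrics_file = Path(rootdir) / "metrics.yml"
    if not metrics_file.exists():
        return {}  # the workflow has not produced metrics (yet)
    return yaml.safe_load(metrics_file.read_text()) or {}
```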
As we are moving to a different DDBB backend and trying to make Autosubmit less dependent on shared files in the filesystem, I understand this can be an issue. We could decide whether it's worth having a field in the DDBB where Autosubmit stores these additional metrics in bulk (then it would be Autosubmit, and not the API, that consumes the workflow file).
This development has not moved forward in the last few years due to the lack of a real necessity, but it's a requirement from DestinE in Phase 2 (August).
Performance metrics: https://docs.google.com/document/d/12yWDwXsohf4G4MPeP6e3Eil4ZL-YeIN71dBcoWRliEg/edit
Previous decisions about how to implement this:
https://earth.bsc.es/gitlab/es/autosubmit/-/issues/674#note_160254
https://earth.bsc.es/gitlab/es/autosubmit/-/issues/524#note_90901
CC @kinow @dbeltrankyl