a0py (climatedt-vm) rebuild of the database #72

Open
LuiggiTenorioK opened this issue Mar 14, 2024 · 21 comments
Assignees
Labels
working on Someone is working on it

Comments

@LuiggiTenorioK
Member

In GitLab by @manuel-g-castro on Mar 14, 2024, 10:50

@dbeltrankyl and @LuiggiTenorioK to figure out how to repopulate the database of the production run of ifs+nemo workflow on LUMI, a0py.

All of this to get the sypd, chpy, etc.

fyi @ainagaya

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 15, 2024, 16:17

Hello @mcastril

I recovered part of the database on my laptop.

I used the last row of each TOTAL_STAT file and the .cmd files on the remote (because autosubmit inspect overwrites the local ones).

You can visualize it with sqlitebrowser

job_data_a0py.db.fixed
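
For anyone without sqlitebrowser, a minimal sketch for a first look at the recovered file from Python. It assumes nothing about the schema: it only lists whatever tables the file contains and how many rows each one has. The function name `summarize_db` is made up for this illustration.

```python
import sqlite3


def summarize_db(path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Open read-only so that inspecting the recovered file cannot modify it.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the file itself, so quote them as identifiers.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
```

Calling `summarize_db("job_data_a0py.db.fixed")` on a local copy would show at a glance whether the expected tables are populated.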

It is missing some fields like children and job_id

Some jobs didn't have a finished STAT file (this means that AS was not aware that the job was finished, I guess due to the set status issue).

Python script:

recover_stats.py
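
The "last row of TOTAL_STAT" idea can be sketched roughly as follows. This is not the attached script. It assumes, hypothetically, that each row holds up to three 14-digit `YYYYMMDDHHMMSS` stamps (submit, start, finish) followed by a status word, and that trailing fields may be missing when the job was never acknowledged as finished. The name `parse_last_row` is invented for the example.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed layout, e.g. 20240315161700


def parse_last_row(text):
    """Parse the last non-empty line of a TOTAL_STAT-style file.

    Assumed (hypothetical) row layout: "<submit> <start> <finish> <STATUS>".
    Missing trailing fields are returned as None.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return None
    fields = rows[-1]
    # 14-digit tokens are treated as timestamps, anything else as the status.
    stamps = [f for f in fields if f.isdigit() and len(f) == 14]
    status = next((f for f in fields if not f.isdigit()), None)
    keys = ("submit", "start", "finish")
    record = {key: None for key in keys}
    for key, stamp in zip(keys, stamps):
        record[key] = datetime.strptime(stamp, STAMP_FORMAT)
    record["status"] = status
    return record
```

Rows that come back with `finish` set to `None` are the ones that would need another source of information.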

@LuiggiTenorioK
Member Author

In GitLab by @kinow on Mar 18, 2024, 14:04

This script looks useful, Dani! Maybe we can include this in GitLab, somewhere like ./scripts?

> Some jobs didn't have a finished STAT file (this means that AS was not aware that the job was finished, I guess due to the set status issue)

The configuration should contain some of that, like ncpus, wallclock, qos, but it may have changed, right? Would it be possible to locate the missing information in some other logs? Like parsing the cmd file and getting the #SBATCH parameters? And get the start/finish/status/etc from the job file?
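
A minimal sketch of the "parse the cmd file" idea, assuming the .cmd files are ordinary Slurm batch scripts with `#SBATCH` header lines. The function name `parse_sbatch` is hypothetical, and options are returned exactly as written (so `-n` and `--ntasks` are not merged).

```python
import re

# Matches "#SBATCH --key=value", "#SBATCH --key value", "#SBATCH -k value"
# and bare flags such as "#SBATCH --exclusive".
SBATCH_RE = re.compile(r"^#SBATCH\s+(-{1,2}[\w-]+)(?:[=\s]\s*(\S.*?))?\s*$")


def parse_sbatch(script_text):
    """Return a dict of the #SBATCH options found in a batch script."""
    directives = {}
    for line in script_text.splitlines():
        match = SBATCH_RE.match(line.strip())
        if match:
            option, value = match.groups()
            # Flags without a value are recorded as True.
            directives[option.lstrip("-")] = value if value is not None else True
    return directives
```

Running it over the remote .cmd of each job would give the resources actually requested for that submission, independently of the current configuration.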

And great job!!!

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 18, 2024, 14:15

Hello @kinow ,

thanks!

> The configuration should contain some of that, like ncpus, wallclock, qos, but it may have changed, right? Like parsing the cmd file and getting the #SBATCH parameters?

Yes, that is why I parsed the remote .cmd files. If you check the DB, you can see that the SIM jobs have different numbers of nodes/ncpus.

The .cmd files on the local machine are not enough, because autosubmit inspect may overwrite them.

> And get the start/finish/status/etc from the job file?

This is retrieved from the TOTAL_STAT / STAT files.

My issue is with the TOTAL_STAT files. They are filled in by Autosubmit when it acknowledges that the job is ready, started, submitted, or finished, and some of them don't have the finished timestamp. I guess it happened because of the issues with all the jobs going to waiting.

@LuiggiTenorioK
Member Author

In GitLab by @kinow on Mar 18, 2024, 14:17

> My issue is with the TOTAL_STAT files. They are filled in by Autosubmit when it acknowledges that the job is ready, started, submitted, or finished, and some of them don't have the finished timestamp. I guess it happened because of the issues with all the jobs going to waiting.

Could we use one of those... sstat, sinfo, scontrol, etc., to retrieve this information directly from the platform's job scheduler?
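
A rough sketch of what such a query could look like. It swaps in `sacct`, which is not one of the tools named above but is the Slurm command that reports accounting data for jobs that have already ended. Since job_id is missing from the recovered DB, the sketch looks jobs up by name. It assumes accounting is enabled on the platform and that the records have not been purged yet; the helper names are hypothetical.

```python
import subprocess

FIELDS = ("JobID", "JobName", "Submit", "Start", "End", "State")


def build_sacct_command(job_name, since):
    """Build a sacct call for one job name, e.g. since="2024-02-01"."""
    return [
        "sacct",
        "--allocations",  # one row per job, without the individual steps
        "--noheader",
        "--parsable2",    # "|"-separated fields, no trailing delimiter
        f"--name={job_name}",
        f"--starttime={since}",
        f"--format={','.join(FIELDS)}",
    ]


def parse_sacct(output):
    """Turn parsable2 output into a list of dicts keyed by FIELDS."""
    return [
        dict(zip(FIELDS, line.split("|")))
        for line in output.splitlines()
        if line.strip()
    ]


def query_job(job_name, since):
    """Run sacct on the platform (needs Slurm) and parse its output."""
    result = subprocess.run(
        build_sacct_command(job_name, since),
        capture_output=True, text=True, check=True,
    )
    return parse_sacct(result.stdout)
```

A job that was retried would come back as several rows with the same name, so the caller would still have to decide which attempt to store.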

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 18, 2024, 16:25

I've downloaded the remote _STAT files, filled in some of the missing data, and corrected bogus timestamps like 19700101020000.

Also, I realized that I had stored the datetime where it should have been the timestamp.
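
As an illustration of both fixes (not the code in the attached script): the sketch below converts a `YYYYMMDDHHMMSS` string to integer epoch seconds and treats values such as 19700101020000 as "unset". That value is exactly epoch 0 when read in a UTC+2 zone, so it looks like a zero timestamp rendered in local time, although that is only a guess. The time zone of the stamps is an assumption and defaults to UTC here.

```python
from datetime import datetime, timedelta, timezone

STAMP_FORMAT = "%Y%m%d%H%M%S"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp(stamp, tz=timezone.utc):
    """Convert a YYYYMMDDHHMMSS string to epoch seconds.

    Returns None for epoch-zero placeholders instead of a real time.
    """
    moment = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=tz)
    # Anything within a day of 1970-01-01 is taken to be "time 0" printed
    # in some local zone, not a real job time.
    if abs(moment - EPOCH) < timedelta(days=1):
        return None
    return int(moment.timestamp())
```

For example, `to_timestamp("19700101020000")` gives `None`, while a real stamp gives an integer that can be stored directly in a timestamp column.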

recover_stats.py

job_data_a0py.db.fixed2

Edit:

still missing stuff:

a0py_20200101_fc0_39_SIM is incomplete
a0py_20200101_fc0_25_SIM is incomplete

@LuiggiTenorioK
Member Author

In GitLab by @mcastril on Mar 18, 2024, 20:11

Thank you Dani, I was reviewing the data and I was going to tell you about the timestamps.

The numbers make sense to me, but I don't see the OPA and APP jobs. I think they were missing, too.

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 19, 2024, 09:26

Ah, I only looked at the SIM ones.

I have an issue with the OPA files and APP files.

The permissions are weird:

-rw-rw----. 1 sm1       sm1                  549 Feb 27 15:21 only_lra_20200101_fc0_99_10_APP_URBAN
-rw-rw----. 1 sm1       sm1                  549 Feb 27 15:24 only_lra_20200101_fc0_99_11_APP_URBAN
-rw-rw----. 1 sm1       sm1                  549 Feb 27 15:27 only_lra_20200101_fc0_99_12_APP_URBAN
-rw-rw----. 1 sm1       sm1                  549 Feb 27 15:30 only_lra_20200101_fc0_99_13_APP_URBAN
-rw-rw----. 1 sm1       sm1                  549 Feb 27 15:33 only_lra_20200101_fc0_99_14_APP_URBAN
-rw-rw----. 1 sm1       sm1                  549 Feb 27 15:36 only_lra_20200101_fc0_99_15_APP_URBAN
-rw-rw----. 1 sm1       autosubmit_users     549 Feb 27 14:55 only_lra_20200101_fc0_99_1_APP_URBAN

I can't download some of them.

Can someone add read permission for others (or change the group to autosubmit_users) on the cmd, TOTAL_STAT, and STAT files, or on all files under /appl/AS/AUTOSUBMIT_DATA/a0py/tmp/?

FYI: @ainagaya , @franra9 , @Lerriola ?

@LuiggiTenorioK
Member Author

In GitLab by @kinow on Mar 19, 2024, 09:31

> The permissions are weird:

@dbeltrankyl, some days ago Kai mentioned autosubmit expid -y was failing. I checked the permissions, and they were like that. Henrik apologised and said there was a maintenance and they accidentally moved sm1 user to sm1 group. He fixed it, and sm1 is now in autosubmit_users group. But any files created during that window of time resulted in files with user+group sm1+sm1. Someone with the robot account should be able to change the file mode or group of those files.

@LuiggiTenorioK
Member Author

In GitLab by @ainagaya on Mar 19, 2024, 09:37

On it!

@LuiggiTenorioK
Member Author

In GitLab by @ainagaya on Mar 19, 2024, 09:39

chgrp autosubmit_users *, no? (don't want to mess up...)

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 19, 2024, 09:40

Yes, thanks!

@LuiggiTenorioK
Member Author

In GitLab by @ainagaya on Mar 19, 2024, 09:43

Done :) (in the end I had to do find . -maxdepth 1 -type f -exec chgrp autosubmit_users {} + because it was complaining with "list too long", but I think it's fine)
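
For context: `chgrp autosubmit_users *` fails in a very large directory because the shell expands `*` into more arguments than a single command may receive, whereas `find ... -exec ... {} +` passes the files in batches. A rough Python equivalent of that find command, as a sketch only, with a hypothetical function name:

```python
import os
import shutil


def chgrp_files(directory, group):
    """Set the group of every regular file directly inside `directory`.

    `group` may be a group name or a numeric gid. Returns the number of
    files changed.
    """
    changed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # Mirror `-maxdepth 1 -type f`: skip subdirectories and symlinks.
            if entry.is_file(follow_symlinks=False):
                shutil.chown(entry.path, group=group)
                changed += 1
    return changed
```

Like the find command, it only touches files at the top level of the directory and must be run by a user allowed to change their group.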

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 19, 2024, 09:45

Yes, thanks Aina. Now I can download them.

@LuiggiTenorioK
Member Author

In GitLab by @ainagaya on Mar 21, 2024, 09:35

Hi, I got this message from Kai:

we need the AS statistics to get working... I mean the AS statistics for the scenario, so that we can see how the failures evolve etc

Is this related to this issue? Sorry I'm a bit lost.

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 21, 2024, 09:42

I think so,

Are you using v4.1.2 yet? v4.1.2 should be able to store the running jobs (well, if they start and finish with 4.1.2).

If not, I need to add this week's SIM chunks to the .fixed DB.

@LuiggiTenorioK
Member Author

In GitLab by @ainagaya on Mar 21, 2024, 09:51

a0py still uses the dev-8 version. Do you think it is safe to migrate?

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Mar 21, 2024, 11:47

I think so. We talked about this in the AS meeting, and Miguel mentioned discussing it in the afternoon meeting.

@LuiggiTenorioK
Member Author

In GitLab by @mcastril on Apr 4, 2024, 14:43

Hi Dani, what's the current DB status?

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Apr 4, 2024, 15:04

Hello,

I did not touch the original one, but I have a local copy with OPA, APP, and SIM data.

I can apply the fix at any moment (I need to download the latest info). Is the production experiment running (or has it finished?) with 4.1.2+? If so, I can apply the fix tomorrow.

@LuiggiTenorioK
Member Author

In GitLab by @dbeltrankyl on Apr 4, 2024, 15:11

I see that it is still using the dev one. In that case, I can apply the fix, but we will have the same issue for newer jobs. If that is okay, I can upload it anyway.

@LuiggiTenorioK
Member Author

In GitLab by @mcastril on Apr 24, 2024, 12:52

a0py has been using v4.1.2 for a few weeks now.
