gfsmetp jobs do not generate stats files on Hera and Hercules #2759

RussTreadon-NOAA · 2024-07-11T17:32:30Z

What is wrong?

gfsmetp jobs run in Hera testprepobs and Hercules testprepobsherc (see issue #2694) finish with status=0 (success) but no metplus stats files are generated. For example, Hera 2021122200/gfsmetpg2g1.log contains

gfsmetpg2g1.log:** WARNING: /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/testprepobs/testprepobs_2021122200.stat was not generated or zero size
gfsmetpg2g1.log:** WARNING: /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/pres/testprepobs/testprepobs_2021122200.stat was not generated or zero size
gfsmetpg2g1.log:** WARNING: /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/sfc/testprepobs/testprepobs_2021122200.stat was not generated or zero size

What should have happened?

gfsmetp jobs run in WCOSS2 (Dogwood) testprepobsherc (see issue #2694) finish with status=0 (success) AND create non-zero length metplus stats files. for example,

ls -l /lfs/h2/emc/global/noscrub/kate.friedman/archive/metplus_data/by_VSDB/grid2grid/anom/00Z/devcycprepobs/
total 180
-rw-r--r-- 1 kate.friedman global  59967 Jul  8 17:36 devcycprepobs_20211221.stat
-rw-r--r-- 1 kate.friedman global 119726 Jul  8 21:17 devcycprepobs_20211222.stat

Non-zero size stats files should also be created in the Hera and Hercules parallels.

What machines are impacted?

Hera, Hercules

Steps to reproduce

Set up and run the parallels described in issue #2694

Additional information

gfsmetp*log files on Hera and Hercules both contain srun: error messages. For example,

Hera(hfe03):/scratch2/NCEPDEV/stmp1/Kate.Friedman/comrot/testprepobs/logs/2021122200$ grep "srun: error" gfsmetp*log
gfsmetpg2g1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_job_scripts/poe_jobs1 invalid, no record for task id 1
gfsmetpg2g1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_job_scripts/poe_jobs2 invalid, no record for task id 1
gfsmetpg2g1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_job_scripts/poe_jobs3 invalid, no record for task id 1
gfsmetpg2o1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2o1.693766/grid2obs_step1/metplus_job_scripts/poe_jobs1 invalid, no record for task id 1
gfsmetpg2o1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2o1.693766/grid2obs_step1/metplus_job_scripts/poe_jobs2 invalid, no record for task id 1
gfsmetppcp1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metppcp1.694915/precip_step1/metplus_job_scripts/poe_jobs1 invalid, no record for task id 1

Do you have a proposed solution?

Why is the srun: error message generated? Would fixing this error result in metplus stats files being generated?

The text was updated successfully, but these errors were encountered:

CatherineThomas-NOAA · 2024-07-11T18:39:03Z

Confirming that I am seeing the same thing in my C384 experiments, both hybrid ATM-only and 3DVar S2S. Those experiments use workflows from July 5 and July 3 respectively.

RussTreadon-NOAA · 2024-07-11T18:39:26Z

Added the following lines back to the metp section of config.resources

    export npe_metp_gfs=4
    export npe_node_metp_gfs=4

Rerun of Hera 2024050400 gfsmetpg2g1. This time the job generated metplus stats files.

Hera(hfe07):/scratch1/NCEPDEV/da/role.jedipara/archive/metplus_data/by_VSDB/grid2grid$ ls -l */00Z/prjedi/
anom/00Z/prjedi/:
total 112
-rw-r--r-- 1 role.jedipara da 113803 Jul 11 18:30 prjedi_20240504.stat

pres/00Z/prjedi/:
total 308
-rw-r--r-- 1 role.jedipara da 313780 Jul 11 18:30 prjedi_20240504.stat

sfc/00Z/prjedi/:
total 168
-rw-r--r-- 1 role.jedipara da 171584 Jul 11 18:30 prjedi_20240504.stat

For details see /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/prjedi/logs/2024050400/gfsmetpg2g1.log.

The _gfs lines were removed from the metp section of config.resources at b902c0b. See g-w PR #2672 for details.

DavidHuber-NOAA · 2024-07-11T18:54:25Z

@RussTreadon-NOAA Interesting. The issue appears to reside in verif-global.fd/ush/run_verif_global_in_global_workflow.sh, line 252:

export nproc=${npe_node_metp_gfs:-1}

This should probably be looking for npe_metp rather than npe_node_metp. Also, this should probably be a variable exported in the env files rather than cross-referencing a resource variable from the global-workflow. I will make updates to verif-global and the global-workflow then test this.

RussTreadon-NOAA · 2024-07-11T19:24:27Z

@DavidHuber-NOAA , your solution works. I made the recommended changes in a working copy of g-w on Hera:

return config.resources to its previous state .. that is, remove the _gfs variables from the metp section
comment out the nproc line in verif-global.fd/ush/run_verif_global_in_global_workflow.sh
add a metp section to HERA.env containing export nproc=${npe_metp:-1}

The prjedi 2024050400 gfsmetp jobs were rewound and rebooted. Each successfully ran to completion with metplus stats files generated.

CatherineThomas-NOAA · 2024-07-11T19:41:34Z

One thing that concerns me is that the metp jobs did not produce any stat files, yet the jobs did not fail and were marked "SUCCEEDED". Is there an easy way we can improve the scripts to catch this in the future?

RussTreadon-NOAA · 2024-07-12T10:07:03Z

I agree @CatherineThomas-NOAA . The error code 0 status for the failed gfsmetp jobs is troubling. I only noticed this silent fail because I am running a short ATM JEDI parallel and as was checking all the output to ensure jobs ran correctly.

The failed gfsmetp jobs contain warnings that no stats files were produced. For example the 2024050400 gfsmetpg2g1.log.3 contains

+ exgrid2grid_step1.sh[106]: python /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py
BEGIN: copy_stat_files.py
**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1223238/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/prjedi/prjedi_2024050400.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1223238/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/pres/prjedi/prjedi_2024050400.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1223238/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/sfc/prjedi/prjedi_2024050400.stat was not generated or zero size
**************************************************

END: copy_stat_files.py

Script copy_stat_files.py could be modified to generate an error code when the above situation is encountered. That being said, perhaps there are reasons this is not done in the current version of copy_stat_files.py. I don't know the development history of nor the uses cases for copy_stat_files.py.

Tagging @malloryprow for awareness.

malloryprow · 2024-07-12T11:18:17Z

Hi @RussTreadon-NOAA, I think we had the gfsmetp jobs fail silently so they didn't hold up the rest of the workflow if things failed. I think this is something that was discussed a long time ago.

Did you find why the jobs failed?

RussTreadon-NOAA · 2024-07-12T11:47:04Z

@malloryprow , yes @DavidHuber-NOAA found the reason for the failure (see above).

The danger with silent failures is that a developer could run a parallel assuming gfsmetp success means stats are generated only to find that the stats are missing when (s)he goes to plot them. Of course, generating the stats after the fact isn't hard. It just takes additional time.

malloryprow · 2024-07-12T12:46:47Z

Ah missed those! I definitely get not wanting the silent failure. It should be something easily fixed. If no stat files were copied, exit with an error code. Does that sound appropriate?

DavidHuber-NOAA · 2024-07-12T13:09:43Z

I have a couple of ideas on how to resolve this and maintain the ability to push past failed METplus jobs:

Add a configuration flag, say CHECK_METplus, that will cause a failure in the METplus jobs if the expected data file is missing or size 0 (perhaps more robust checking could be added as well for valid entries).
A file check could be added to the cleanup rocoto XML so users would receive an email if there was a failure. Something like:

<sh shell="/bin/sh"><cyclestr>&HOMEgfs;/ush/check_metp.sh &ARCDIR;/metplus_data/by_VSDB/grid2grid/anom/@HZ/&pslot;/&pslot;_@Y@[email protected]</cyclestr></sh>

check_metp.sh would then consist of something like the following:

filename=${1:-""}
[ -z "${filename}" ] && echo "METplus filename is an empty string" && exit 0
[ ! -f "${1}" ] && echo "METplus output file $1 does not exist!" && exit 0
[ -f "$1" ] && echo "METplus output file $1 is zero-sized!" && exit 0

malloryprow · 2024-07-12T13:13:43Z

I can see that being a helpful avenue. Is that something that would run at the end of every cycle? I think checks would be helpful too for the grid2obs and precip stat files too.

DavidHuber-NOAA · 2024-07-12T13:17:02Z

@malloryprow Yes, it would run at the end of every cycle after all jobs for a particular RUN (gdas, gfs) are complete. It would only be executed if METplus was part of that RUN. I agree, it would be helpful for the other METplus products.

malloryprow · 2024-07-12T13:22:10Z

I think it may be better to include in copy_stat_files.py. The metp task don't produce anything for the gdas cycles or every gfs cycle.

DavidHuber-NOAA · 2024-07-12T13:28:34Z

OK, noted. I could code it so that it only runs for gfs cycles. Is it predictable which gfs cycles will produce data?

malloryprow · 2024-07-12T13:49:57Z

Having to dust some cobwebs off my brain here since it has been a good while since I have ran anything with the global workflow...if only one gfs cycle is being run it will run for that one every time, if more than one it should all run at the 00Z cycle. Kind of weird but when I first put the gfsmetp tasks in this is how VSDB was set up to run so I copied that "style".

DavidHuber-NOAA · 2024-07-12T13:56:25Z

Interesting. In that case, I agree that putting something in the XML would be quite challenging and instead placing a check in copy_stat_files.py makes sense. A failed status could then be issued if desired.

malloryprow · 2024-07-12T16:31:01Z

@RussTreadon-NOAA Would you be able to copy /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/get_data_files.py on Hera into the needed location, and then rerun the gfsmetp tasks in the configuration it was when it failed?

RussTreadon-NOAA · 2024-07-12T16:36:01Z

@malloryprow , I no longer have a g-w configuration which fails. My local copy of g-w contains the changes @DavidHuber-NOAA recommended.

malloryprow · 2024-07-12T16:38:51Z

Ah okay. The changes I made worked when running standalone but wanted confirmation that the desired behavior also happens in the global workflow.

CatherineThomas-NOAA · 2024-07-12T17:15:16Z

@malloryprow I still have my test that I can set to run another few cycles. I'll copy that file and let you know how it goes.

CatherineThomas-NOAA · 2024-07-12T17:18:28Z

@malloryprow: Is it get_data_files.py or copy_stat_files.py that I should be copying? The file listed is the same as the version in my clone.

RussTreadon-NOAA · 2024-07-12T17:19:37Z

@malloryprow . I broke my configuration by reverting @DavidHuber-NOAA 's changes. I reran the 2024050500 gfsmetpg2g1. Warning messages were printed but the job still completed with error code 0.

I then realized that your comment refers to /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/get_data_files.py, not copy_stats_files.py.

I copied /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/copy_stats_files.py to my HOMEgfs and reran 2024050500 gfsmetpg2g1. Interestingly, the job still finished with error code 0 even though warning messages were written to the log file

+ exgrid2grid_step1.sh[106]: python /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py
BEGIN: copy_stat_files.py
**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1632198/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1632198/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/pres/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1632198/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/sfc/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

END: copy_stat_files.py
+ exgrid2grid_step1.sh[107]: status=0

I looked at the changes you made to copy_stat_files.py.

Your modified script checks the destination directory for stat files. Since my previous run with Dave's fixes worked, the stat files are already in the destination directory. I moved the existing stat files to a new name and reran the job. The gfsmetpg2g1 still finished with error code 0

I looked again at your modified copy_stat_files.py. It uses a wildcard to check for any stat files in the destination directory. I failed to move all the 20240505 stat files in the destination directory. I did this and tried one more time.

This time gfsmetpg2g1 failed with

+ exgrid2grid_step1.sh[106]: python /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py
BEGIN: copy_stat_files.py
**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1174486/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

ERROR: No stat files matching /scratch1/NCEPDEV/da/role.jedipara/archive/prjedi/../metplus_data/by_VSDB/*/*/*Z/prjedi/prjedi_20240505.stat
+ exgrid2grid_step1.sh[107]: status=1

We should change the logic in copy_stat_files.py to check for each expected output file not just any file. This complicates your scripting but ensures nothing slips through the cracks.

@DavidHuber-NOAA recommends a change to verif-global.fd/ush/run_verif_global_in_global_workflow.sh. Do you have this change?

malloryprow · 2024-07-12T17:22:41Z

@malloryprow: Is it get_data_files.py or copy_stat_files.py that I should be copying? The file listed is the same as the version in my clone.

Ah sorry @CatherineThomas-NOAA. You're right it should be copy_stat_files.py

malloryprow · 2024-07-12T17:24:55Z

@RussTreadon-NOAA Thanks for testing. I see what you are saying. Let me add that in and update you.

And yes I see

@RussTreadon-NOAA Interesting. The issue appears to reside in verif-global.fd/ush/run_verif_global_in_global_workflow.sh, line 252:

export nproc=${npe_node_metp_gfs:-1}
This should probably be looking for npe_metp rather than npe_node_metp. Also, this should probably be a variable exported in the env files rather than cross-referencing a resource variable from the global-workflow. I will make updates to verif-global and the global-workflow then test this.

malloryprow · 2024-07-12T17:32:39Z

@RussTreadon-NOAA @CatherineThomas-NOAA

/scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/copy_stats_files.py updated! Apologies for stating the wrong file earlier.

RussTreadon-NOAA · 2024-07-12T17:59:31Z

@malloryprow , your updated /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/copy_stat_files.py (no "s" after "stat") fails with

Traceback (most recent call last):
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py", line 1\
35, in <module>
    sys.exit(1)
    ^^^
NameError: name 'sys' is not defined

I added import sys after import glob at the top of copy_stat_files.py. A reran successfully detected one missing stat file in my test.

malloryprow · 2024-07-12T18:13:58Z

Ah! Fixed that. Glad it detected the missing file.

RussTreadon-NOAA · 2024-07-16T19:52:23Z

@malloryprow , the local modifications to EMC-verif made in light of the above discussion are on Hera in /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd. Two files are modified

ush/copy_stat_files.py - improved error handling for missing output stat files
ush/run_verif_global_in_global_workflow.sh - remove nproc.

@DavidHuber-NOAA , the change to ush/run_verif_global_in_global_workflow.sh requires the addition of nproc in a new metp section of g-w env/${machine}.env files. An example of the change is in env/HERA.env in /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/env

malloryprow · 2024-07-17T13:18:25Z

Thanks @RussTreadon-NOAA! I got the changes into the develop branch of EMC_verif-global at NOAA-EMC/EMC_verif-global@7118371.

@KateFriedman-NOAA updated the hash for EMC_verif-global :) The new hash is different from what I linked because it includes changes for running EMC_verif-global standalone on Orion following the Rocky upgrade. The hash has the needed changes.

DavidHuber-NOAA · 2024-07-17T13:29:56Z

@malloryprow I left a comment in the commit you referenced.

RussTreadon-NOAA added bug Something isn't working triage Issues that are triage labels Jul 11, 2024

RussTreadon-NOAA mentioned this issue Jul 11, 2024

Orion: Migration to Rocky9 OS #2694

Closed

DavidHuber-NOAA self-assigned this Jul 11, 2024

DavidHuber-NOAA removed the triage Issues that are triage label Jul 11, 2024

malloryprow mentioned this issue Jul 12, 2024

Add exit statements in copy_stat_files.py NOAA-EMC/EMC_verif-global#128

Closed

DavidHuber-NOAA assigned malloryprow Jul 16, 2024

DavidHuber-NOAA mentioned this issue Jul 17, 2024

Set METplus process count in config.metp; add verif-global support for Rocky 9 #2774

Merged

6 tasks

DavidHuber-NOAA closed this as completed in #2774 Jul 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gfsmetp jobs do not generate stats files on Hera and Hercules #2759

gfsmetp jobs do not generate stats files on Hera and Hercules #2759

RussTreadon-NOAA commented Jul 11, 2024

CatherineThomas-NOAA commented Jul 11, 2024

RussTreadon-NOAA commented Jul 11, 2024

DavidHuber-NOAA commented Jul 11, 2024

RussTreadon-NOAA commented Jul 11, 2024

CatherineThomas-NOAA commented Jul 11, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024 •

edited

Loading

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

CatherineThomas-NOAA commented Jul 12, 2024

CatherineThomas-NOAA commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

malloryprow commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 16, 2024

malloryprow commented Jul 17, 2024

DavidHuber-NOAA commented Jul 17, 2024

gfsmetp jobs do not generate stats files on Hera and Hercules #2759

gfsmetp jobs do not generate stats files on Hera and Hercules #2759

Comments

RussTreadon-NOAA commented Jul 11, 2024

What is wrong?

What should have happened?

What machines are impacted?

Steps to reproduce

Additional information

Do you have a proposed solution?

CatherineThomas-NOAA commented Jul 11, 2024

RussTreadon-NOAA commented Jul 11, 2024

DavidHuber-NOAA commented Jul 11, 2024

RussTreadon-NOAA commented Jul 11, 2024

CatherineThomas-NOAA commented Jul 11, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024 • edited Loading

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

DavidHuber-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

CatherineThomas-NOAA commented Jul 12, 2024

CatherineThomas-NOAA commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

malloryprow commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 12, 2024

malloryprow commented Jul 12, 2024

RussTreadon-NOAA commented Jul 16, 2024

malloryprow commented Jul 17, 2024

DavidHuber-NOAA commented Jul 17, 2024

malloryprow commented Jul 12, 2024 •

edited

Loading