
Sync meeting on EESSI test suite (2024 08 08)

Planning

  • every 2 weeks on Thursday at 14:00 CE(S)T
  • next meetings:
    • Thu 16 Aug'24 14:00 CEST

Meeting (2024-08-08)

Attending: Sam Moors, Caspar van Leeuwen, Lara Peeters, Satish Kamath

  • Merged PRs

    • PyTorch test was merged #130
    • Bug fix: Find & report duplicate modules #167. Prints a warning if two modules with the same name are found on the MODULEPATH, since our find_modules hook would create two test instances for it, while both would actually run with the first module on the MODULEPATH (a rough illustration of such a duplicate check is sketched after this list)
    • CP2K test was merged #133
    • Update Snellius config for H100 nodes #165
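
As an aside, a rough illustration of the kind of duplicate check described above (not the actual implementation of #167): scan every entry on $MODULEPATH and warn when the same module file name shows up more than once.

```python
# Illustration only, not the code from #167: warn when the same module name
# appears in more than one MODULEPATH entry, since find_modules would then
# generate two test instances that both end up running the first match.
import os
from collections import defaultdict


def find_duplicate_modules():
    seen = defaultdict(list)
    for moddir in os.environ.get('MODULEPATH', '').split(':'):
        if not os.path.isdir(moddir):
            continue
        for root, _dirs, files in os.walk(moddir):
            for fname in files:
                modname = os.path.relpath(os.path.join(root, fname), moddir)
                seen[modname].append(moddir)
    return {name: dirs for name, dirs in seen.items() if len(dirs) > 1}


for name, dirs in find_duplicate_modules().items():
    print(f'WARNING: module {name} found in multiple MODULEPATH entries: {dirs}')
```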
  • Open PRs

    • MetalWalls: needs review, who? => Caspar will try to run it. It builds on an hpctestlib case, so the test-suite part of this looks pretty clean (calling hooks, that's about it)

    • LAMMPS:

      • Sanity check needs to be added for Total Energy (a rough sketch of what such a check could look like is included after this list)
      • Caspar will check the values in the report for 1 core, 1 node and 16 nodes for both systems, to see how much Total Energy varies
      • Caspar will check if he has another version of LAMMPS to run this with, and if that produces other Total Energy values
      • Caspar will try GPU runs as well
      • Lara: add a check whether the CUDA-based module is built with Kokkos
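
For discussion, a rough sketch of what such a Total Energy sanity check could look like in ReFrame; the regex, reference value and tolerance are placeholders (not taken from the actual LAMMPS test) and should be filled in once the runs above show how much the value varies.

```python
import reframe as rfm
import reframe.utility.sanity as sn
from reframe.core.builtins import sanity_function


class LAMMPSTotalEnergyMixin(rfm.RegressionMixin):
    """Placeholder mixin sketching a Total Energy sanity check."""

    @sanity_function
    def assert_total_energy(self):
        # Extract the Total Energy value from the LAMMPS output; the exact
        # pattern depends on the thermo output of the chosen input systems.
        total_energy = sn.extractsingle(
            r'Total Energy\s*[=:]\s*(?P<energy>\S+)',
            self.stdout, 'energy', float
        )
        # -4.62 and the 5% tolerance are made-up placeholders; they should be
        # set per input system once we know how much Total Energy varies
        # across core/node counts, LAMMPS versions and CPU/GPU runs.
        return sn.assert_reference(total_energy, -4.62, -0.05, 0.05)
```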
    • mpi4py [WIP]:

      • Part of the Tutorial, not such an important test on its own, so no rush
    • Adapt common_eessi_init to allow using local modules more easily #166

      • Sam: add changes for other config files too
      • Sam: try to rephrase the warning so that it is clear that what the user is doing is still OK, as long as they did not intend to run with the EESSI modules.
  • No progress on Tutorial for writing a portable test

  • Apply memory limits using memory hook for all tests

    • Caspar will go through the tests and update them where needed
    • Suggestion: run top periodically and dump its output to figure out the maximum memory usage, e.g. for i in {1..4}; do sleep 0.1 && top -b -n1 | grep "MiB Mem" ; done > cron.txt or, for a specific process, for i in {1..4}; do sleep 0.1 && top -b -n1 -p <pid>; done > cron.txt. More info: https://www.tecmint.com/save-top-command-output-to-a-file/
    • Suggestion 2: get it directly from /proc/<pid>/status
    • Suggestion 3: get the maximum usage from the cgroup at the end of the job (see the sketch after this list): cat /sys/fs/cgroup/memory/$(</proc/self/cpuset)/memory.max_usage_in_bytes
    • EDIT 12-08-2024 by Caspar: I've tested all those options; only suggestion 3 works. (1) gives you the total memory usage on the node - including OS processes, other users (if it's a shared node), etc. That is not representative of what the application needs, and OS processes do not have to be included in e.g. a SLURM memory request. (2) doesn't work because cat /proc/<pid>/status only gives you a single sample of the memory usage, at the moment you cat the file. If you execute it after the parallel command has finished, the file no longer exists, so that also fails. Option (3) seems to return a reasonable number for LAMMPS (2.8 GB for 24 ranks). I haven't tested its reliability, but I could double check against the memory requests we do for e.g. CP2K (where Sam has tested the requirements pretty extensively).
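
A minimal sketch of suggestion 3 in Python, assuming a cgroup v1 memory controller and that the job's memory cgroup has the same name as its cpuset cgroup (as in the shell one-liner above); file names and paths differ on cgroup v2 systems.

```python
# Sketch (cgroup v1 only): read the peak memory usage of the current job's
# cgroup at the end of the job, mirroring the shell one-liner above.
def max_cgroup_mem_bytes():
    with open('/proc/self/cpuset') as f:
        cgroup = f.read().strip().lstrip('/')
    path = f'/sys/fs/cgroup/memory/{cgroup}/memory.max_usage_in_bytes'
    with open(path) as f:
        return int(f.read())


if __name__ == '__main__':
    print(f'Peak cgroup memory usage: {max_cgroup_mem_bytes() / 2**30:.2f} GiB')
```

If this turns out to be reliable, the reported peak could be compared against the memory we request via the memory hook for each test.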
  • OpenFOAM test

    • Satish has a test case that works, but no ReFrame test yet => no progress
      • Switching to a different test case, inspired by the ExaFOAM project (lid-driven cavity case).
  • Sam noticed loading the pilot repo doesn't work anymore

    • Lara: it requires setting EESSI_FORCE_PILOT if you are sure you still want to use the pilot repository