Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build script and Nexus support for CSCS Piz Daint #4068

Open
wants to merge 29 commits into
base: develop
Choose a base branch
from

Conversation

zenandrea
Copy link
Contributor

Please review the developer documentation
on the wiki of this project that contains help and requirements.

Proposed changes

Describe what this PR changes and why. If it closes an issue, link to it here
with a supported keyword.

Added a configuration script for Daint (https://user.cscs.ch/access/running/piz_daint/) in
config/build_cscs_daint.sh
and added daint in the nexus machines

What type(s) of changes does this code introduce?

Delete the items that do not apply

  • Build related changes
  • Other (please describe): new machine in nexus

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Daint

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with current the current state of 'develop'
  • No. Code added or changed in the PR has been clang-formatted
  • No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • No. Documentation has been added (if appropriate)

@ye-luo
Copy link
Contributor

ye-luo commented Jun 20, 2022

Next time. Please always start making a PR by starting from develop to to help keep a clean history.
Don't need to change this time. I will enable squash merge.

@prckent prckent changed the title Daint Build script and Nexus support for CSCS Piz Daint Jun 20, 2022
@ye-luo ye-luo enabled auto-merge (squash) June 20, 2022 16:22
ye-luo
ye-luo previously approved these changes Jun 20, 2022
@prckent
Copy link
Contributor

prckent commented Jun 20, 2022

Thanks Andrea. I changed the title to be more informative. It makes e.g. writing the release notes much easier.

@prckent
Copy link
Contributor

prckent commented Jun 20, 2022

Test this please

@ye-luo
Copy link
Contributor

ye-luo commented Jun 20, 2022

@zenandrea ntest_nexus_machines seems failing.

649/914 Test #1645: ntest_nexus_machines ..............................................................................***Failed    3.02 sec
Test name     : machines
Test sublabel : test_process_job
Test exception: "AssertionError: "
Test backtrace:
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 478, in run
    self.operation()
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 991, in machines
    nunit('process_job')
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 349, in nunit
    run_external_unit_test(test_name,unit_test)
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 388, in run_external_unit_test
    unit_test()
  File "/__w/qmcpack/qmcpack/nexus/tests/unit/test_machines.py", line 923, in test_process_job
    assert(job.processes==job.nodes*job.processes_per_node)
Test status: fail

@prckent
Copy link
Contributor

prckent commented Jun 28, 2022

I took a look at this and see that it is correctly failing in test_machines.py around line 923. This checks for consistency between the total and per node process counts. The test uses random values that seemingly can't be satisfied the way the Daint machine is current implemented. I suggest to sync with @jtkrogel about what to do here. I don't see another machine with similar logic, so perhaps the test should be revised or some assumptions about how these machine specifications are written needed to be documented.

        # perform idempotency test
        machine_idempotent = True
        for job_input in job_inputs:
            job = Job(machine=machine.name,**job_input)
            assert(isinstance(job.processes,int))
            assert(isinstance(job.nodes,int))
            if job.processes_per_node is not None:
                assert(job.processes==job.nodes*job.processes_per_node)

@jtkrogel
Copy link
Contributor

@zenandrea the needed updates to the machines tests can be found the following way:

nxs-test -R machines --job_ref_table | grep daint

Use the results printed with this command (and also the full statement bracketed by '''...''' for daint in the non-grepped output) to update the reference table in nexus/tests/unit/test_machines.py. In that file you will find similar information already present for all other machines (e.g. search on summit).

Following the addition of these lines, nxs-test -R machines should pass.

auto-merge was automatically disabled July 26, 2022 14:06

Head branch was pushed to by a user without write access

@jtkrogel
Copy link
Contributor

This PR should go in. If there are any dangling issues, I will resolve them afterwards.

@prckent
Copy link
Contributor

prckent commented Jun 23, 2023

Did the Nexus tests ever pass?

@prckent
Copy link
Contributor

prckent commented Jun 23, 2023

(I have hopefully fixed the conflict correctly. Will see if the CI passes.)

@prckent
Copy link
Contributor

prckent commented Jun 23, 2023

Test this please

@prckent
Copy link
Contributor

prckent commented Jun 23, 2023

The following tests FAILED:
	1549 - ntest_nexus_machines (Failed)
Errors while running CTest

We can't merge this without either breaking the CI for ~every build or exempting Nexus from testing. Better to fix here. A short write up of how to correctly add a machine would allow me to fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants