Add configuration file for Graham (#169) #172
base: master
Commits on Jul 31, 2017
- Commit bd86e21
Commits on Aug 8, 2017
- Commit 1168e3e
- Commit 243b191
Commits on Aug 9, 2017
- Commit 0a16243
Commits on Aug 11, 2017
- Commit 6f31489
Commits on Sep 1, 2017
- Commit 3b0dd0b
- Commit e8e6ec0
Commits on Sep 19, 2017
- Updated tests to skip on Graham and Cedar; smartdispatch modified to handle Slurm clusters
  Commit 04f0ff1
Commits on Sep 26, 2017
- Commit d7d0300
Commits on Oct 6, 2017
- Commit c7bb250
Commits on Oct 10, 2017
- Commit a20405b
Commits on Oct 16, 2017
- Commit 5bde1c6
- Commit 0ff4776
- Why: For each option, add_sbatch_option would add the option in both forms, --[OPTION_NAME] and [OPTION_NAME].
  Commit d1ad338
- It will need many conversions, not only for resources, so it is better to make it clean.
  Commit 899167f
- Remove queue name for Slurm clusters
  Slurm has no queues, so the PBS option -q is invalid and cannot be converted.
  Commit 581f835
- Replace PBS_JOBID with SLURM_JOB_ID
  $PBS_JOBID was used to set the stdout/stderr of the job as well as in the commands. Replace these uses with $SLURM_JOB_ID. Also, workers were accessing os.environ[PBS_JOBID], so we added a second lookup on SLURM_JOB_ID in case PBS_JOBID is not defined in os.environ.
  Commit 1d74e1a
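The fallback lookup described in this commit could look roughly like the sketch below. It is only an illustration: the helper name `get_job_id` and its `environ` parameter are assumptions made for this example, not names taken from the pull request.

```python
import os


def get_job_id(environ=None):
    """Return the scheduler job id, trying PBS first and then Slurm.

    Hypothetical helper: the name and signature are illustrative only.
    """
    env = os.environ if environ is None else environ
    # PBS/Torque exports PBS_JOBID; Slurm exports SLURM_JOB_ID instead.
    for name in ("PBS_JOBID", "SLURM_JOB_ID"):
        value = env.get(name)
        if value:
            return value
    # Neither variable is set (for example when running outside a job).
    return None
```

Passing a plain dictionary as `environ` makes the lookup easy to exercise without touching the real process environment.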
- Add PBS_FILENAME definition to pbs.prolog
  Slurm cannot be passed environment variables defined locally on the command line, as PBS_FILENAME is. To bypass this, we add a definition in the prolog, making PBS_FILENAME available to all commands and the epilog. NOTE: We leave the PBS_FILENAME definition on the command line too, so that any user using $PBS_FILENAME inside a custom pbsFlag can still do so.
  Commit f00e877
- Fix env var export option for Slurm
  The PBS option -V is not converted properly to SBATCH --export ALL. We remove it and replace it with --export=ALL in the sbatch options.
  Commit 6b2d530
- Slurm does not set an environment variable equivalent to PBS_WALLTIME. To avoid confusion, all PBS_WALLTIME variables are renamed to SBATCH_TIMELIMIT (the environment variable one would use to set --time with sbatch). As SBATCH_TIMELIMIT is not set automatically, we add it to the prolog to make it available to all commands and the epilog. NOTE: PBS_WALLTIME is set in seconds, but we only have HH:MM:SS-like strings at the time of building the PBS file. We needed to add a walltime_to_seconds helper function to convert HH:MM:SS-like strings into seconds, so that SBATCH_TIMELIMIT is set in seconds like PBS_WALLTIME.
  Commit 21df3dd
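A minimal sketch of such a conversion is shown below. Only the function name `walltime_to_seconds` comes from the commit message; the body is an assumption about how it might work, not the code from the pull request. It accepts "HH:MM:SS" and "DD:HH:MM:SS" strings, and accepting shorter forms such as "MM:SS" is an extra assumption of this sketch.

```python
def walltime_to_seconds(walltime):
    """Convert a colon-separated walltime string into a number of seconds.

    Illustrative re-implementation, not the project's actual helper.
    """
    parts = [int(part) for part in walltime.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("Unsupported walltime format: %r" % walltime)
    # Left-pad with zeros so the fields are always days, hours, minutes, seconds.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    # The nesting of the parentheses matters: each unit is folded into the
    # next smaller one before multiplying again.
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

For example, "01:00:00" gives 3600 and "01:00:00:00" gives 86400.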
- Commit ea1d5b3
- Make get_launcher more flexible
  It is possible to query the system to see whether a command is available using distutils.spawn.find_executable(command_name). Clusters where more than one launcher is available will still get their launcher selected by string matching on the cluster name. For instance, get_launcher("helios") would always return msub no matter what is available on the system.
  Commit adb8cba
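The selection logic described here might be sketched as follows. This is an assumption-laden illustration rather than the code from the pull request: it swaps `distutils.spawn.find_executable` for `shutil.which` (the `distutils` package was removed in Python 3.12), the probing order of `sbatch` before `qsub` is a guess, the `which` parameter exists only to make the sketch testable, and returning `None` when nothing is found follows the later commit in this list.

```python
import shutil


def get_launcher(cluster_name=None, which=shutil.which):
    """Pick a job launcher command for the current system.

    Illustrative sketch; the real function's signature may differ.
    """
    # Clusters with several launchers installed keep a fixed, name-based choice.
    if cluster_name == "helios":
        return "msub"
    # Otherwise, probe the system for known launcher commands.
    # The order of the candidates is an assumption of this sketch.
    for command in ("sbatch", "qsub"):
        if which(command) is not None:
            return command
    # No launcher found: callers can still generate files without launching.
    return None
```

Injecting `which` lets the behaviour be checked on a machine where neither launcher is installed.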
- Add verbosity to smart-dispatch
  It is difficult to debug resuming because important processes take place in the PBS script built automatically by SmartDispatch. We add a verbose option to the smart-dispatch script and add debugging prints in the epilog.
  Commit f3661ba
- Commit 972a1ab
- Add support for SlurmJobGenerator
  JobGenerators are selected by job_generator_factory based on the cluster's name. We use a more flexible, duck-typing approach for Slurm clusters. If the cluster name is not known, or does not match any of the if clauses in the factory, then we look at which launchers are available on the system. If it is sbatch, then a SlurmJobGenerator is built; otherwise, a JobGenerator.
  Commit 29973b0
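The fallback order described in this commit can be sketched as below. The class names `JobGenerator` and `SlurmJobGenerator` and the factory name come from the commit message, but everything else is assumed for illustration: the classes are empty stand-ins, `KnownClusterJobGenerator` and the `KNOWN_CLUSTERS` table are hypothetical, and the simplified signature (a cluster name plus an already-detected launcher) is not the project's real one.

```python
class JobGenerator:
    """Stand-in for the default (PBS) job generator."""


class SlurmJobGenerator(JobGenerator):
    """Stand-in for the Slurm-specific job generator."""


class KnownClusterJobGenerator(JobGenerator):
    """Hypothetical generator for a cluster that is matched by name."""


# Hypothetical table replacing the if clauses mentioned in the commit message.
KNOWN_CLUSTERS = {"known-cluster": KnownClusterJobGenerator}


def job_generator_factory(cluster_name=None, launcher=None):
    """Choose a generator by cluster name first, then by available launcher."""
    if cluster_name in KNOWN_CLUSTERS:
        return KNOWN_CLUSTERS[cluster_name]()
    # Unknown cluster: fall back on what the system offers.
    if launcher == "sbatch":
        return SlurmJobGenerator()
    return JobGenerator()
```

Keeping launcher detection outside the factory means the choice can be exercised without a scheduler installed.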
- Print stderr when both qsub and sacctmgr fail
  The command `sacctmgr` fails on some computers (namely mila01), but the current behavior gives the impression that sbatch is simply not available. Printing the stderr makes it more obvious that sbatch should be available but that something is broken behind sacctmgr. Nevertheless, it only appears when using the -vv option.
  Commit f734fb3
- Add automatic script for cluster verification
  Adding a script that runs automatic verifications to assert the validity of the current code. The verifications are not automatic unit tests: they automatically check that the process executed successfully, but the administrator still needs to verify manually, by reading the logs, that the requested resources were provided. Verifications can easily be combined, building on top of each other, from complex ones to simpler ones. Here is a list of all the verifications currently implemented for Slurm clusters:
  1. verify_simple_task (1 CPU)
  2. verify_simple_task_with_one_gpu (1 CPU 1 GPU)
  3. verify_simple_task_with_many_gpus (1 CPU X GPU)
  4. verify_many_task (X CPU)
  5. verify_many_task_with_many_cores (XY CPU)
  6. verify_many_task_with_one_gpu (X CPU X GPU)
  7. verify_many_task_with_many_gpus (X CPU Y GPU)
  8. verify_simple_task_with_autoresume_unneeded (1 CPU)
  9. verify_simple_task_with_autoresume_needed (1 CPU)
  10. verify_many_task_with_autoresume_needed (X CPU)
  Commit 4506887
- Commit 02845e0
- Commit 2d6e6fd
- Commit f967180
- Make get_launcher return None when no launcher
  My initial thought was that get_launcher should raise an error when no launcher is found on the system, since there cannot be any job launcher. I realized that this would break the --doNotLaunch option, which users may want to use on a system with no launcher just to create the files.
  Commit 8c655b4
- Commit 998f3ba
- Set account properly in verify_graham
  The tests were failing because the account was not specified.
  Commit a3c08c8
- Set account properly in verify_cedar
  The tests were failing because the account was not specified.
  Commit 9fb5ab6
Commits on Oct 17, 2017
- Fix walltime_to_seconds conversion
  A missing parenthesis was causing a bad conversion of "DD:HH:MM:SS" to seconds. The unit test was also missing the same parenthesis. I added a unit test to make sure such an error could not occur again.
  Commit 1dea0d8
- Commit cac2f08