
Add configuration file for Graham (#169) #172

Open
wants to merge 36 commits into
base: master
from

Commits on Jul 31, 2017

  1. bd86e21

Commits on Aug 8, 2017

  1. New test for priority

    aalitaiga committed Aug 8, 2017
    1168e3e
  2. Added gres + memory tests

    aalitaiga committed Aug 8, 2017
    243b191

Commits on Aug 9, 2017

  1. Refactored tests

    aalitaiga committed Aug 9, 2017
    0a16243

Commits on Aug 11, 2017

  1. small update

    aalitaiga committed Aug 11, 2017
    6f31489

Commits on Sep 1, 2017

  1. 3b0dd0b
  2. Fixed naccelerators issue

    aalitaiga committed Sep 1, 2017
    e8e6ec0

Commits on Sep 19, 2017

  1. 04f0ff1

Commits on Sep 26, 2017

  1. d7d0300

Commits on Oct 6, 2017

  1. Updated tests

    aalitaiga committed Oct 6, 2017
    c7bb250

Commits on Oct 10, 2017

  1. Updated tests using mock

    aalitaiga committed Oct 10, 2017
    a20405b

Commits on Oct 16, 2017

  1. 5bde1c6
  2. Small changes in TestSlurmQueue

    aalitaiga authored and bouthilx committed Oct 16, 2017
    0ff4776
  3. Fix add_sbatch_option bug

    Why:
    
    For each option, add_sbatch_option would add the option in both forms,
    --[OPTION_NAME] and [OPTION_NAME].
    bouthilx committed Oct 16, 2017
    d1ad338
  4. Refactor SlurmJobGenerator

    It will need many conversions, not only on resources, so better make it
    clean.
    bouthilx committed Oct 16, 2017
    899167f
  5. Remove queue name for Slurm clusters

    Slurm has no queues, so PBS option -q is invalid and non-convertible.
    bouthilx committed Oct 16, 2017
    581f835
  6. Replace PBS_JOBID with SLURM_JOB_ID

    $PBS_JOBID was used to set the stdout/err of the job as well as in the
    commands. Replace them with $SLURM_JOB_ID.
    
    Also, workers were accessing os.environ[PBS_JOBID], so we added a fallback
    fetch of SLURM_JOB_ID in case PBS_JOBID is not defined in os.environ.
    bouthilx committed Oct 16, 2017
    1d74e1a
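
A minimal sketch of the job id fallback described in this commit, not the code from the PR. The helper name `get_job_id` and its `environ` parameter are assumptions made for illustration; only the two environment variable names come from the commit message.

```python
import os


def get_job_id(environ=os.environ):
    """Return the scheduler job id, preferring PBS and falling back to Slurm.

    Hypothetical helper: the name and signature are illustrative only.
    """
    job_id = environ.get("PBS_JOBID")
    if job_id is None:
        # Under Slurm, PBS_JOBID is not set; the id lives in SLURM_JOB_ID.
        job_id = environ.get("SLURM_JOB_ID")
    return job_id
```

Using `environ.get` rather than indexing avoids a `KeyError` when neither variable is defined; the function then returns `None`.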
  7. Add PBS_FILENAME definition to pbs.prolog

    Slurm cannot be passed environment variables defined locally on
    command-line like PBS_FILENAME is. To bypass this, we add a definition
    in the prolog, making PBS_FILENAME available to all commands and epilog.
    
    NOTE: We leave PBS_FILENAME definition in command-line too such that any
    user using $PBS_FILENAME inside a custom pbsFlag can still do so.
    bouthilx committed Oct 16, 2017
    f00e877
  8. Fix env var export option for Slurm

    The PBS option -V is not converted properly to SBATCH --export ALL.
    We remove it and replace it with --export=ALL in the sbatch options.
    bouthilx committed Oct 16, 2017
    6b2d530
  9. Adapt PBS_WALLTIME for slurm

    Slurm does not set an environment variable equivalent to
    PBS_WALLTIME. To avoid confusion, all variables PBS_WALLTIME are renamed
    to SBATCH_TIMELIMIT (the environment variable one would use to set --time
    with sbatch). As SBATCH_TIMELIMIT is not set automatically, we add it to
    the prolog to make it available to all commands and epilog.
    
    NOTE: PBS_WALLTIME is set in seconds, but we only have HH:MM:SS-like strings
    at the time of building the PBS file. We needed to add a
    walltime_to_seconds helper function to convert HH:MM:SS like strings
    into seconds, so that SBATCH_TIMELIMIT is set with seconds like
    PBS_WALLTIME.
    bouthilx committed Oct 16, 2017
    21df3dd
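
The conversion described in this commit could look roughly like the sketch below. It is not the helper from the PR: only the name `walltime_to_seconds` comes from the commit message, while the accepted formats (up to a leading day field) and the error handling are assumptions.

```python
def walltime_to_seconds(walltime):
    """Convert a colon-separated walltime string into a number of seconds.

    Assumed formats: "SS", "MM:SS", "HH:MM:SS" and "DD:HH:MM:SS".
    """
    # Seconds per unit, listed from the rightmost field (seconds) leftwards.
    unit_seconds = (1, 60, 60 * 60, 24 * 60 * 60)
    fields = [int(field) for field in walltime.split(":")]
    if len(fields) > len(unit_seconds):
        raise ValueError("Unsupported walltime format: %r" % walltime)
    # Pair each field with its unit, starting from the seconds field.
    return sum(value * unit
               for value, unit in zip(reversed(fields), unit_seconds))
```

For example, `walltime_to_seconds("01:30:00")` returns 5400. Multiplying each field by a precomputed unit avoids nested arithmetic, where a misplaced parenthesis can silently produce a wrong total.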
  10. ea1d5b3
  11. Make get_launcher more flexible

    It is possible to query the system to see if some commands are available
    using distutils.spawn.find_executable(command_name).
    
    Clusters where more than one launcher is available will still get
    launchers selected based on string matching. For instance,
    get_launcher("helios") would always return msub no matter what is
    available on the system.
    bouthilx committed Oct 16, 2017
    adb8cba
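
A rough sketch of this selection logic, under stated assumptions. It uses `shutil.which` in place of `distutils.spawn.find_executable`, since `distutils` was removed in Python 3.12. The `which` parameter, the probing order, and the default argument are illustrative choices, not the PR's actual signature; returning `None` when nothing is found follows the behaviour adopted in a later commit of this PR.

```python
import shutil


def get_launcher(cluster_name=None, which=shutil.which):
    """Return the name of the job launcher to use, or None if none is found."""
    # Clusters with several launchers installed keep a fixed choice
    # based on string matching against their name.
    if cluster_name == "helios":
        return "msub"
    # Otherwise, probe the system for known launchers.
    for command in ("sbatch", "qsub"):  # probing order is an assumption
        if which(command) is not None:
            return command
    # No launcher on this system (still usable for generating files only).
    return None
```

Passing the lookup function as a parameter keeps the sketch testable on machines where neither `sbatch` nor `qsub` is installed.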
  12. Add verbosity to smart-dispatch

    It is difficult to debug resuming because important processes take place
    automatically in the pbs script built by SmartDispatch.
    
    We add a verbosity option to the smart-dispatch script and add debugging prints in the epilog.
    bouthilx committed Oct 16, 2017
    f3661ba
  13. 972a1ab
  14. Add support for SlurmJobGenerator

    JobGenerators are selected by job_generator_factory based on the
    cluster's name. We use a more flexible, duck typing approach for Slurm
    clusters. If cluster name is not known, or not any of the if-case
    clauses in the factory, then we look at which launchers are available in
    the system. If it is sbatch, then a SlurmJobGenerator is built,
    a JobGenerator otherwise.
    bouthilx committed Oct 16, 2017
    29973b0
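
The fallback described in this commit can be sketched as follows. The two classes are empty stand-ins, and the `KNOWN_CLUSTERS` table and the factory's signature are hypothetical; only the names `job_generator_factory`, `JobGenerator` and `SlurmJobGenerator` come from the commit message.

```python
class JobGenerator(object):
    """Stand-in for the default (PBS) job generator."""


class SlurmJobGenerator(JobGenerator):
    """Stand-in for the Slurm-specific job generator."""


# Hypothetical table of clusters whose generator is selected by name.
KNOWN_CLUSTERS = {}


def job_generator_factory(cluster_name, launcher, *args, **kwargs):
    """Pick a generator by cluster name, else by the available launcher."""
    generator_class = KNOWN_CLUSTERS.get(cluster_name)
    if generator_class is None:
        # Unknown cluster: decide from the launcher found on the system.
        if launcher == "sbatch":
            generator_class = SlurmJobGenerator
        else:
            generator_class = JobGenerator
    return generator_class(*args, **kwargs)
```

With this shape, a new Slurm cluster needs no entry of its own: finding `sbatch` on the system is enough to select `SlurmJobGenerator`.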
  15. Print stderr when both qsub and sacctmgr fail

    The command `sacctmgr` fails on some computers (mila01 namely), but the
    current behavior gives the impression sbatch is simply not available.
    
    Printing the stderr makes it more obvious that sbatch should be
    available, but that something is broken behind sacctmgr. However, it
    only appears when using the -vv option.
    bouthilx committed Oct 16, 2017
    f734fb3
  16. Add automatic script for cluster verification

    Adding a script that runs automatic verifications to assert the validity
    of the current code.
    
    The verifications are not automatic unit tests: the script automatically
    checks that the process executed successfully, but the administrator
    still needs to verify manually, by reading the logs, that the requested
    resources were provided.
    
    Verifications can easily be combined, building on top of each other,
    from complex ones to simpler ones.
    
    Here is a list of all the verifications currently implemented for slurm
    clusters:
    
     1. very_simple_task                              (1  CPU)
     2. verify_simple_task_with_one_gpu               (1  CPU 1 GPU)
     3. verify_simple_task_with_many_gpus             (1  CPU X GPU)
     4. verify_many_task                              (X  CPU)
     5. verify_many_task_with_many_cores              (XY CPU)
     6. verify_many_task_with_one_gpu                 (X CPU X GPU)
     7. verify_many_task_with_many_gpus               (X CPU Y GPU)
     8. verify_simple_task_with_autoresume_unneeded   (1 CPU)
     9. verify_simple_task_with_autoresume_needed     (1 CPU)
    10. verify_many_task_with_autoresume_needed       (X CPU)
    bouthilx committed Oct 16, 2017
    4506887
  17. 02845e0
  18. 2d6e6fd
  19. f967180
  20. Make get_launcher return None when no launcher

    My initial thought was that get_launcher should raise an error when no
    launcher is found on the system since there cannot be any job launcher.
    I realized that this would break the --doNotLaunch option that users may
    want to use on systems with no launcher, just to create the files.
    bouthilx committed Oct 16, 2017
    8c655b4
  21. Updated README

    aalitaiga authored and bouthilx committed Oct 16, 2017
    998f3ba
  22. Set account properly in verify_graham

    The tests were failing because the account was not specified.
    bouthilx committed Oct 16, 2017
    a3c08c8
  23. Set account properly in verify_cedar

    The tests were failing because the account was not specified.
    bouthilx committed Oct 16, 2017
    9fb5ab6

Commits on Oct 17, 2017

  1. Fix walltime_to_seconds conversion

    There was a missing parenthesis which was causing a bad conversion of
    "DD:HH:MM:SS" to seconds.
    
    The unit-test was also missing the same parenthesis. I added a unit-test
    to make sure such an error could not occur again.
    bouthilx committed Oct 17, 2017
    1dea0d8
  2. cac2f08