Skip to content

Latest commit

 

History

History
107 lines (88 loc) · 4.51 KB

README.md

File metadata and controls

107 lines (88 loc) · 4.51 KB

GPUPeasy

GPU based parameters sweeps made easy.

This is a simple script that I made that to help schedule parameter sweeps on one or more GPUs automatically. This is particularly useful when jobs have varying running time and manually scheduling them is a pain. Note that I only support CUDA_VISIBLE_DEVICES.

I don't currently support scheduling over multiple machines.

Runs on python3. Based on Flask mainly.

I don't care what the job you want to schedule are as long as they are valid shell commands. I just pass them onto subprocess to execute after specifying which GPU to use.

Installation

This project is not mature enough for a release, though I use this daily in my work-flow. If you are interested in using this, please clone this repository and run,

pip install -e .

From the project root to install the gpupeasy package. Preferably, activate a virtual environment if you are into those sorts of things.

Once you have installed the package, you need to run the scheduler server (daemon) by

python startserver.py

This will start the scheduler server on 0.0.0.0:8888 where it will await jobs to schedule. By default, the back-end server assumes you have 4 CUDA devices (GPUs). These defaults can be modified in startserver.py.

Jobs can be scheduled through the provided web gui.

python startgui.py

Will start the gui server on 0.0.0.0:4004. Adding jobs from there should be straightforward.

What are Jobs?

Jobs are pretty simple, they have a name (jobname), a file where they dump their outputs (outfile) and a shell command (python -u blah.py, ls -l). Each job execution will consume one gpu till it is finished and dump its stdout and stderr to outfile.

Bug Reports

  • Since the scheduler opens the output file before scheduling a job, all scheduled processes have a open file descriptor. This is bad, as these then are never closed and they overshoot the system open file descriptor limit.

Features to Add

Important

  • Support for changing devices. Adding/Subtracting GPU without killing the server. Ask the user to delete GPU objects and add new objects. So that I don't have to worry about killing/migrating processes. We can just kill existing processes or let them finish and not schedule any more on the deleted gpu.
  • Jobs read python file from disk on when the run starts. If the file changes between the enqueue and service, we load the new file. This could be a problem if the new file has bugs. Copy to a cash directory and run from there?

Not so critical

  • Depricate .peasy files in favour of .csv files. Since we now have a way of parsing commands from csv file, we can support better use-cases. There is no need to support .peasy syntax if the scheduler implements the csv to .peasy conversion (see current gridgen).

  • Kill a job

  • Pause scheduler (very useful when all the scheduled jobs have an error)

  • Redo the schedule-job process such that it is amenable to network scheduling in the future.

  • Support for checkpointing.

  • Support for prioritizing jobs. This can be done by implementing priority queues. Replace the current list queue with priority queue that reduces to a regular queue with when priority = 30 or something.

  • Implement a screen where you can view jobinfo and logs.

  • Need to fix names and conventions throughout.

  • Fix STDOUT overwriting STDERR error.

  • Implement better scopping.

  • !MAJOR TODO: Should you move to runc based setup?

  • !IMPORTANT : Move to proj-name/job-name/job-id/ like job scoping to better handle testing runs etc. (We do proj-name/job-name-id/)

  • Allow modifying the font end queue through gui -- currently it lists all and every node that was ever started.

  • Usecase 1: One of the biggest reason you cannot keep GPU peasy persistent is that there is no way to kill rouge jobs (crated by say, deleting the job files or those that are in an infinite loop). This requires us to kill gpu-peasy server.

    Use case 2: Also, sometimes you accidentally kill certain jobs (because you accidentally changes the underlying python file without realizaing a job was using it). There could be other reasons why a job might have failed -- errors in the code while debugging and it makes sense to implement a rerun feature.

    1. On the UI, display PID for each job.
    2. Implement a heart-beat check that removes jobs from queues if an external kill event occurs (say I do kill -9 PID).