An example of running a Ray model on Grid.ai. The examples show how to:
- Get started with Development Setup
- Unit test by running the experiment locally
- Run on Grid.ai Cloud with zero code modification
- Advanced Dockerfile usage on Grid.ai
- Use Grid.ai when the model is not on GitHub
- Troubleshooting Tips
- Set up the development environment
# Grid.ai minimum is python=3.8
conda create --name ray python=3.8
conda activate ray
# Python modules required
cat >requirements.txt <<EOF
ray
ray[tune]
ray[default]
pandas
tabulate
tensorboardX
EOF
# Install Python modules for the experiment
pip install --ignore-requires-python -v -r requirements.txt
# Install Python modules for the Grid
pip install lightning-grid --upgrade
- Unit test by running the experiment locally
python ray-tune-quickstart.py
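Before the local run, it can help to confirm that the active interpreter meets the python=3.8 minimum noted in the setup comments. The following is a minimal sketch, not part of the Grid.ai tooling: the helper name `version_at_least` is hypothetical, and the check assumes `python` resolves to the interpreter of the activated conda environment.

```shell
# Hypothetical helper: succeed when version $1 (e.g. 3.8.10) is at least
# MAJOR.MINOR version $2 (e.g. 3.8). Only major and minor are compared.
version_at_least() {
    have_major=${1%%.*}
    rest=${1#*.}
    have_minor=${rest%%.*}
    want_major=${2%%.*}
    rest=${2#*.}
    want_minor=${rest%%.*}
    # A higher major version always passes.
    [ "$have_major" -gt "$want_major" ] && return 0
    # Same major version: the minor version decides.
    [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Only run the check when a python executable is available.
if command -v python >/dev/null 2>&1; then
    have=$(python -c 'import platform; print(platform.python_version())')
    if version_at_least "$have" 3.8; then
        echo "Python $have is new enough"
    else
        echo "Python $have is older than 3.8" >&2
    fi
fi
```

The helper sets plain shell variables rather than locals so that it stays portable to POSIX `sh`.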
- Log in to Grid.ai
grid login
grid run ray-tune-quickstart.py
To run a model hosted on GitHub in a customized container, pass the --dockerfile gridray.dockerfile flag.
- Run with a manually specified Dockerfile:
grid run --dockerfile gridray.dockerfile --name ray-dk-$(date '+%m%d-%H%M%S') ray-tune-quickstart.py
- Use a spot instance and override the Run name with a timestamped name of the form ray-sp-dk-MMDD-HHMMSS, which makes the Run easier to find later:
grid run --dockerfile gridray.dockerfile --use_spot --name ray-sp-dk-$(date '+%m%d-%H%M%S') ray-tune-quickstart.py
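The `--name` values above embed a `date '+%m%d-%H%M%S'` timestamp directly in the command. As a sketch, the name can instead be built once and kept in a variable, so the same value can be passed to `grid run` and reused later with `grid status`. The helper name `make_run_name` and the variable `RUN_NAME` are hypothetical; the `grid` commands are left as comments so the snippet runs without the CLI installed.

```shell
# Hypothetical helper: append an MMDD-HHMMSS timestamp to the prefix in $1.
make_run_name() {
    printf '%s-%s\n' "$1" "$(date '+%m%d-%H%M%S')"
}

# Build the name once so later commands refer to exactly the same Run.
RUN_NAME=$(make_run_name ray-sp-dk)
echo "$RUN_NAME"

# Assumed usage, mirroring the commands in this document:
# grid run --dockerfile gridray.dockerfile --use_spot --name "$RUN_NAME" ray-tune-quickstart.py
# grid status "$RUN_NAME"
```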
Use --localdir when the model is not on GitHub. Runs started with --localdir cannot use the Grid.ai cloning feature.
- Let Grid.ai build the container
grid run --name ray-local-$(date '+%m%d-%H%M%S') --localdir ray-tune-quickstart.py
- Use the Dockerfile as the container specification
grid run --dockerfile gridray.dockerfile --use_spot --name ray-sp-dk-lc-$(date '+%m%d-%H%M%S') --localdir ray-tune-quickstart.py
- Review Runs with grid history
grid history | grep -e Run -e ray -e $(date '+%Y-%m-%d')
┃ Run ┃ Created At ┃ Experiments ┃ Failed ┃ Stopped ┃ Completed ┃
│ ray-sp-dk-lc-0720-105956 │ 2021-07-20 15:00:09+0000 │ 1 │ 0 │ 0 │ 1 │
│ ray-local-0720-105916 │ 2021-07-20 14:59:30+0000 │ 1 │ 0 │ 0 │ 1 │
│ ray-sp-dk-0720-105713 │ 2021-07-20 14:57:25+0000 │ 1 │ 0 │ 0 │ 1 │
│ ray-dk-0720-105640 │ 2021-07-20 14:56:53+0000 │ 1 │ 0 │ 0 │ 1 │
│ fervent-tamarin-146 │ 2021-07-20 14:55:39+0000 │ 1 │ 0 │ 0 │ 1 │
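The repeated `-e` flags in the `grep` command above are alternatives: a row is kept when it matches any one of the patterns, which is why the header row, the ray-named Runs, and the default-named Run created on the same day all appear. A small sketch of that behavior on canned input follows. The helper name `filter_history` is hypothetical, and the last sample row is made up to show a row that gets dropped.

```shell
# Hypothetical helper: keep the header row ("Run"), Ray runs ("ray"),
# and any row created on the day given in $1.
filter_history() {
    grep -e Run -e ray -e "$1"
}

# Canned rows in the shape of the table above; "older-run-12" is invented
# purely for illustration and matches none of the three patterns.
printf '%s\n' \
    '┃ Run ┃ Created At ┃' \
    '│ ray-dk-0720-105640 │ 2021-07-20 14:56:53+0000 │' \
    '│ fervent-tamarin-146 │ 2021-07-20 14:55:39+0000 │' \
    '│ older-run-12 │ 2021-07-19 09:00:00+0000 │' |
    filter_history 2021-07-20

# Assumed usage against the real CLI:
# grid history | filter_history "$(date '+%Y-%m-%d')"
```

Because `grep` is case-sensitive by default, the `Run` pattern matches the header but not a lowercase `run` inside a Run name.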
- Review each Run with grid status
for run in $(grid history | grep -e Run -e ray -e $(date '+%Y-%m-%d') | awk -F'│' '{print $2}'); do
  echo "$run"
  grid status "$run"
done
ray-sp-dk-lc-0720-105956
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Experiment ┃ Command ┃ Status ┃ Duration ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ ray-sp-dk-lc-0720-105956-exp0 │ ray-tune-quickstart.py] │ succeeded │ 0d-00:01:28 │
└───────────────────────────────┴─────────────────────────┴───────────┴─────────────┘
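The loop above takes the second `│`-delimited column of every matching row, so it also visits Runs that were created today but are not named `ray-…`. As a sketch, the Run names can be extracted and narrowed to the `ray-` prefix in a single `awk` step. The helper name `ray_runs` is hypothetical, and the canned rows are copied from the `grid history` output shown earlier.

```shell
# Hypothetical helper: read a grid history table on stdin and print the
# names of Runs that start with "ray-", one per line, without padding.
ray_runs() {
    awk -F'│' '
        NF > 2 {
            name = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim column padding
            if (name ~ /^ray-/) print name
        }'
}

# Demonstration on rows copied from the history table in this document.
printf '%s\n' \
    '┃ Run ┃ Created At ┃ Experiments ┃ Failed ┃ Stopped ┃ Completed ┃' \
    '│ ray-sp-dk-lc-0720-105956 │ 2021-07-20 15:00:09+0000 │ 1 │ 0 │ 0 │ 1 │' \
    '│ fervent-tamarin-146 │ 2021-07-20 14:55:39+0000 │ 1 │ 0 │ 0 │ 1 │' |
    ray_runs

# Assumed usage against the real CLI:
# for run in $(grid history | ray_runs); do grid status "$run"; done
```

The header row is skipped because it is drawn with the heavier `┃` character and therefore contains no `│` separators.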