In this tutorial, we show how to train a PyTorch RNN model on the MNIST dataset with Bacalhau. PyTorch is a deep learning framework developed by Facebook AI Research. It offers beginner-friendly debugging tools and a high level of customization for advanced users, and it is used by researchers and practitioners at companies such as Facebook and Tesla. Applications include computer vision, natural language processing, cryptography, and more.
bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait-timeout-secs 3600 \
--wait \
--id-only \
pytorch/pytorch \
-w /outputs \
-i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
-i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model
To get started, you need to install the Bacalhau client; see the Bacalhau installation documentation for more information.
To train our model locally, we will start by cloning the PyTorch examples repo:
git clone https://github.com/pytorch/examples
Install the following:
pip install --upgrade torch torchvision
Next, we run the command below to begin the training of the mnist_rnn model. We added the --save-model flag to save the model.
python ./examples/mnist_rnn/main.py --save-model
During training, the script downloads the MNIST dataset and saves it in the data folder.
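Before uploading, it can be useful to confirm that the download completed. The following is a minimal sketch, assuming torchvision has left uncompressed IDX files such as data/MNIST/raw/train-images-idx3-ubyte on disk (that path is an assumption about torchvision's layout, not something this tutorial specifies). It reads only the IDX header, which stores a data-type code and the array dimensions as big-endian integers. The helper names parse_idx_header and read_idx_header are hypothetical.

```python
import struct


def parse_idx_header(data):
    """Return (dtype_code, shape) parsed from the first bytes of an IDX file."""
    # Header layout: two zero bytes, one data-type byte, one dimension-count byte.
    zero, dtype_code, ndim = struct.unpack_from(">HBB", data, 0)
    if zero != 0:
        raise ValueError("data does not look like an IDX file")
    # Each dimension follows as a 4-byte big-endian unsigned integer.
    shape = struct.unpack_from(f">{ndim}I", data, 4)
    return dtype_code, shape


def read_idx_header(path):
    """Read just enough of the file at `path` to parse its IDX header."""
    with open(path, "rb") as f:
        # 1024 bytes covers the largest possible header (4 + 255 * 4).
        return parse_idx_header(f.read(1024))
```

For the MNIST training images, the expected header is type code 8 (unsigned byte) with shape (60000, 28, 28).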
Now that we have downloaded our dataset, the next step is to upload it to IPFS. The simplest way to upload the data to IPFS is to use a third-party service to "pin" data to the IPFS network, to ensure that the data exists and is available. To do this you need an account with a pinning service like Pinata or NFT.Storage. Once registered, you can use their UI, API, or SDKs to upload files.
Once you have uploaded your data, the last step is to copy its CID. The dataset we uploaded has the CID QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw.
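A copied CID is easy to truncate by accident. The sketch below checks only the shape of a version 0 CID: 46 base58 characters beginning with Qm. It does not confirm that the content exists on IPFS, and it deliberately rejects version 1 CIDs, which use a different encoding. The function name looks_like_cid_v0 is hypothetical.

```python
import re

# Base58 alphabet: digits and letters without 0, O, I and l.
_BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
_CID_V0 = re.compile(rf"Qm[{_BASE58}]{{44}}")


def looks_like_cid_v0(value):
    """Return True if `value` has the textual shape of a CIDv0."""
    return _CID_V0.fullmatch(value) is not None


# The CID used throughout this tutorial passes the shape check.
print(looks_like_cid_v0("QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw"))
```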
This job uses the official pytorch/pytorch image from Docker Hub, so there is no custom container to build or push. To submit a job, run the following Bacalhau command:
export JOB_ID=$(bacalhau docker run \
--gpu 1 \
--timeout 3600 \
--wait-timeout-secs 3600 \
--wait \
--id-only \
pytorch/pytorch \
-w /outputs \
-i ipfs://QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw:/data \
-i https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py \
-- python ../inputs/main.py --save-model)
- export JOB_ID=$( ... ): exports the job ID as an environment variable
- bacalhau docker run: call to Bacalhau
- The --gpu 1 flag is set to specify hardware requirements; a GPU is needed to run such a job
- pytorch/pytorch: uses the official PyTorch Docker image
- The -i ipfs://QmdeQjz1HQQd..... flag is used to mount the uploaded dataset
- The -i https://raw.githubusercontent.com/py.......... flag is used to mount our training script. We will use the URL to this PyTorch example
- -w /outputs: our working directory is /outputs. This is the folder where we will save the model, as it will automatically get uploaded to IPFS as outputs
- python ../inputs/main.py --save-model: the script downloaded from the URL gets mounted to the /inputs folder in the container
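The relative path in the last item follows from the two items before it: the working directory is /outputs and the script is mounted under /inputs. A small sketch of that resolution, using only path arithmetic (resolve_in_container is a hypothetical helper, and nothing here touches a real container):

```python
import posixpath


def resolve_in_container(workdir, relative_path):
    """Resolve a relative path against the container's working directory."""
    return posixpath.normpath(posixpath.join(workdir, relative_path))


# "../inputs/main.py" seen from /outputs points at the mounted script.
print(resolve_in_container("/outputs", "../inputs/main.py"))  # prints /inputs/main.py
```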
When a job is submitted, Bacalhau prints out the related job_id. We store that in an environment variable so that we can reuse it later on.
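The same submission can be driven from a script instead of a shell. This is a rough sketch, assuming the bacalhau client is on the PATH and that with --id-only the standard output contains nothing but the job ID. The function names are hypothetical, and the flags are copied from the command above without additions.

```python
import subprocess


def build_submit_command(dataset_cid, script_url, image="pytorch/pytorch"):
    """Assemble the argument list for the `bacalhau docker run` call shown above."""
    return [
        "bacalhau", "docker", "run",
        "--gpu", "1",
        "--timeout", "3600",
        "--wait-timeout-secs", "3600",
        "--wait",
        "--id-only",
        image,
        "-w", "/outputs",
        "-i", f"ipfs://{dataset_cid}:/data",
        "-i", script_url,
        "--", "python", "../inputs/main.py", "--save-model",
    ]


def submit(command):
    """Run the command and return its trimmed stdout (assumed to be the job ID)."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting issues; calling submit(build_submit_command(cid, url)) would play the role of the export JOB_ID=$( ... ) line.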
The same job can be presented in the declarative format. In this case, the description will look like this:
name: PyTorch RNN MNIST Training
type: batch
count: 1
tasks:
  - name: My main task
    Engine:
      type: docker
      params:
        Image: "pytorch/pytorch"
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - python ../inputs/main.py --save-model
    InputSources:
      - Source:
          Type: "ipfs"
          Params:
            CID: "QmdeQjz1HQQdT9wT2NHX86Le9X6X6ySGxp8dfRUKPtgziw"
        Target: /data
      - Source:
          Type: urlDownload
          Params:
            URL: https://raw.githubusercontent.com/pytorch/examples/main/mnist_rnn/main.py
        Target: /inputs
    Resources:
      GPU: "1"
The job description should be saved in .yaml format, e.g. torch.yaml, and then run with the command:
bacalhau job run torch.yaml
You can check the status of the job using bacalhau job list.
bacalhau job list --id-filter ${JOB_ID}
When it says Completed, that means the job is done, and we can get the results.
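Checking by hand works for a single job; for a script, the check can be repeated until the word Completed shows up. The sketch below is deliberately crude: it assumes the textual output of bacalhau job list --id-filter contains the state name, and it simply searches for that substring. The function names and the default interval are hypothetical choices.

```python
import subprocess
import time


def job_list_output(job_id):
    """Return the text printed by `bacalhau job list --id-filter <job_id>`."""
    result = subprocess.run(
        ["bacalhau", "job", "list", "--id-filter", job_id],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def wait_for_completed(job_id, fetch=job_list_output, timeout=3600,
                       interval=10, sleep=time.sleep):
    """Poll until the listing mentions Completed; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if "Completed" in fetch(job_id):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The fetch and sleep parameters exist so the loop can be exercised without a running Bacalhau network.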
You can find out more information about your job by using bacalhau job describe.
bacalhau job describe ${JOB_ID}
You can download your job results directly by using bacalhau job get. Alternatively, you can choose to create a directory to store your results. In the command below, we create a directory and download our job output into it.
rm -rf results && mkdir -p results
bacalhau job get $JOB_ID --output-dir results
After the download has finished, you can find the results in the results/outputs folder. To view them, run the following commands:
ls results/ # list the contents of the results directory
cat results/stdout # display the standard output captured from the job
ls results/outputs/ # list the saved model file
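The same check can be scripted. This is a minimal sketch, assuming the training script writes its model as a file ending in .pt inside the working directory (the suffix is an assumption about the upstream example, and find_model_files is a hypothetical helper):

```python
from pathlib import Path


def find_model_files(results_dir="results", suffix=".pt"):
    """Return the sorted model files found in <results_dir>/outputs."""
    outputs = Path(results_dir) / "outputs"
    return sorted(p for p in outputs.glob(f"*{suffix}") if p.is_file())


for model_path in find_model_files():
    # Report each candidate model file with its size in bytes.
    print(model_path, model_path.stat().st_size)
```

An empty result suggests the job did not save a model, in which case results/stdout is the first place to look.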