python3 -m venv venv
source ./venv/bin/activate
pip3 install -r requirements.txt
gcloud auth login
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=<PATH to GCS Key for gs://columbia-dl-storage-bucket>
BUCKET=gs://[GCS Bucket for TFRecord output]
NOW="$(date +%Y%m%d%H%M%S)"
OUTPUT_DIR="${BUCKET}/output_data/${NOW}"
When testing or debugging a Dataflow pipeline, it's easier to run the pipeline locally first. Due to the memory and computation requirements of the full dataset, the dataset is limited to just 100 files when running locally.
cd preprocessor
python3 -m run_preprocessing --output_dir "${OUTPUT_DIR}"
cd ..
cd preprocessor
python3 -m run_preprocessing --cloud --output_dir "${OUTPUT_DIR}"
cd ..
The model code be run using either the ctpu tool or Cloud AI Platform with minimal code changes.
curl -O https://dl.google.com/cloud_tpu/ctpu/latest/darwin/ctpu && chmod a+x ctpu
You can use the ctpu tool to deploy a Google Compute Engine (GCE) TPU.
The following commands will open port 22 (allowing you to SSH) and create a TPU and CPU with the given name
. If the TPU and/or CPU of the given name
already exist, you'll just SSH into the existing ones.
gcloud compute firewall-rules create ctpu-ssh --allow=tcp:22 --source-ranges=0.0.0.0/0 \
--network=default
./ctpu up --tpu-size=v3-8 --preemptible --zone=us-central1-a --name=kmilam-tpu
Since you're SSH'd into a VM, you need to clone your code onto the VM.
If you reuse the same name
and do not delete your CPU between uses, your code will remain on the CPU.
git clone https://github.com/klmilam/columbia-ad-17.git
cd columbia-ad-17
python3 -m trainer.task
NOW="$(date +%Y%m%d%H%M%S)"
INPUT_DIR=${OUTPUT_DIR}
MODEL_DIR=${BUCKET}/model/${NOW}
STAGING_DIR = ${BUCKET}/staging
Cloud AI Platform is a managed service for training machine learning models. This means that we do not deploy TPU/CPU resources; this is managed by the service.
gcloud ai-platform jobs submit training "tpu_training_$(date +%Y%m%d%H%M%S)" \
--staging-bucket ${STAGING_DIR} \
--config config.yaml \
--module-name trainer.task \
--package-path trainer/ \
--region us-central1 \
--input-dir ${INPUT_DIR} \
--model-dir ${MODEL_DIR}
If v3-8 TPU resources are insufficient, try running the model on a v2-8 TPU. This will have the same number of shards as the v3-8 TPU, so no code changes (i.e. changing hyperparameters) are necessary.
gcloud ai-platform jobs submit training "tpu_training_$(date +%Y%m%d%H%M%S)" \
--staging-bucket ${STAGING_DIR} \
--runtime-version 1.14 \
--python-version 3.5 \
--scale-tier BASIC_TPU \
--module-name trainer.task \
--package-path trainer/ \
--region us-central1 \
--input-dir ${INPUT_DIR} \
--model-dir ${MODEL_DIR}
Cloud AI Platform offers built-in support for hyperparameter tuning.
We'll use a v2-8 TPU for hyperparameter tuning, since we'll need multiple TPUs for each hptuning trial. Ideally, we would run more than 2 trails in parallel. However, we only have quota for 16 TPU V2s, so we can only run 2 concurrent trials (each on a v2-8 TPU).
gcloud ai-platform jobs submit training "tpu_training_$(date +%Y%m%d%H%M%S)" \
--staging-bucket ${STAGING_DIR} \
--config hptuning.yaml \
--runtime-version 1.14 \
--python-version 3.5 \
--scale-tier BASIC_TPU \
--module-name trainer.task \
--package-path trainer/ \
--region us-central1 \
--input-dir ${INPUT_DIR} \
--model-dir ${MODEL_DIR}
Note: This is not an officially supported Google product