GitHub - klmilam/columbia-ad-17

Setup

Set up Python environment

python3 -m venv venv
source ./venv/bin/activate
pip3 install -r requirements.txt

Set up GCP credentials

gcloud auth login
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=<PATH to GCS Key for gs://columbia-dl-storage-bucket>

Preprocessing

Set Constants

BUCKET=gs://[GCS Bucket for TFRecord output]
NOW="$(date +%Y%m%d%H%M%S)"
OUTPUT_DIR="${BUCKET}/output_data/${NOW}"
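
A mistyped bucket (for example, a missing gs:// prefix) only surfaces once the pipeline tries to write output. The sketch below builds the same timestamped path but rejects malformed bucket values first. The function name make_output_dir and the bucket name example-bucket are hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the repository): build a timestamped
# output path under a bucket, refusing values that are not gs:// URIs.
make_output_dir() {
  local bucket="$1"
  case "${bucket}" in
    gs://?*) ;;  # looks like gs://<something>, accept it
    *) echo "BUCKET must look like gs://<name>, got: '${bucket}'" >&2
       return 1 ;;
  esac
  local now
  now="$(date +%Y%m%d%H%M%S)"
  # Strip one trailing slash so the path never contains '//'.
  echo "${bucket%/}/output_data/${now}"
}

# Example with a placeholder bucket name:
OUTPUT_DIR="$(make_output_dir "gs://example-bucket")"
echo "${OUTPUT_DIR}"
```

The timestamp suffix keeps each preprocessing run in its own directory, matching the constants above.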

Run locally with Dataflow

When testing or debugging a Dataflow pipeline, it's easier to run the pipeline locally first. Due to the memory and computation requirements of the full dataset, the dataset is limited to just 100 files when running locally.

cd preprocessor
python3 -m run_preprocessing --output_dir "${OUTPUT_DIR}"
cd ..

Run on the Cloud with Dataflow

cd preprocessor
python3 -m run_preprocessing --cloud --output_dir "${OUTPUT_DIR}"
cd ..
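
Both invocations above change into preprocessor and back out again. If the first cd fails, the trailing cd .. leaves you one directory too high. One possible alternative is to run the command in a subshell, so the caller's working directory never changes. The helper name run_in_dir is hypothetical.

```shell
# Hypothetical helper: run a command inside a directory without changing
# the caller's working directory.
run_in_dir() {
  local dir="$1"
  shift
  # The parentheses start a subshell, so the cd is undone when it exits.
  # If the cd fails, the command is not run and a non-zero status is returned.
  ( cd "${dir}" && "$@" )
}

# Usage sketch (assumes the preprocessor directory and OUTPUT_DIR from above):
# run_in_dir preprocessor python3 -m run_preprocessing --output_dir "${OUTPUT_DIR}"
```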

Training

The model code can be run using either the ctpu tool or Cloud AI Platform with minimal code changes.

Training using the ctpu tool

Install ctpu tool

curl -O https://dl.google.com/cloud_tpu/ctpu/latest/darwin/ctpu && chmod a+x ctpu
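
The download URL above is the macOS (darwin) build. The sketch below picks a URL from the output of uname -s instead of hard-coding it. Only the darwin URL comes from this README; the linux path is an assumption that the same URL layout applies, and the function name ctpu_url_for is hypothetical. The block only prints the URL and does not download anything.

```shell
# Hypothetical helper: map a `uname -s` value to a ctpu download URL.
# Only the darwin URL appears in this README; the linux path is an assumption.
ctpu_url_for() {
  local base="https://dl.google.com/cloud_tpu/ctpu/latest"
  case "$1" in
    Darwin) echo "${base}/darwin/ctpu" ;;
    Linux)  echo "${base}/linux/ctpu" ;;
    *) echo "unsupported platform: $1" >&2
       return 1 ;;
  esac
}

# Print the URL for the current machine, if the platform is recognised.
CTPU_URL="$(ctpu_url_for "$(uname -s)")" && echo "${CTPU_URL}"
```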

Deploy a v3-8 TPU

You can use the ctpu tool to deploy a Google Compute Engine (GCE) TPU.

The following commands will open port 22 (allowing you to SSH) and create a TPU and CPU with the given name. If the TPU and/or CPU of the given name already exist, you'll just SSH into the existing ones.

gcloud compute firewall-rules create ctpu-ssh --allow=tcp:22 --source-ranges=0.0.0.0/0 \
    --network=default
./ctpu up --tpu-size=v3-8 --preemptible --zone=us-central1-a --name=kmilam-tpu

Clone the model code onto the VM

Since you're SSH'd into a VM, you need to clone your code onto the VM.

If you reuse the same name and do not delete your CPU between uses, your code will remain on the CPU.

git clone https://github.com/klmilam/columbia-ad-17.git
cd columbia-ad-17

Start training

python3 -m trainer.task

Set Constants

NOW="$(date +%Y%m%d%H%M%S)"
INPUT_DIR=${OUTPUT_DIR}
MODEL_DIR=${BUCKET}/model/${NOW}
STAGING_DIR=${BUCKET}/staging
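
The training commands read these variables, and a mistyped assignment (for example, spaces around the = sign, which the shell treats as running a command) leaves a variable empty without stopping the session. A minimal sketch of a pre-flight check follows; the function name require_vars is hypothetical.

```shell
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\"\${${name}:-}\""
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage sketch: stop before submitting a job if anything is empty.
# require_vars BUCKET INPUT_DIR MODEL_DIR STAGING_DIR || exit 1
```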

Training using Cloud AI Platform

Cloud AI Platform is a managed service for training machine learning models. This means that we do not deploy TPU/CPU resources; this is managed by the service.

gcloud ai-platform jobs submit training "tpu_training_$(date +%Y%m%d%H%M%S)" \
        --staging-bucket ${STAGING_DIR} \
        --config config.yaml \
        --module-name trainer.task \
        --package-path trainer/ \
        --region us-central1 \
        -- \
        --input-dir ${INPUT_DIR} \
        --model-dir ${MODEL_DIR}
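
The job name above is a fixed prefix plus a timestamp. As far as we know, AI Platform job names must start with a letter and contain only letters, digits, and underscores, so a hyphenated prefix would be rejected at submission time. The sketch below builds and checks a name locally under that assumption; the function name make_job_name is hypothetical.

```shell
# Hypothetical helper: build "<prefix>_<timestamp>" and check it against the
# assumed naming rule (starts with a letter; letters, digits, underscores only).
make_job_name() {
  local prefix="$1"
  local name
  name="${prefix}_$(date +%Y%m%d%H%M%S)"
  case "${name}" in
    [!A-Za-z]*|*[!A-Za-z0-9_]*)
      echo "invalid job name: ${name}" >&2
      return 1 ;;
  esac
  echo "${name}"
}

JOB_NAME="$(make_job_name tpu_training)"
echo "${JOB_NAME}"
```

Using a different prefix per job type (for example, one for hyperparameter tuning) makes jobs easier to tell apart in the job list.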

Train on v2-8 TPU

If v3-8 TPU resources are insufficient, try running the model on a v2-8 TPU. This will have the same number of shards as the v3-8 TPU, so no code changes (such as changing hyperparameters) are necessary.

gcloud ai-platform jobs submit training "tpu_training_$(date +%Y%m%d%H%M%S)" \
        --staging-bucket ${STAGING_DIR} \
        --runtime-version 1.14 \
        --python-version 3.5 \
        --scale-tier BASIC_TPU \
        --module-name trainer.task \
        --package-path trainer/ \
        --region us-central1 \
        -- \
        --input-dir ${INPUT_DIR} \
        --model-dir ${MODEL_DIR}

Hyperparameter Tuning

Cloud AI Platform offers built-in support for hyperparameter tuning.

We'll use a v2-8 TPU for hyperparameter tuning, since each concurrent hptuning trial needs its own TPU. Ideally, we would run more than 2 trials in parallel. However, we only have quota for 16 TPU v2 cores, so we can run at most 2 concurrent trials (each on a v2-8 TPU).

gcloud ai-platform jobs submit training "tpu_training_$(date +%Y%m%d%H%M%S)" \
        --staging-bucket ${STAGING_DIR} \
        --config hptuning.yaml \
        --runtime-version 1.14 \
        --python-version 3.5 \
        --scale-tier BASIC_TPU \
        --module-name trainer.task \
        --package-path trainer/ \
        --region us-central1 \
        -- \
        --input-dir ${INPUT_DIR} \
        --model-dir ${MODEL_DIR}
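
The concurrency limit described above is the TPU quota divided by the TPU size of one trial. A minimal sketch of that arithmetic, assuming the quota of 16 is counted in TPU v2 cores and that one v2-8 uses 8 of them (the variable names are illustrative):

```shell
# Assumed numbers, taken from the paragraph above: a quota of 16 TPU v2 cores
# and 8 cores for the v2-8 that each trial runs on.
QUOTA_CORES=16
CORES_PER_TRIAL=8

# Integer division: how many trials fit in the quota at the same time.
MAX_PARALLEL_TRIALS=$(( QUOTA_CORES / CORES_PER_TRIAL ))
echo "${MAX_PARALLEL_TRIALS}"   # prints 2
```

If the quota changes, the same division gives the new upper bound on parallel trials.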

Note: This is not an officially supported Google product
