
Commit

(1) Update binary version to 0.7.1. Re-run all case studies and updated information.

Note that previously there was a typo in the WGS case study -- the total time should have read 309m 53s instead of 209m 53s.
(2) Remove BIN_VERSION from the *_binaries.sh scripts because it was unused.
(3) Update a few instructions in the training case study.

PiperOrigin-RevId: 220115965
pichuan committed Nov 6, 2018
1 parent 2040913 commit aba5553
Showing 10 changed files with 52 additions and 72 deletions.
12 changes: 6 additions & 6 deletions docs/deepvariant-case-study.md
@@ -138,12 +138,12 @@ fewer cores for this step.
## Resources used by each step

Step | wall time
---------------------------------- | ------------------
`make_examples` | 113m 12s
`call_variants` | 176m 30s
`postprocess_variants` (no gVCF) | 20m 11s
`postprocess_variants` (with gVCF) | 51m 25s
total time (single machine) | 209m 53s - 341m 7s
---------------------------------- | -------------------
`make_examples` | 113m 19s
`call_variants` | 181m 40s
`postprocess_variants` (no gVCF) | 20m 40s
`postprocess_variants` (with gVCF) | 54m 49s
total time (single machine) | 315m 39s - 349m 48s
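The two ends of the total-time range appear to be the no-gVCF and with-gVCF sums of the step timings above; a quick sanity check of that reading (a sketch, using the new timings):

```
# no gVCF: make_examples + call_variants + postprocess_variants (no gVCF)
echo $(( (113*60+19) + (181*60+40) + (20*60+40) ))  # 18939 s = 315m 39s
# with gVCF: make_examples + call_variants + postprocess_variants (with gVCF)
echo $(( (113*60+19) + (181*60+40) + (54*60+49) ))  # 20988 s = 349m 48s
```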

## Variant call quality

12 changes: 6 additions & 6 deletions docs/deepvariant-exome-case-study.md
@@ -91,12 +91,12 @@ More discussion can be found in the
## Resources used by each step

Step | wall time
---------------------------------- | ---------
`make_examples` | 13m 38s
`call_variants` | 1m 55s
`postprocess_variants` (no gVCF) | 0m 12s
`postprocess_variants` (with gVCF) | 1m 17s
total time (single machine) | ~17m
---------------------------------- | -----------------
`make_examples` | 13m 39s
`call_variants` | 2m 0s
`postprocess_variants` (no gVCF) | 0m 13s
`postprocess_variants` (with gVCF) | 1m 18s
total time (single machine) | 15m 52s - 17m 10s

## Variant call quality

2 changes: 1 addition & 1 deletion docs/deepvariant-quick-start.md
@@ -54,7 +54,7 @@ Before you start running, you need to have the following input files:
1. A model checkpoint for DeepVariant. We'll refer to this as `${MODEL}` below.

```bash
BIN_VERSION="0.7.0"
BIN_VERSION="0.7.1"
MODEL_VERSION="0.7.0"

MODEL_NAME="DeepVariant-inception_v3-${MODEL_VERSION}+data-wgs_standard"
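For context, `BIN_VERSION` pins which DeepVariant release the rest of the instructions fetch; for example, the training case study below uses it to pull the matching Docker image:

```
sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"
```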
92 changes: 37 additions & 55 deletions docs/deepvariant-tpu-training-case-study.md
@@ -39,7 +39,7 @@ YOUR_PROJECT=REPLACE_WITH_YOUR_PROJECT
OUTPUT_GCS_BUCKET=REPLACE_WITH_YOUR_GCS_BUCKET
BUCKET="gs://deepvariant"
BIN_VERSION="0.7.0"
BIN_VERSION="0.7.1"
MODEL_VERSION="0.7.0"
MODEL_BUCKET="${BUCKET}/models/DeepVariant/${MODEL_VERSION}/DeepVariant-inception_v3-${MODEL_VERSION}+data-wgs_standard"
@@ -142,7 +142,7 @@ sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"
) >"${LOG_DIR}/training_set.with_label.make_examples.log" 2>&1
```

This took 107m50.530s. We will want to shuffle this on Dataflow later, so I will
This took 107m8.521s. We will want to shuffle this on Dataflow later, so I will
copy it to GCS bucket first:

```
@@ -170,7 +170,7 @@ gsutil -m cp ${OUTPUT_DIR}/training_set.with_label.tfrecord-?????-of-00064.gz \
) >"${LOG_DIR}/validation_set.with_label.make_examples.log" 2>&1
```

This took: 8m10.566s.
This took: 8m49.066s.

Validation set is small here. We will just shuffle locally later, so no need to
copy to our GCS bucket.
@@ -193,7 +193,7 @@ copy to our GCS bucket.
) >"${LOG_DIR}/test_set.no_label.make_examples.log" 2>&1
```

This took: 2m17.151s.
This took: 2m14.576s.

We don't need to shuffle the test set. It will eventually be used in the final
evaluation with `hap.py` on the whole set.
@@ -226,36 +226,22 @@ Here is an example. You might or might not need to install everything below:

```
sudo apt -y install python-dev python-pip
pip install --upgrade pip
pip install --user --upgrade virtualenv
```

A virtual environment is a directory tree containing its own Python
distribution. To create a virtual environment, create a directory and run:
# This will make sure the pip command is bound to python2
python2 -m pip install --user --upgrade --force-reinstall pip
export PATH="$HOME/.local/bin:$PATH"
```
virtualenv ${HOME}/virtualenv_beam
```

A virtual environment needs to be activated for each shell that is to use it.
Activating it sets some environment variables that point to the virtual
environment's directories.

To activate a virtual environment in Bash, run:
Install Beam:

```
. ${HOME}/virtualenv_beam/bin/activate
```

Once this is activated, install Beam:

```
pip install apache-beam
pip install --user apache-beam
```
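With pip now bound to python2 and Beam installed into the user site (no virtualenv), a quick way to confirm the install is visible to the interpreter (a sketch; assumes `apache_beam` exposes `__version__`):

```
python2 -c "import apache_beam; print(apache_beam.__version__)"
```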

Get the code that shuffles:

```
mkdir -p ${SHUFFLE_SCRIPT_DIR}
wget https://raw.githubusercontent.com/google/deepvariant/r0.7/tools/shuffle_tfrecords_beam.py -O ${SHUFFLE_SCRIPT_DIR}/shuffle_tfrecords_beam.py
```

@@ -280,7 +266,7 @@ Output is in

Data config file is in `${OUTPUT_DIR}/validation_set.dataset_config.pbtxt`.

This took 11m15.090s.
This took 10m40.558s.

The training set is too large to run with DirectRunner on this instance, so we
use the DataflowRunner. Before that, please make sure you enable
@@ -290,7 +276,7 @@ http://console.cloud.google.com/flows/enableapi?apiid=dataflow.
Then, install Dataflow:

```
pip install google-cloud-dataflow
pip install --user google-cloud-dataflow
```

Shuffle using Dataflow.
@@ -320,13 +306,7 @@ In order to have the best performance, you might need extra resources such as
machines or IPs within a region. That is beyond the scope of this case study.

My run took about 38m3.435s on Dataflow.

After this is done, deactivate the virtualenv:

```
deactivate
```
My run took about 40m28.401s on Dataflow.

The output path can be found in the dataset_config file by:
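For example, something like this works (a sketch; the exact config file name and location are assumptions and depend on how the shuffle job was invoked):

```
# If the config was written locally:
cat "${OUTPUT_DIR}/training_set.dataset_config.pbtxt"
# If the Dataflow job wrote it to GCS:
gsutil cat "${OUTPUT_GCS_BUCKET}/training_set.dataset_config.pbtxt"
```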

@@ -345,11 +325,12 @@ In the output, the `tfrecord_path` should be valid paths in gs://.
name: "HG001"
tfrecord_path: "YOUR_GCS_BUCKET/training_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 3890596
num_examples: 3857898
```

In my run, it wrote to 341 shards:
`${OUTPUT_BUCKET}/training_set.with_label.shuffled-?????-of-00341.tfrecord.gz`
`${OUTPUT_BUCKET}/training_set.with_label.shuffled-?????-of-00364.tfrecord.gz`

### Start a Cloud TPU

@@ -366,17 +347,18 @@ Here is what I did to start a TPU.
First, check all existing TPUs by running this command:

```
gcloud beta compute tpus list --zone=us-central1-f
gcloud compute tpus list --zone=us-central1-f
```

In my case, I don't see any existing TPUs.

Then, I ran the following command to start a TPU:

```
time gcloud beta compute tpus create ${USER}-demo-tpu \
--range=10.240.2.0/29 \
--version=1.9 \
time gcloud compute tpus create ${USER}-demo-tpu \
--network=default \
--range=10.240.1.0/29 \
--version=1.11 \
--zone=us-central1-f
```

@@ -390,21 +372,21 @@ This command took about 5min to finish.
After the TPU is created, we can query it by:

```
gcloud beta compute tpus list --zone=us-central1-f
gcloud compute tpus list --zone=us-central1-f
```

In my case, I see:

```
NAME ZONE ACCELERATOR_TYPE NETWORK_ENDPOINTS NETWORK RANGE STATUS
pichuan-demo-tpu us-central1-f v2-8 10.240.2.2:8470 default 10.240.2.0/29 READY
NAME ZONE ACCELERATOR_TYPE NETWORK_ENDPOINTS NETWORK RANGE STATUS
pichuan-demo-tpu us-central1-f v2-8 10.240.1.2:8470 default 10.240.1.0/29 READY
```

In this example, I set up these variables:

```
export TPU_NAME="${USER}-demo-tpu"
export TPU_IP="10.240.2.2"
export TPU_IP="10.240.1.2"
```
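Rather than copying the IP out of the list output by hand, it can also be read from the TPU resource, e.g. (a sketch; the field path follows the Cloud TPU API and may need adjusting for your gcloud version):

```
export TPU_IP=$(gcloud compute tpus describe ${TPU_NAME} \
  --zone=us-central1-f \
  --format='value(networkEndpoints[0].ipAddress)')
```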

(One more reminder about
@@ -436,7 +418,7 @@ Pointers for common issues or things you can tune:

1. TPU might not have write access to GCS bucket:

https://cloud.google.com/tpu/docs/storage-buckets#giving_your_product_name_short_access_to_gcs_name_short
https://cloud.google.com/tpu/docs/storage-buckets#storage_access

1. Change `save_interval_secs` to save checkpoints more frequently:

@@ -467,9 +449,9 @@ sudo docker run \
--batch_size=512 > "${LOG_DIR}/eval.log" 2>&1 &
```

`model_eval` will watch the `${TRAINING_DIR}` and start evaluting when there are
newly saved checkpoints. It evaluates the checkpoints on the data specified in
`validation_set.dataset_config.pbtxt`, and saves `*metrics` file to the
`model_eval` will watch the `${TRAINING_DIR}` and start evaluating when there
are newly saved checkpoints. It evaluates the checkpoints on the data specified
in `validation_set.dataset_config.pbtxt`, and saves `*metrics` file to the
directory. These files are used later to pick the best model based on how
accurate they are on the validation set.
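To see which checkpoints have been evaluated so far, you can list the metrics files as they appear, e.g. (a sketch, assuming `${TRAINING_DIR}` is a gs:// path, which Cloud TPU training requires):

```
gsutil ls "${TRAINING_DIR}" | grep metrics
```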

@@ -483,7 +465,7 @@ kill the process after training is no longer producing more checkpoints. And,
command to delete the TPU:

```
gcloud beta compute tpus delete ${TPU_NAME} --zone us-central1-f
gcloud compute tpus delete ${TPU_NAME} --zone us-central1-f
```

### Use TensorBoard to visualize progress
@@ -510,7 +492,7 @@ tensorboard --logdir ${TRAINING_DIR} --port=8080
This gives a message like:

```
TensorBoard 1.9.0 at http://cs-6000-devshell-vm-ddb3cd66-9d0b-4e19-afcc-d4a19ba2ee06:8080 (Press CTRL+C to quit)
TensorBoard 1.11.0 at http://cs-6000-devshell-vm-ec39a769-4665-4f57-bdff-2c9192f44b7e:8080 (Press CTRL+C to quit)
```

But that link is not usable directly. I clicked on the “Web Preview” on the top
@@ -538,7 +520,7 @@ things. In my run, I took these screenshots after the run completed:
When you are done with training, make sure to clean up the TPU:

```
gcloud beta compute tpus delete ${TPU_NAME} --zone us-central1-f
gcloud compute tpus delete ${TPU_NAME} --zone us-central1-f
```

### Pick a model
@@ -562,11 +544,11 @@ python ${SHUFFLE_SCRIPT_DIR}/print_f1.py \
The top line I got was this:

```
43600 96769.0 0.998961331563
44200 96772.0 0.998945823601
```

This means the model checkpoint that performs the best on the validation set is
`${TRAINING_DIR}/model.ckpt-43600`. Based on this result, a few thoughts came
`${TRAINING_DIR}/model.ckpt-44200`. Based on this result, a few thoughts came
to mind:

1. Training more steps didn't seem to help much. Did the training overfit?
@@ -587,7 +569,7 @@ run on CPUs:
/opt/deepvariant/bin/call_variants \
--outfile "${OUTPUT_DIR}/test_set.cvo.tfrecord.gz" \
--examples "${OUTPUT_DIR}/test_set.no_label.tfrecord@${N_SHARDS}.gz" \
--checkpoint "${TRAINING_DIR}/model.ckpt-43600" \
--checkpoint "${TRAINING_DIR}/model.ckpt-44200" \
) >"${LOG_DIR}/test_set.call_variants.log" 2>&1 &
```

@@ -635,8 +617,8 @@ To summarize, the accuracy is:

Type | # FN | # FP | Recall | Precision | F1\_Score
----- | ---- | ---- | -------- | --------- | ---------
INDEL | 225 | 136 | 0.977552 | 0.986827 | 0.982167
SNP | 66 | 49 | 0.999004 | 0.999260 | 0.999132
INDEL | 229 | 141 | 0.977153 | 0.986343 | 0.981726
SNP | 71 | 58 | 0.998928 | 0.999125 | 0.999026
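As a quick consistency check, each F1 score is the harmonic mean of the corresponding precision and recall, e.g. for the new SNP row (a sketch):

```
python -c "p, r = 0.999125, 0.998928; print(2 * p * r / (p + r))"  # ~0.999026
```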

The baseline we're comparing to is to directly use the WGS model (`--checkpoint
${GCS_PRETRAINED_WGS_MODEL}`) to make the calls.
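The numbers above come from comparing the calls on the test set against the truth set with `hap.py`. A minimal sketch of that kind of comparison (the Docker image, in-container path, and the truth/query/reference file names are assumptions here, not the exact command from the case study):

```
sudo docker run -v "${HOME}:${HOME}" pkrusche/hap.py /opt/hap.py/bin/hap.py \
  "${TRUTH_VCF}" \
  "${OUTPUT_DIR}/test_set.vcf.gz" \
  -f "${TRUTH_BED}" \
  -r "${REF}" \
  -o "${OUTPUT_DIR}/happy.output"
```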
Binary file modified docs/images/TensorBoardAccuracy.png
Binary file modified docs/images/TensorBoardSpeed.png
1 change: 0 additions & 1 deletion scripts/run_wes_case_study_binaries.sh
@@ -12,7 +12,6 @@ set -euo pipefail
## Preliminaries
# Set a number of shell variables, to make what follows easier to read.
BASE="${HOME}/exome-case-study"
BIN_VERSION="0.7.0"
MODEL_VERSION="0.7.0"
MODEL_NAME="DeepVariant-inception_v3-${MODEL_VERSION}+data-wes_standard"
MODEL_HTTP_DIR="https://storage.googleapis.com/deepvariant/models/DeepVariant/${MODEL_VERSION}/${MODEL_NAME}"
2 changes: 1 addition & 1 deletion scripts/run_wes_case_study_docker.sh
@@ -6,7 +6,7 @@ set -euo pipefail
## Preliminaries
# Set a number of shell variables, to make what follows easier to read.
BASE="${HOME}/exome-case-study"
BIN_VERSION="0.7.0"
BIN_VERSION="0.7.1"
MODEL_VERSION="0.7.0"
MODEL_NAME="DeepVariant-inception_v3-${MODEL_VERSION}+data-wes_standard"
MODEL_HTTP_DIR="https://storage.googleapis.com/deepvariant/models/DeepVariant/${MODEL_VERSION}/${MODEL_NAME}"
1 change: 0 additions & 1 deletion scripts/run_wgs_case_study_binaries.sh
@@ -12,7 +12,6 @@ set -euo pipefail
## Preliminaries
# Set a number of shell variables, to make what follows easier to read.
BASE="${HOME}/case-study"
BIN_VERSION="0.7.0"
MODEL_VERSION="0.7.0"
MODEL_NAME="DeepVariant-inception_v3-${MODEL_VERSION}+data-wgs_standard"
MODEL_HTTP_DIR="https://storage.googleapis.com/deepvariant/models/DeepVariant/${MODEL_VERSION}/${MODEL_NAME}"
2 changes: 1 addition & 1 deletion scripts/run_wgs_case_study_docker.sh
@@ -6,7 +6,7 @@ set -euo pipefail
## Preliminaries
# Set a number of shell variables, to make what follows easier to read.
BASE="${HOME}/case-study"
BIN_VERSION="0.7.0"
BIN_VERSION="0.7.1"
MODEL_VERSION="0.7.0"
MODEL_NAME="DeepVariant-inception_v3-${MODEL_VERSION}+data-wgs_standard"
MODEL_HTTP_DIR="https://storage.googleapis.com/deepvariant/models/DeepVariant/${MODEL_VERSION}/${MODEL_NAME}"
