This repo contains code and data for training classifiers on arXiv category labels. We predict the label(s) assigned to a paper using its title and abstract text. An earlier effort to do this is described here. We deploy via Airflow using the DAG defined in arxiv_classifier_dag.py.
- We extracted arXiv paper metadata from BQ (
gcp_cset_arxiv_metadata.arxiv_metadata_latest
) to GCS (gs://arxiv-classifier-v2/source/
). Seeextract.sh
. - We downloaded it to disk under
data/source
. Seedownload.sh
. - We preprocessed and partitioned the data into 70/15/15 train/eval/test splits. See
split.py
. - We make this data available for training as a 🤗 Dataset from
data/dataset/arxiv
. Seearxiv.py
.
- We used PyTorch/XLA to train on TPU.
PyTorch/XLA is a Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs.
- Our first abstraction layer over PyTorch/XLA is Huggingface Transformers.
- And to adapt our training script for use on TPU, we used Huggingface Accelerate.
- Weights & Biases provided monitoring.
The Cloud TPU docs have a user guide for PyTorch/XLA. The first thing it says is that you can use either TPU VMs or TPU Pods, but they (now) recommend VMs, so we used one.
We created a TPU VM arxiv-classifier-v2
in us-central1-a
(console):
gcloud compute tpus tpu-vm create arxiv-classifier-v2 \
--zone=us-central1-a \
--accelerator-type=v3-8 \
--version=tpu-vm-pt-1.13
This VM is backed by a v3.8 TPU and nominally costs $8/hour. At time of writing, v4 TPUs were limited-availability. PyTorch 1.13 is pre-installed with CUDA 11.7.
gcloud compute tpus tpu-vm start arxiv-classifier-v2 --zone=us-central1-a
gcloud compute tpus tpu-vm stop arxiv-classifier-v2 --zone=us-central1-a
Shell in like
gcloud compute tpus tpu-vm ssh arxiv-classifier-v2 --zone=us-central1-a
gcloud compute tpus
CLI requires application default credentials (ADC) set.
Otherwise, with user auth alone, you may see this:
% gcloud compute tpus tpu-vm start arxiv-classifier-v2 --zone=us-central1-a
Request issued for: [arxiv-classifier-v2]
Waiting for operation [projects/gcp-cset-projects/locations/us-central1-a/operations/operation-1669841256413-5eeb636eb799d-6070f791-4abbb44a] to complete...failed.
ERROR: (gcloud.compute.tpus.tpu-vm.start) {
"code": 8,
"message": "There is no more capacity in the zone \"us-central1-a\"; you can try in another zone where Cloud TPU Nodes are offered (see https://cloud.google.com/tpu/docs/regions) [EID: 0x191e4b746b2c69a9]"
}
The above is solved by running gcloud auth application-default login
.
~/
as the first user which aren't writeable by a second user.
Per the PyTorch/XLA user guide, we don't bother with virtualenvs on the TPU VM. This minimal example from the user guide just works:
% export XRT_TPU_CONFIG="localservice;0;localhost:51011"
% python3
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla.core.xla_model as xm
>>> dev = xm.xla_device()
>>> t1 = torch.randn(3,3,device=dev)
>>> t2 = torch.randn(3,3,device=dev)
>>> print(t1 + t2)
tensor([[-1.1846, -0.7140, -0.4168],
[-0.3259, -0.5264, -0.8828],
[-0.8562, -0.5813, 0.3264]], device='xla:1')
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
You may want to add that to your .profile
.
We can scp
resources to the TPU VM via gcloud compute tpus tpu-vm scp
, e.g. like this
gcloud compute tpus tpu-vm scp --recurse ./training arxiv-classifier-v2:~/ --zone=us-central1-a
To copy a local directory training
and its contents to the user directory on our TPU VM.
We have a project-specific bucket gs://arxiv-classifier-v2
.
Per the TPU docs, we created a service agent for the TPU VM to access the project bucket:
$ gcloud beta services identity create --service tpu.googleapis.com
Service identity created: [email protected]
Service agents are slightly different from service accounts and not as well-documented.
Anyway, the key thing is we granted [email protected]
the Storage Admin role for our project bucket using the console.
From the VM, we can confirm there's access to the project bucket.
touch test.tmp
gsutil cp ./test.tmp gs://arxiv-classifier-v2/
gsutil rm gs://arxiv-classifier-v2/test.tmp
Our training script is training/run.py
.
We invoke it using the Huggingface accelerate library.
Configurations for training run are in JSON files under training/configs
.
Invoking the script from training/
with a config at debug.json
looks like
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
export TOKENIZERS_PARALLELISM=False
accelerate launch run.py configs/debug.json
The config JSON contains all of our arguments for the training script, which we adapted from the GLUE example script.
Before the first training run, we set up Accelerate as below using accelerate config
:
$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]:
Are you using a TPU cluster? [yes/NO]:
How many TPU cores should be used for distributed training? [1]:
This creates a config file at ~/.cache/huggingface/accelerate/default_config.yaml
.
We then edit the values of downcast_bf16
and mixed_precision
to yes
as below:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: TPU
downcast_bf16: 'yes'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'yes'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
Accelerate supports some interesting training strategies. Its "TPU Best Practices" guide is very helpful.
For example,
As you launch your script, you may notice that training seems exceptionally slow at first. This is because TPUs first run through a few batches of data to see how much memory to allocate before finally utilizing this configured memory allocation extremely efficiently.
If you notice that your evaluation code to calculate the metrics of your model takes longer due to a larger batch size being used, it is recommended to keep the batch size the same as the training data if it is too slow. Otherwise the memory will reallocate to this new batch size after the first few iterations.
Start the TPU VM if it isn't already running:
gcloud compute tpus tpu-vm start arxiv-classifier-v2 --zone=us-central1-a
Shell in:
gcloud compute tpus tpu-vm ssh arxiv-classifier-v2 --zone=us-central1-a
Clone this repo:
git clone https://github.com/georgetown-cset/arxiv-classifier-v2.git
Download the source data:
cd arxiv-classifier-v2
cd dataset/scripts
bash download.sh
Create pre-processed splits:
PYTHONPATH=../.. python3 split.py
The dataset is ready for training.
We define experiments using JSON config files in the training/configs
directory.
This file specifies parameters including:
model_name_or_path
: The name of a pretrained model from the Huggingface hub, or a local path for a pretrained model. We're usingsentence-transformers/allenai-specter
.dataset_name
: The name of a dataset defined in thearXiv.py
dataset load script, specifically in any of theBuilderConfig
classes listed in theBUILDER_CONFIGS
attribute of theArxivDataset
classes. For example:AI
or orAI Undersample 4-to-1
.run_name
: The name of the training run for monitoring in Weights & Biases.output_dir
: Where to save training outputs (checkpoints, stats, configs, and the trained model). We're using subdirectories of./training/models
.max_train_samples
: For debugging purposes, a limit on the number of examples to use from thetrain
split.max_eval_samples
: Similarly, a limit on the examples to use from thevalidation
split.evaluation_strategy
: Eithersteps
orepoch
, indicating when to evaluate the model during training.eval_steps
: Ifsteps
above, how often to evaluate (e.g.500
to evaluate after every 500 steps).
After creating a new config file:
- Commit it and push to GitHub
git pull
from the TPU VM
On the VM, we start an experiment with accelerate launch {script-path} {config-path}
.
Using the debug.json
config looks like this:
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
export TOKENIZERS_PARALLELISM=False
accelerate launch training/run.py training/configs/debug.json
One way to explore the available model, training, and accelerate parameters is via
accelerate launch training/run.py --help
The ./models
directory contains a subset of the models trained selected from wandb.
Each of these models was trained on all of arXiv.
- ai-32: AI classifier.
- ai-full-weak-negs: AI classifier trained on arXiv and OpenAlex weak-negative examples 10% the size of arXiv.
- cv-32: CV subject classifier.
- cyber-32: Cyber subject classifier.
- cyber-full-weak-negs: Cyber subject classifier trained on arXiv and OpenAlex weak-negative examples 10% the size of arXiv.
- nlp-32: NLP subject classifier.
- ro-32: Robotics subject classifier.
We deploy on Dataflow using custom containers that provide large model assets and pre-installed dependencies. The container runs on each worker during inference, so each worker has what it needs without long setup times or downloads. Docs on this are here.
The container is defined by our Dockerfile.
It sets up a Python environment correctly, then copies in our model assets from the ./models
directory.
Beam's RunInference API (see below) requires .pth
or state-dict-format PyTorch binaries, so we've run ./inference/to_state_dict.py
to produce these from selected model checkpoint files, writing .pth
binaries under the ./images/ directory, where they're copied into the container images on build.
export REPO=arxiv-classifier-v2
export IMAGE=ai
export IMAGE_URI=us-east1-docker.pkg.dev/gcp-cset-projects/$REPO/$IMAGE:latest
To update and test a container, you can build it locally:
docker build . --tag $IMAGE_URI
docker run --name ai-classifier-v2-local --entrypoint /bin/bash -it $IMAGE_URI
To make containers available for Dataflow jobs, we use Cloud Build, which builds
each image on GCP and pushes it to an Artifact Registry
repo
called arxiv-classifier-v2
.
gcloud builds submit . --tag $IMAGE_URI
At time of writing, building the container takes ~30 minutes after uploading the assets.
We minimize the container size by adding files that aren't needed in deployment to .gcloudignore
.
You can also deploy the custom container to a GCE instance for testing like so:
gcloud compute instances create-with-container arxiv-classifier-v2-deployment-testing \
--project=gcp-cset-projects \
--zone=us-east1-c \
--machine-type=n2-standard-2 \
--network-interface=network-tier=PREMIUM,subnet=default \
--maintenance-policy=MIGRATE \
--provisioning-model=STANDARD \
--service-account=855475113448-compute@developer.gserviceaccount.com \
--scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
--image=projects/cos-cloud/global/images/cos-stable-101-17162-127-42 \
--boot-disk-size=50GB \
--boot-disk-type=pd-balanced \
--boot-disk-device-name=arxiv-classifier-v2-deployment-testing \
--container-image=us-east1-docker.pkg.dev/gcp-cset-projects/arxiv-classifier-v2/ai:latest \
--container-restart-policy=always \
--container-privileged \
--container-stdin \
--container-tty \
--no-shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--labels=ec-src=vm_add-gcloud,container-vm=cos-stable-101-17162-127-42
Scripts and resources for inference are in the ./inference
directory.
Deploying on Dataflow uses the ./inference/beam_runner.py
pipeline script and Beam's RunInference API.
We extract the merged corpus to GCS with extract-corpus.sh.
Via bq extract
this produces gs://arxiv-classifier-v2/dataflow/merged-corpus-input/merged-corpus-input-\*.jsonl
.
These files are given as the --input_path
on Dataflow.
There's a series of bash scripts for submitting jobs to Dataflow:
./inference/run.py
: Inference on the merged corpus using the AI model../inference/run.py
: Inference on the merged corpus using the cyber subject model../inference/run.py
: Inference on the merged corpus using the NLP subject model.
Outputs are written to gs://arxiv-classifier-v2/dataflow/merged-corpus-output/$MODEL/merged-corpus-predictions-
.
For small-scale testing, see ./inference/test-dataflow.sh
.
Cost estimation is available through ./inference/estimate_cost.py
.
There's also a predictions script for smallish local input without Beam, inference/run.py
.
And for testing on DirectRunner, ./inference/test-directrunner.sh
.