DMP Match Workflow DMP API Integration [WIP] #30

Draft: wants to merge 12 commits into base: main

2 changes: 1 addition & 1 deletion queries/dmptool-workflows/.dockerignore
@@ -5,4 +5,4 @@ airflow_settings.yaml
logs/
.venv
airflow.db
airflow.cfg
airflow.cfg
3 changes: 2 additions & 1 deletion queries/dmptool-workflows/.gitignore
@@ -10,4 +10,5 @@ airflow.cfg
airflow.db
.idea
workflows-config.yaml
docker-compose.override.yml
docker-compose.override.yml
venv
4 changes: 2 additions & 2 deletions queries/dmptool-workflows/Dockerfile
@@ -1,4 +1,4 @@
FROM quay.io/astronomer/astro-runtime:9.10.0
FROM quay.io/astronomer/astro-runtime:9.21.0

# Root user for installations
USER root
@@ -9,4 +9,4 @@ USER astro

# Install Observatory Platform
RUN git clone --branch feature/astro_kubernetes https://github.com/The-Academic-Observatory/observatory-platform.git
RUN pip install -e ./observatory-platform/ --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-no-providers-3.10.txt
RUN pip install -e ./observatory-platform/[tests] --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-no-providers-3.10.txt
135 changes: 108 additions & 27 deletions queries/dmptool-workflows/README.md
@@ -2,9 +2,11 @@
Astronomer.io based Apache Airflow workflows for matching academic works to DMPTool DMPs using the Academic Observatory
BigQuery datasets.

## Dependencies
Install the Astro CLI: https://www.astronomer.io/docs/astro/cli/install-cli
## Requirements
* Astro CLI: https://www.astronomer.io/docs/astro/cli/install-cli
* gcloud CLI: https://cloud.google.com/sdk/docs/install

## Installation
Clone the project and enter the `dmptool-workflows` directory:
```bash
git clone --branch feature/dmp-works-matching [email protected]:CDLUC3/dmsp_api_prototype.git
@@ -22,28 +24,11 @@ Install Python dependencies:
pip install git+https://github.com/The-Academic-Observatory/observatory-platform.git@feature/astro_kubernetes --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-no-providers-3.10.txt
```

## Local Development Setup
Add the following to your `.env` file:
```bash
GOOGLE_APPLICATION_CREDENTIALS=/usr/local/airflow/gcloud/application_default_credentials.json
```

Add `docker-compose.override.yml` to the root of this project and customise the path to the Google Cloud credentials file:
```commandline
version: "3.1"
services:
scheduler:
volumes:
- /path/to/host/google-application-credentials.json:/usr/local/airflow/gcloud/application_default_credentials.json:ro
webserver:
volumes:
- /path/to/host/google-application-credentials.json:/usr/local/airflow/gcloud/application_default_credentials.json:ro
triggerer:
volumes:
- /path/to/host/google-application-credentials.json:/usr/local/airflow/gcloud/application_default_credentials.json:ro
```
## Config
The workflow is configured via a config file stored as an Apache Airflow variable. It is often easier to work in YAML
and then convert it to JSON.

Customise the `workflow-config.yaml` file:
Example `workflows-config.yaml` file:
```yaml
cloud_workspaces:
- workspace: &dmptool_dev
@@ -65,6 +50,31 @@ Convert `workflow-config.yaml` to JSON:
yq -o=json '.workflows' workflows-config.yaml | jq -c .
```
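
The output is a single compact JSON array, which is what gets pasted into the `WORKFLOWS` Airflow variable in the steps below. The exact contents depend on your config; as an illustration, the deployment example later in this README uses a value of the form:
```json
[{"dag_id":"dmp_match_workflow","name":"DMP Match Workflow","class_name":"dmptool_workflows.dmp_match_workflow.workflow","cloud_workspace":{...}}]
```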

## Local Development
The following instructions show how to run the workflow locally.

### Setup
Add the following to your `.env` file:
```bash
GOOGLE_APPLICATION_CREDENTIALS=/usr/local/airflow/gcloud/application_default_credentials.json
```

Add `docker-compose.override.yml` to the root of this project and customise the path to the Google Cloud credentials file:
```commandline
version: "3.1"
services:
scheduler:
volumes:
- /path/to/host/google-application-credentials.json:/usr/local/airflow/gcloud/application_default_credentials.json:ro
webserver:
volumes:
- /path/to/host/google-application-credentials.json:/usr/local/airflow/gcloud/application_default_credentials.json:ro
triggerer:
volumes:
- /path/to/host/google-application-credentials.json:/usr/local/airflow/gcloud/application_default_credentials.json:ro
```
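
If you do not yet have a Google application credentials file on the host, the gcloud CLI can create one (by default at `~/.config/gcloud/application_default_credentials.json` on Linux and macOS), which you can then mount as shown above:
```bash
gcloud auth application-default login
```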


Create or add the following to `airflow_settings.yaml`, making sure to paste the JSON output from above into the
WORKFLOWS variable_value:
```yaml
@@ -77,15 +87,15 @@ airflow:
variable_value: REPLACE WITH WORKFLOWS JSON
```

## Running Airflow locally
### Running Airflow locally
Run the following command:
```bash
astro dev start
```

Then open the Airflow UI and run the workflow at: http://localhost:8080
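
When you are finished, stop the local environment with:
```bash
astro dev stop
```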

## Running the Queries
### Running the Queries
You may also run or generate the queries directly. Customise the project IDs and the shard date (the shard date of the `dmps_raw`
table). Add `--dry-run` to generate the SQL queries without running them (see the example after the command below).
```bash
@@ -94,5 +104,76 @@ export PYTHONPATH=/path/to/dmptool-workflows/dags:$PYTHONPATH
python3 run_queries.py ao-project-id my-project-id YYYY-MM-DD
```
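
For example, to only generate the SQL files without executing them, append the flag:
```bash
python3 run_queries.py ao-project-id my-project-id YYYY-MM-DD --dry-run
```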

## Deploy
TODO
### Running tests
Make sure that the `dags` folder is on your Python path:
```bash
export PYTHONPATH=/path/to/dmptool-workflows/dags:$PYTHONPATH
```

Set the following environment variables:
* GOOGLE_APPLICATION_CREDENTIALS: as described above.
* TEST_GCP_PROJECT_ID: Google Cloud project ID for testing.
* TEST_GCP_DATA_LOCATION: the Google Cloud Storage and BigQuery data location.

Your service account needs the same permissions as granted in `./bin/setup-gcloud-project.sh`.
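
For example, the test environment might be configured as follows (placeholder values; use your own credentials path, project ID and location):
```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/google-application-credentials.json
export TEST_GCP_PROJECT_ID=my-project-id
export TEST_GCP_DATA_LOCATION=us
```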

Run tests:
```bash
python -m unittest discover
```

## Deployment
Deploying the project consists of:
* Creating and configuring a Google Cloud project.
* Creating and configuring an Astronomer.io Apache Airflow deployment.
* Attaching a Customer Managed Service Account to your Astronomer.io deployment.
* Deploying Airflow workflows.

### Create Google Cloud Project
Create your Google Cloud project:
```bash
gcloud projects create my-project-id --name="My Project Name"
```

Configure your Google Cloud project with the following script:
```bash
(cd bin && ./setup-gcloud-project.sh my-project-id my-bucket-name)
```

Copy the "DMP Airflow Service Account" ID printed by this script.
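
The script creates a service account named `dmp-airflow` (see `DMP_SERVICE_ACCOUNT_NAME` in `setup-gcloud-project.sh`), so the printed ID will look like:
```bash
DMP Airflow Service Account: dmp-airflow@my-project-id.iam.gserviceaccount.com
```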

### Create Astronomer.io Deployment
Switch to the Astronomer.io workspace that you want to work in:
```bash
astro workspace switch
```

Create your Astro deployment. Note that you may need to update or customise some of the variables in `./bin/deployment.yaml`, such as
`runtime_version`, `workspace_name` and `alert_emails`.
```bash
astro deployment create --deployment-file ./bin/deployment.yaml
```

Create Apache Airflow Variables, customising the value for the WORKFLOWS key. Your Airflow instance needs to be out
of hibernation to run the `airflow-variable create` and `connection create` commands.
```bash
astro deployment variable create GOOGLE_CLOUD_PROJECT=my-project-id
astro deployment airflow-variable create --key DATA_PATH --value /home/astro/data
astro deployment airflow-variable create --key WORKFLOWS --value '[{"dag_id":"dmp_match_workflow","name":"DMP Match Workflow","class_name":"dmptool_workflows.dmp_match_workflow.workflow","cloud_workspace":{...}}]'
astro deployment connection create --conn-id dmptool_api_credentials --conn-type http --login my-client-id --password my-client-secret
```

### Customer Managed Identity
Attach the "DMP Airflow Service Account" to your Astro deployment as a "Customer Managed Identity".

Follow the steps here: https://www.astronomer.io/docs/astro/authorize-deployments-to-your-cloud/#attach-a-service-account-to-your-deployment

Use the "DMP Airflow Service Account" email printed by the `setup-gcloud-project.sh` script.

Step 6 of those instructions is not necessary.

### Deploy Airflow Workflows
Deploy workflows:
```bash
astro deploy
```
33 changes: 33 additions & 0 deletions queries/dmptool-workflows/bin/deployment.yaml
@@ -0,0 +1,33 @@
deployment:
configuration:
name: "DMPTool Workflows"
description: "Apache Airflow workflows for ingesting bibliometric data into the DMPTool"
runtime_version: 9.21.0
dag_deploy_enabled: true
ci_cd_enforcement: false
scheduler_size: SMALL
is_high_availability: false
is_development_mode: true
executor: CELERY
scheduler_count: 1
workspace_name: "California Digital Library"
deployment_type: STANDARD
cloud_provider: GCP
region: us-central1
default_task_pod_cpu: "0.25"
default_task_pod_memory: 0.5Gi
resource_quota_cpu: "10"
resource_quota_memory: 20Gi
workload_identity: ""
worker_queues:
- name: default
max_worker_count: 10
min_worker_count: 0
worker_concurrency: 5
worker_type: A5
alert_emails: [ ]
hibernation_schedules:
- hibernate_at: 0 2 * * 0
wake_at: 0 0 * * 0
description: Wake at 12am and hibernate 2am every Sunday UTC
enabled: true
12 changes: 12 additions & 0 deletions queries/dmptool-workflows/bin/lifecycle.json
@@ -0,0 +1,12 @@
{
"rule": [
{
"action": {
"type": "Delete"
},
"condition": {
"age": 31
}
}
]
}
135 changes: 135 additions & 0 deletions queries/dmptool-workflows/bin/setup-gcloud-project.sh
@@ -0,0 +1,135 @@
#!/bin/bash

# Positional arguments
PROJECT_ID=$1
BUCKET_NAME=$2
shift 2 # Remove the positional arguments so that the loop below only parses the optional flags

# Optional arguments with default values
BQ_REGION="us"
GCS_REGION="us-central1"
CONNECTION_ID="vertex_ai"
PER_USER_PER_DAY=$((2 * 1024 * 1024)) # 2 TiB in MiB
PER_PROJECT_PER_DAY=$((2 * 1024 * 1024)) # 2 TiB in MiB
ACADEMIC_OBSERVATORY_PROJECT_ID="academic-observatory"

# Parse optional arguments
while [[ "$#" -gt 0 ]]; do
case $1 in
--bq-region) BQ_REGION="$2"; shift ;;
--gcs-region) GCS_REGION="$2"; shift ;;
--connection-id) CONNECTION_ID="$2"; shift ;;
--per-user-per-day) PER_USER_PER_DAY="$2"; shift ;;
--per-project-per-day) PER_PROJECT_PER_DAY="$2"; shift ;;
--academic-observatory-project-id) ACADEMIC_OBSERVATORY_PROJECT_ID="$2"; shift ;;
*) break ;;
esac
shift
done

# Check if required positional arguments are provided and if not print usage
if [[ -z "$PROJECT_ID" || -z "$BUCKET_NAME" ]]; then
echo "Usage: $0 <PROJECT_ID> <BUCKET_NAME> [optional arguments]"
echo "Optional arguments:"
echo " --bq-region <value> Default: us"
echo " --gcs-region <value> Default: us-central1"
echo " --connection-id <value> Default: vertex_ai"
echo " --per-user-per-day <value> Default: $((1 * 1024 * 1024)) (1 TiB in MiB)"
echo " --per-project-per-day <value> Default: $((1 * 1024 * 1024)) (1 TiB in MiB)"
echo " --academic-observatory-project-id <value> Default: academic-observatory"
exit 1
fi

echo "Configuration:"
echo " PROJECT_ID: $PROJECT_ID"
echo " BUCKET_NAME: $BUCKET_NAME"
echo " BQ_REGION: $BQ_REGION"
echo " GCS_REGION: $GCS_REGION"
echo " CONNECTION_ID: $CONNECTION_ID"
echo " PER_USER_PER_DAY: $PER_USER_PER_DAY"
echo " PER_PROJECT_PER_DAY: $PER_PROJECT_PER_DAY"
echo " ACADEMIC_OBSERVATORY_PROJECT_ID: $ACADEMIC_OBSERVATORY_PROJECT_ID"
echo ""

echo "Enable Google Cloud APIs"
gcloud services enable storage.googleapis.com \
bigquery.googleapis.com \
bigqueryconnection.googleapis.com \
aiplatform.googleapis.com --project=$PROJECT_ID

echo "Set BigQuery Quota"
gcloud alpha services quota update \
--consumer=projects/$PROJECT_ID \
--service bigquery.googleapis.com \
--metric bigquery.googleapis.com/quota/query/usage \
--value $PER_USER_PER_DAY --unit 1/d/{project}/{user} --force

gcloud alpha services quota update \
--consumer=projects/$PROJECT_ID \
--service bigquery.googleapis.com \
--metric bigquery.googleapis.com/quota/query/usage \
--value $PER_PROJECT_PER_DAY --unit 1/d/{project} --force

echo "Create DMP Airflow IAM role"
DMP_ROLE_NAME="DMPAirflowRole"
gcloud iam roles create $DMP_ROLE_NAME --project=$PROJECT_ID \
--title="DMP Airflow Role" \
--description="Gives Astro permissions to specific Google Cloud resources" \
--permissions=bigquery.datasets.create,\
bigquery.datasets.get,\
bigquery.datasets.update,\
bigquery.jobs.create,\
bigquery.tables.create,\
bigquery.tables.update,\
bigquery.tables.updateData,\
bigquery.tables.get,\
bigquery.tables.getData,\
bigquery.tables.list,\
bigquery.tables.export,\
bigquery.connections.use,\
bigquery.models.create,\
bigquery.models.updateData,\
bigquery.models.updateMetadata,\
bigquery.routines.create,\
bigquery.routines.update,\
bigquery.routines.get

echo "Create DMP Airflow Service Account"
DMP_SERVICE_ACCOUNT_NAME="dmp-airflow"
DMP_SERVICE_ACCOUNT="$DMP_SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts create $DMP_SERVICE_ACCOUNT_NAME \
--project=$PROJECT_ID \
--description="The DMP Tool Apache Airflow Service Account" \
--display-name="DMP Airflow Service Account"

echo "Add $DMP_ROLE_NAME Role to DMP Airflow Service Account"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$DMP_SERVICE_ACCOUNT" \
--role=projects/$PROJECT_ID/roles/$DMP_ROLE_NAME

echo "Create GCS bucket and add lifecycle rules"
gcloud storage buckets create gs://$BUCKET_NAME --location=$GCS_REGION --project=$PROJECT_ID
gcloud storage buckets update gs://$BUCKET_NAME --lifecycle-file=lifecycle.json --project=$PROJECT_ID

echo "Give DMP Service Account permission to access bucket"
gsutil iam ch \
serviceAccount:$DMP_SERVICE_ACCOUNT:roles/storage.legacyBucketReader \
serviceAccount:$DMP_SERVICE_ACCOUNT:roles/storage.objectCreator \
serviceAccount:$DMP_SERVICE_ACCOUNT:roles/storage.objectViewer \
gs://$BUCKET_NAME

echo "Grant DMP Service Account access to Academic Observatory BigQuery"
gcloud projects add-iam-policy-binding $ACADEMIC_OBSERVATORY_PROJECT_ID \
--member="serviceAccount:$DMP_SERVICE_ACCOUNT" \
--role="roles/bigquery.dataViewer"

echo "Create a Cloud Resource Connection"
bq mk --connection --location=$BQ_REGION --project_id=$PROJECT_ID --connection_type=CLOUD_RESOURCE $CONNECTION_ID
CRC_SERVICE_ACCOUNT=$(bq show --connection $PROJECT_ID.$BQ_REGION.$CONNECTION_ID | grep -oP '(?<="serviceAccountId": ")[^"]+')

echo "Grant Cloud Resource Connection Service Account access to Vertex AI"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$CRC_SERVICE_ACCOUNT" \
--role="roles/aiplatform.user"

echo ""
echo "DMP Airflow Service Account: $DMP_SERVICE_ACCOUNT"