Create aws.yml #91

Draft
wants to merge 44 commits into base: Release/snowplow-unified/0.5.1
Changes from all commits

Commits (44):
1587ff3  Create aws.yml (ilias1111, Nov 8, 2024)
08010cc  Update aws.yml (ilias1111, Nov 8, 2024)
8dfa9f0  Full refresh (ilias1111, Nov 8, 2024)
f7d2a13  Update aws.yml (ilias1111, Nov 8, 2024)
7642181  Update aws.yml (ilias1111, Nov 8, 2024)
4f58b18  Lets Hope (ilias1111, Nov 8, 2024)
e6935d3  Update aws.yml (ilias1111, Nov 8, 2024)
0325b88  Update aws.yml (ilias1111, Nov 8, 2024)
1c9f62c  Update aws.yml (ilias1111, Nov 8, 2024)
31fcae7  Update aws.yml (ilias1111, Nov 8, 2024)
c4b9e39  Total revamp and fingers crossed (ilias1111, Nov 8, 2024)
bff89d4  Update aws.yml (ilias1111, Nov 8, 2024)
045fe5e  Update aws.yml (ilias1111, Nov 8, 2024)
1b1b889  Update aws.yml (ilias1111, Nov 8, 2024)
6a2d2ba  Update aws.yml (ilias1111, Nov 8, 2024)
2d13480  Update aws.yml (ilias1111, Nov 8, 2024)
734d862  Update spark-defaults.conf (ilias1111, Nov 8, 2024)
c59d5fb  Update aws.yml (ilias1111, Nov 8, 2024)
8921f41  Update docker-compose.yml (ilias1111, Nov 8, 2024)
d8b3814  Token addition (ilias1111, Nov 8, 2024)
1e6b931  Update spark-defaults.conf (ilias1111, Nov 9, 2024)
8692d86  Update spark-defaults.conf (ilias1111, Nov 9, 2024)
e962fd5  Fixes (ilias1111, Nov 9, 2024)
adcb933  Fixes (ilias1111, Nov 10, 2024)
5f6d70c  Lets check what will happen (ilias1111, Nov 10, 2024)
9bd35b7  Update aws.yml (ilias1111, Nov 10, 2024)
9ad5a4c  Update aws.yml (ilias1111, Nov 10, 2024)
46175f7  Update aws.yml (ilias1111, Nov 10, 2024)
c12f7eb  Update aws.yml (ilias1111, Nov 10, 2024)
48b4d85  Update aws.yml (ilias1111, Nov 10, 2024)
cd075f6  Update spark-defaults.conf (ilias1111, Nov 11, 2024)
846632f  Lets try that (ilias1111, Nov 11, 2024)
fde5b1d  Add this settings (ilias1111, Nov 11, 2024)
12a4475  Change base (ilias1111, Nov 11, 2024)
dc3261d  Update docker-compose.yml (ilias1111, Nov 11, 2024)
eb26b63  Update docker-compose.yml (ilias1111, Nov 11, 2024)
3f85be5  Update docker-compose.yml (ilias1111, Nov 11, 2024)
64424ee  Lets see (ilias1111, Nov 11, 2024)
dc5c492  Update spark-defaults.conf (ilias1111, Nov 11, 2024)
a45a483  Update spark-defaults.conf (ilias1111, Nov 11, 2024)
b93d485  Working (ilias1111, Nov 11, 2024)
8e4fc8a  Update spark-defaults.conf (ilias1111, Nov 11, 2024)
d496e5b  Add in dbt project (ilias1111, Nov 11, 2024)
f7268a2  Update dbt_project.yml (ilias1111, Nov 11, 2024)
163 changes: 163 additions & 0 deletions .github/workflows/aws.yml
@@ -0,0 +1,163 @@
name: List S3 Objects - AWS

on:
  pull_request:

env:
  AWS_REGION: eu-west-1
  AWS_ROLE_ARN: "arn:aws:iam::719197435995:role/DbtSparkTestingActions"
  S3_BUCKET: "dbt-spark-iceberg/github-integration-testing"
  DBT_PROFILES_DIR: ./ci

permissions:
  id-token: write
  contents: read

jobs:
  list_s3_objects:
    name: list_s3_objects
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: .github/workflows/spark_deployment
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ env.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}
          mask-aws-account-id: true
          mask-aws-role-arn: true
          role-session-name: GithubActionsSession
          role-duration-seconds: 3600
          output-credentials: true

      - name: Verify AWS credentials and S3 access
        run: |
          aws sts get-caller-identity
          aws s3 ls s3://${{ env.S3_BUCKET }} --summarize
          # Test S3 write access
          echo "test" > test.txt
          aws s3 cp test.txt s3://${{ env.S3_BUCKET }}/test.txt
          aws s3 rm s3://${{ env.S3_BUCKET }}/test.txt

      - name: Install Docker Compose
        run: |
          sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
          sudo chmod +x /usr/local/bin/docker-compose
          docker-compose --version

      - name: Configure Docker environment
        run: |
          # Export AWS credentials from assumed role
          export AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id)
          export AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key)
          export AWS_SESSION_TOKEN=$(aws configure get aws_session_token)

          # Create Docker environment file (create once, then append so no variable is overwritten)
          echo "AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}" > .env
          echo "AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}" >> .env
          echo "AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN}" >> .env
          echo "AWS_REGION=${AWS_REGION}" >> .env

      - name: Configure Docker credentials
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_SNOWPLOWCI_READ_USERNAME }}
          password: ${{ secrets.DOCKERHUB_SNOWPLOWCI_READ_PASSWORD }}

      - name: Clean up Docker
        run: |
          docker system prune -af
          docker volume prune -f

      - name: Build and start Spark cluster
        id: spark-startup
        run: |
          docker-compose up -d
          echo "Waiting for Spark services to start..."
          sleep 30 # Initial wait

          # Get container ID and store it
          CONTAINER_NAME=$(docker ps --format '{{.Names}}' | grep thrift-server)
          echo "container_name=${CONTAINER_NAME}" >> $GITHUB_OUTPUT

          # Wait for Spark to be fully initialized
          for i in {1..30}; do
            if docker logs ${CONTAINER_NAME} 2>&1 | grep -q "HiveThriftServer2 started"; then
              echo "Spark initialized successfully"
              break
            fi
            echo "Waiting for Spark initialization... attempt $i"
            sleep 3
          done

          # Verify Spark is running
          docker ps
          docker logs ${CONTAINER_NAME}

      - name: Python setup
        uses: actions/setup-python@v4
        with:
          python-version: "3.8.x"

      - name: Install spark dependencies
        run: |
          pip install --upgrade pip wheel setuptools
          pip install -Iv "dbt-spark[PyHive]"==1.7.0 --upgrade

      - name: Verify Spark cluster and connection
        run: |
          docker ps
          docker logs ${{ steps.spark-startup.outputs.container_name }}
          docker exec ${{ steps.spark-startup.outputs.container_name }} beeline -u "jdbc:hive2://localhost:10000" -e "show databases;"

      - name: Run DBT Debug
        working-directory: ./integration_tests
        run: |
          # Get service logs before attempting debug
          docker logs ${{ steps.spark-startup.outputs.container_name }}
          dbt deps
          dbt debug --target spark_iceberg

      - name: Clean up before tests
        working-directory: ./integration_tests
        run: dbt run-operation post_ci_cleanup --target spark_iceberg

      - name: Run tests
        working-directory: ./integration_tests
        run: |
          set -e
          ./.scripts/integration_test.sh -d spark_iceberg

      - name: Capture Spark logs on failure
        if: failure()
        run: |
          echo "Capturing Spark logs..."
          docker logs ${{ steps.spark-startup.outputs.container_name }} > spark_logs.txt
          cat spark_logs.txt

          echo "Capturing Spark UI details..."
          curl -v http://localhost:4040/api/v1/applications > spark_ui.txt || true
          cat spark_ui.txt

      - name: Upload logs as artifact
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: spark-logs
          path: |
            spark_logs.txt
            spark_ui.txt
          compression-level: 6 # Moderate compression
          retention-days: 5 # Keep logs for 5 days

      - name: Cleanup
        if: always()
        run: |
          docker-compose down
          docker system prune -af
          rm -f .env
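
Note: the connection exercised by `dbt debug --target spark_iceberg` is defined under the profiles directory the workflow points at (`DBT_PROFILES_DIR: ./ci`), which is not part of this diff. A minimal sketch of what such a dbt-spark profile could look like for the Thrift server exposed on port 10000; the profile name, schema, thread count, and retry settings below are assumptions, not taken from the repository:

snowplow_unified_integration_tests:
  target: spark_iceberg
  outputs:
    spark_iceberg:
      type: spark
      method: thrift            # connect via the HiveThriftServer2 started by docker-compose
      host: localhost
      port: 10000
      schema: github_snowplow_manifest   # placeholder schema name
      threads: 4
      connect_retries: 5
      connect_timeout: 60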
63 changes: 18 additions & 45 deletions .github/workflows/spark_deployment/docker-compose.yml
@@ -5,62 +5,35 @@ networks:
driver: bridge

services:
spark-master:
image: snowplow/spark-s3-iceberg:latest
command: ["/bin/bash", "-c", "/spark/sbin/start-master.sh -h spark-master --properties-file /spark/conf/spark-defaults.conf && tail -f /spark/logs/spark--org.apache.spark.deploy.master.Master-1-*.out"]
hostname: spark-master
ports:
- '8080:8080'
- '7077:7077'
environment:
- SPARK_LOCAL_IP=spark-master
- SPARK_MASTER_HOST=spark-master
- SPARK_MASTER_PORT=7077
- SPARK_MASTER_OPTS="-Dspark.driver.memory=2g"
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_REGION=eu-west-1
- AWS_DEFAULT_REGION=eu-west-1
volumes:
- ./spark-defaults.conf:/spark/conf/spark-defaults.conf
networks:
- spark-network

spark-worker:
image: snowplow/spark-s3-iceberg:latest
command: ["/bin/bash", "-c", "sleep 10 && /spark/sbin/start-worker.sh spark://spark-master:7077 --properties-file /spark/conf/spark-defaults.conf && tail -f /spark/logs/spark--org.apache.spark.deploy.worker.Worker-*.out"]
depends_on:
- spark-master
environment:
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=4G
- SPARK_EXECUTOR_MEMORY=3G
- SPARK_LOCAL_IP=spark-worker
- SPARK_MASTER=spark://spark-master:7077
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_REGION=eu-west-1
- AWS_DEFAULT_REGION=eu-west-1
volumes:
- ./spark-defaults.conf:/spark/conf/spark-defaults.conf
networks:
- spark-network

thrift-server:
image: snowplow/spark-s3-iceberg:latest
command: ["/bin/bash", "-c", "sleep 30 && /spark/sbin/start-thriftserver.sh --master spark://spark-master:7077 --driver-memory 2g --executor-memory 3g --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.server2.thrift.bind.host=0.0.0.0 --conf spark.sql.hive.thriftServer.async=true --conf spark.sql.hive.thriftServer.workerQueue.size=2000 --conf spark.sql.hive.thriftServer.maxWorkerThreads=100 --conf spark.sql.hive.thriftServer.minWorkerThreads=50 && tail -f /spark/logs/spark--org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-*.out"]
ports:
- '10000:10000'
depends_on:
- spark-master
- spark-worker
- '4040:4040'
environment:
- SPARK_LOCAL_IP=thrift-server
# AWS credentials
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN}
- AWS_REGION=eu-west-1
- AWS_DEFAULT_REGION=eu-west-1
deploy:
resources:
limits:
cpus: '3.5'
memory: 14GB
reservations:
cpus: '2'
memory: 10GB
volumes:
- ./spark-defaults.conf:/spark/conf/spark-defaults.conf
- ./setup.sh:/setup.sh
entrypoint: ["/bin/bash", "/setup.sh"]
command: ["/bin/bash", "-c", "/spark/sbin/start-thriftserver.sh \
--master local[3] \
--driver-memory 10g \
--executor-memory 3g \
&& tail -f /spark/logs/spark--org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-*.out"]
networks:
- spark-network
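
To reproduce the CI environment locally, the same sequence the workflow runs can be condensed into a short smoke test (a sketch only; it assumes the AWS variables consumed by docker-compose are already present in a local .env file, as the workflow arranges above):

# Bring up the Thrift server defined in docker-compose.yml and wait for it
docker-compose up -d
sleep 30
CONTAINER_NAME=$(docker ps --format '{{.Names}}' | grep thrift-server)
docker logs ${CONTAINER_NAME} 2>&1 | grep "HiveThriftServer2 started"
# Confirm the JDBC endpoint answers before pointing dbt at it
docker exec ${CONTAINER_NAME} beeline -u "jdbc:hive2://localhost:10000" -e "show databases;"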
10 changes: 10 additions & 0 deletions .github/workflows/spark_deployment/setup.sh
@@ -0,0 +1,10 @@

#!/bin/bash

# Create a new spark-defaults.conf with substituted values
sed -e "s|\${AWS_ACCESS_KEY_ID}|$AWS_ACCESS_KEY_ID|g" \
-e "s|\${AWS_SECRET_ACCESS_KEY}|$AWS_SECRET_ACCESS_KEY|g" \
-e "s|\${AWS_SESSION_TOKEN}|$AWS_SESSION_TOKEN|g" \
/spark/conf/spark-defaults.conf.template > /spark/conf/spark-defaults.conf
# Execute the passed command
exec "$@"
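
The sed substitution above assumes /spark/conf/spark-defaults.conf.template carries ${...} placeholders for the temporary credentials, roughly like the following sketch (the template itself is not included in this diff):

spark.hadoop.fs.s3a.access.key     ${AWS_ACCESS_KEY_ID}
spark.hadoop.fs.s3a.secret.key     ${AWS_SECRET_ACCESS_KEY}
spark.hadoop.fs.s3a.session.token  ${AWS_SESSION_TOKEN}

These are the properties the TemporaryAWSCredentialsProvider configured in spark-defaults.conf reads for session-based S3 access.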
66 changes: 32 additions & 34 deletions .github/workflows/spark_deployment/spark-defaults.conf
@@ -1,44 +1,42 @@
spark.master spark://spark-master:7077

spark.sql.warehouse.dir s3a://dbt-spark-iceberg/github-integration-testing
# Catalog and Schema Settings
spark.sql.defaultCatalog glue
spark.sql.catalog.glue org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse s3a://dbt-spark-iceberg/github-integration-testing
spark.sql.catalog.glue.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.defaultCatalog glue
spark.sql.catalog.glue.database dbt-spark-iceberg

spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key <AWS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key <AWS_SECRET_ACCESS_KEY>
spark.hadoop.fs.s3a.endpoint s3.eu-west-1.amazonaws.com
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.region eu-west-1
spark.hadoop.fs.s3a.aws.region eu-west-1
# Default Schema Configuration
spark.sql.catalog.glue.default-namespace default_snowplow_manifest

# Critical Iceberg Settings
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.iceberg.handle-timestamp-without-timezone true
spark.wds.iceberg.format-version 2

# Enabling AWS SDK V4 signing (required for regions launched after January 2014)
spark.hadoop.com.amazonaws.services.s3.enableV4 true
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
# Enhanced Iceberg Write Settings
spark.sql.iceberg.write.distribution-mode range
spark.sql.iceberg.write.accept-any-schema true
spark.sql.iceberg.write.merge.mode copy-on-write
spark.sql.iceberg.write.format.default parquet
spark.sql.iceberg.write-partitioned-fanout.enabled true

# Hive Metastore Configuration (using AWS Glue)
spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
# Performance Settings
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.shuffle.partitions 10
spark.sql.parquet.compression.codec zstd

# Thrift Server Configuration for better performance in concurrent environments
spark.sql.hive.thriftServer.singleSession false
spark.sql.hive.thriftServer.async true
# spark.sql.hive.thriftServer.maxWorkerThreads 100
# spark.sql.hive.thriftServer.minWorkerThreads 50
# spark.sql.hive.thriftServer.workerQueue.size 2000
# AWS S3 Settings
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.connection.ssl.enabled true

# Memory and Performance Tuning
# spark.driver.memory 2g
# spark.executor.memory 3g
# spark.worker.memory 4g
spark.network.timeout 600s
spark.sql.broadcastTimeout 600s
spark.sql.adaptive.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
# Memory Settings
spark.driver.memory 10g
spark.executor.memory 3g
spark.memory.fraction 0.85

# Logging and Debugging
spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-events
# Default Source Settings
spark.sql.sources.default iceberg
spark.sql.sources.partitionOverwriteMode dynamic
22 changes: 22 additions & 0 deletions integration_tests/dbt_project.yml
@@ -27,6 +27,16 @@ quoting:
models:
snowplow_unified_integration_tests:
+materialized: table
+tblproperties:
write.format.default: parquet
write.metadata.delete-after-commit.enabled: true
write.distribution-mode: hash
write.merge.mode: merge-on-read
write.update.mode: merge-on-read
write.delete.mode: merge-on-read
format-version: '2'
write.target-file-size-bytes: '536870912'
write.metadata.previous-versions-max: '100'
bind: false
+schema: "snplw_unified_int_tests"
source:
@@ -40,6 +50,18 @@ models:
+enabled: "{{ target.type == 'snowflake' | as_bool() }}"
spark:
+enabled: "{{ target.type == 'spark' | as_bool() }}"
snowplow_unified:
+file_format: iceberg
+tblproperties:
write.format.default: parquet
write.metadata.delete-after-commit.enabled: true
write.distribution-mode: hash
write.merge.mode: merge-on-read
write.update.mode: merge-on-read
write.delete.mode: merge-on-read
format-version: '2'
write.target-file-size-bytes: '536870912'
write.metadata.previous-versions-max: '100'
vars:
snowplow_unified:
snowplow__allow_null_dvce_tstamps: true