Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): Refactor API server to introduce support for Docker image registries and s3 bucket support #605

Merged
merged 32 commits into from
Nov 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
d438422
Add multi keychain when checking if images exist
deadlycoconuts Sep 6, 2024
ec94d9f
Refactor API methods to make GCP dependencies optional
deadlycoconuts Sep 12, 2024
ea8c5b3
Update go mod file with temp dependencies
deadlycoconuts Sep 12, 2024
110adf7
Update pyfunc server dockerfiles
deadlycoconuts Sep 12, 2024
c217220
Replace curl with wget
deadlycoconuts Sep 12, 2024
510f03e
Rearrange dependencies
deadlycoconuts Sep 12, 2024
e341e5d
Add step to install unzip
deadlycoconuts Sep 12, 2024
9502cbe
Fix apt-get installation
deadlycoconuts Sep 12, 2024
0000fcb
Update pyfunc server dockerfile
deadlycoconuts Sep 12, 2024
980dc74
Fix image builder error parsing
deadlycoconuts Sep 13, 2024
66bd797
Fix pyfunc-server dockerfile
deadlycoconuts Sep 13, 2024
36c5d54
Fix s3 cp destination in dockerfile
deadlycoconuts Sep 13, 2024
d3b9979
Update batch predictor dockerfiles
deadlycoconuts Sep 16, 2024
cafc554
Fix broken model dependencies hash check
deadlycoconuts Sep 16, 2024
8a44fa7
Add step to install unzip
deadlycoconuts Sep 16, 2024
9f84f6d
Add kaniko push registry type to e2e test configs
deadlycoconuts Sep 17, 2024
5bbdcac
Fix unit test
deadlycoconuts Sep 17, 2024
07bbfdf
Update go mod files
deadlycoconuts Sep 17, 2024
57aa172
Clean up image builder
deadlycoconuts Sep 18, 2024
f6b3ff0
Remove the redundant usage of multikeychains
deadlycoconuts Sep 20, 2024
c0ec620
Remove redundant config value
deadlycoconuts Sep 20, 2024
f0d8b6d
Fix incorrect kaniko push registry type specified in app startup
deadlycoconuts Sep 26, 2024
659940d
Refactor code to improve code readability
deadlycoconuts Sep 30, 2024
6bd7a4d
Update go mod files
deadlycoconuts Oct 10, 2024
acccc98
Remove code to parse env vars into kaniko arguments
deadlycoconuts Oct 20, 2024
8115572
Remove codecov github action and configs temporarily
deadlycoconuts Oct 20, 2024
e2dd955
Fix bug whereby gcs credentials are not passed to kaniko when gcs is …
deadlycoconuts Oct 20, 2024
ea7901e
Undo unnecessary default db port change
deadlycoconuts Oct 20, 2024
35ad11e
Add additional steps to download merlin artifacts from s3 storage loc…
deadlycoconuts Nov 7, 2024
fe5de0d
Introduce new constants for commonly used artifact and registry types…
deadlycoconuts Nov 7, 2024
163f60b
Fix leftover gcs naming
deadlycoconuts Nov 7, 2024
76ecdc5
Fix incorrect function name called in unit test
deadlycoconuts Nov 7, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 0 additions & 5 deletions .github/workflows/codecov-config/codecov.yml

This file was deleted.

14 changes: 0 additions & 14 deletions .github/workflows/merlin.yml
Original file line number Diff line number Diff line change
Expand Up @@ -137,13 +137,6 @@ jobs:
E2E_USE_GOOGLE_OAUTH: false
working-directory: ./python/sdk
run: make unit-test
- uses: codecov/codecov-action@v4
deadlycoconuts marked this conversation as resolved.
Show resolved Hide resolved
with:
flags: sdk-test-${{ matrix.python-version }}
name: sdk-test-${{ matrix.python-version }}
token: ${{ secrets.CODECOV_TOKEN }}
working-directory: ./python/sdk
codecov_yml_path: ../../.github/workflows/codecov-config/codecov.yml

lint-api:
runs-on: ubuntu-latest
Expand Down Expand Up @@ -188,13 +181,6 @@ jobs:
POSTGRES_USER: ${{ secrets.DB_USERNAME }}
POSTGRES_PASSWORD: ${{ secrets.DB_PASSWORD }}
run: make it-test-api-ci
- uses: codecov/codecov-action@v4
with:
flags: api-test
name: api-test
token: ${{ secrets.CODECOV_TOKEN }}
working-directory: ./api
codecov_yml_path: ../.github/workflows/codecov-config/codecov.yml

test-observation-publisher:
runs-on: ubuntu-latest
Expand Down
91 changes: 50 additions & 41 deletions api/cmd/api/setup.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,6 @@ import (
"net/http"
"time"

gcs "cloud.google.com/go/storage"
"github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/client/clientset/versioned"
"github.com/caraml-dev/merlin/webhook"
"github.com/caraml-dev/mlp/api/pkg/artifact"
Expand Down Expand Up @@ -105,55 +104,65 @@ func initImageBuilder(cfg *config.Config) (webserviceBuilder imagebuilder.ImageB
}

var artifactService artifact.Service
if cfg.ImageBuilderConfig.ArtifactServiceType == "gcs" {
gcsClient, err := gcs.NewClient(context.Background())
if err != nil {
log.Panicf("%s,failed initializing gcs for mlflow delete package", err.Error())
}
artifactService = artifact.NewGcsArtifactClient(gcsClient)
} else if cfg.ImageBuilderConfig.ArtifactServiceType == "nop" {
if cfg.MlflowConfig.ArtifactServiceType == "gcs" {
artifactService, err = artifact.NewGcsArtifactClient()
} else if cfg.MlflowConfig.ArtifactServiceType == "s3" {
artifactService, err = artifact.NewS3ArtifactClient()
} else if cfg.MlflowConfig.ArtifactServiceType == "nop" {
artifactService = artifact.NewNopArtifactClient()
} else {
log.Panicf("invalid artifact service type %s", cfg.ImageBuilderConfig.ArtifactServiceType)
log.Panicf("invalid artifact service type %s", cfg.MlflowConfig.ArtifactServiceType)
}
if err != nil {
log.Panicf("%s,failed initializing mlflow artifact service", err.Error())
}

if cfg.ImageBuilderConfig.KanikoPushRegistryType != "gcr" &&
cfg.ImageBuilderConfig.KanikoPushRegistryType != "docker" {
log.Panicf("invalid kaniko push registry type %s", cfg.ImageBuilderConfig.KanikoPushRegistryType)
}

webServiceConfig := imagebuilder.Config{
BaseImage: cfg.ImageBuilderConfig.BaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
BaseImage: cfg.ImageBuilderConfig.BaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoPushRegistryType: cfg.ImageBuilderConfig.KanikoPushRegistryType,
KanikoDockerCredentialSecretName: cfg.ImageBuilderConfig.KanikoDockerCredentialSecretName,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
}
webserviceBuilder = imagebuilder.NewModelServiceImageBuilder(kubeClient, webServiceConfig, artifactService)

predJobConfig := imagebuilder.Config{
BaseImage: cfg.ImageBuilderConfig.PredictionJobBaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
BaseImage: cfg.ImageBuilderConfig.PredictionJobBaseImage,
BuildNamespace: cfg.ImageBuilderConfig.BuildNamespace,
DockerRegistry: cfg.ImageBuilderConfig.DockerRegistry,
BuildTimeoutDuration: timeout,
KanikoImage: cfg.ImageBuilderConfig.KanikoImage,
KanikoPushRegistryType: cfg.ImageBuilderConfig.KanikoPushRegistryType,
KanikoDockerCredentialSecretName: cfg.ImageBuilderConfig.KanikoDockerCredentialSecretName,
KanikoServiceAccount: cfg.ImageBuilderConfig.KanikoServiceAccount,
KanikoAdditionalArgs: cfg.ImageBuilderConfig.KanikoAdditionalArgs,
DefaultResources: cfg.ImageBuilderConfig.DefaultResources,
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
SafeToEvict: cfg.ImageBuilderConfig.SafeToEvict,
ClusterName: cfg.ImageBuilderConfig.ClusterName,
GcpProject: cfg.ImageBuilderConfig.GcpProject,
Environment: cfg.Environment,
SupportedPythonVersions: cfg.ImageBuilderConfig.SupportedPythonVersions,
}
predJobBuilder = imagebuilder.NewPredictionJobImageBuilder(kubeClient, predJobConfig, artifactService)

Expand Down
36 changes: 21 additions & 15 deletions api/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,7 @@ type BaseImageConfig struct {
ImageName string `validate:"required" json:"imageName"`
// Dockerfile Path within the build context
DockerfilePath string `validate:"required" json:"dockerfilePath"`
// GCS URL Containing build context
// GCS/S3 URL Containing build context
BuildContextURI string `validate:"required" json:"buildContextURI"`
// Path to sub folder which is intended to build instead of using root folder
BuildContextSubPath string `json:"buildContextSubPath"`
Expand Down Expand Up @@ -209,18 +209,19 @@ type ClusterConfig struct {
}

type ImageBuilderConfig struct {
ClusterName string `validate:"required"`
GcpProject string
ArtifactServiceType string
deadlycoconuts marked this conversation as resolved.
Show resolved Hide resolved
BaseImage BaseImageConfig `validate:"required"`
PredictionJobBaseImage BaseImageConfig `validate:"required"`
BuildNamespace string `validate:"required" default:"mlp"`
DockerRegistry string `validate:"required"`
BuildTimeout string `validate:"required" default:"10m"`
KanikoImage string `validate:"required" default:"gcr.io/kaniko-project/executor:v1.6.0"`
KanikoServiceAccount string
KanikoAdditionalArgs []string
DefaultResources ResourceRequestsLimits `validate:"required"`
ClusterName string `validate:"required"`
GcpProject string
BaseImage BaseImageConfig `validate:"required"`
PredictionJobBaseImage BaseImageConfig `validate:"required"`
BuildNamespace string `validate:"required" default:"mlp"`
DockerRegistry string `validate:"required"`
BuildTimeout string `validate:"required" default:"10m"`
KanikoImage string `validate:"required" default:"gcr.io/kaniko-project/executor:v1.6.0"`
KanikoServiceAccount string
KanikoPushRegistryType string `validate:"required,oneof=docker gcr" default:"docker"`
KanikoDockerCredentialSecretName string
KanikoAdditionalArgs []string
DefaultResources ResourceRequestsLimits `validate:"required"`
// How long to keep the image building job resource in the Kubernetes cluster. Default: 2 days (48 hours).
Retention time.Duration `validate:"required" default:"48h"`
Tolerations Tolerations
Expand Down Expand Up @@ -453,8 +454,13 @@ type JaegerConfig struct {
}

type MlflowConfig struct {
TrackingURL string `validate:"required"`
ArtifactServiceType string `validate:"required"`
TrackingURL string `validate:"required_if=ArtifactServiceType gcs ArtifactServiceType s3"`
// Note that the Kaniko image builder needs to be configured correctly to have the necessary credentials to download
// the artifacts from the blob storage tool depending on the artifact service type selected (gcs/s3). For gcs, the
// credentials can be provided via a k8s service account or a secret but for s3, the credentials can be provided via
// additional arguments in the config KanikoAdditionalArgs e.g.
// --build-arg=[AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_DEFAULT_REGION/AWS_ENDPOINT_URL]=xxx
ArtifactServiceType string `validate:"required,oneof=nop gcs s3"`
}

func (cfg *Config) Validate() error {
Expand Down
18 changes: 9 additions & 9 deletions api/config/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -387,9 +387,8 @@ func TestLoad(t *testing.T) {
EnvironmentConfigs: []*EnvironmentConfig{},
},
ImageBuilderConfig: ImageBuilderConfig{
ClusterName: "test-cluster",
GcpProject: "test-project",
ArtifactServiceType: "gcs",
ClusterName: "test-cluster",
GcpProject: "test-project",
BaseImage: BaseImageConfig{
ImageName: "ghcr.io/caraml-dev/merlin/merlin-pyfunc-base:0.0.0",
DockerfilePath: "pyfunc-server/docker/Dockerfile",
Expand All @@ -403,12 +402,13 @@ func TestLoad(t *testing.T) {
BuildContextSubPath: "python",
MainAppPath: "/home/spark/merlin-spark-app/main.py",
},
BuildNamespace: "caraml",
DockerRegistry: "test-docker.pkg.dev/test/caraml-registry",
BuildTimeout: "30m",
KanikoImage: "gcr.io/kaniko-project/executor:v1.21.0",
KanikoServiceAccount: "kaniko-merlin",
KanikoAdditionalArgs: []string{"--test=true", "--no-logs=false"},
BuildNamespace: "caraml",
DockerRegistry: "test-docker.pkg.dev/test/caraml-registry",
BuildTimeout: "30m",
KanikoImage: "gcr.io/kaniko-project/executor:v1.21.0",
KanikoServiceAccount: "kaniko-merlin",
KanikoPushRegistryType: "docker",
KanikoAdditionalArgs: []string{"--test=true", "--no-logs=false"},
DefaultResources: ResourceRequestsLimits{
Requests: Resource{
CPU: "1",
Expand Down
23 changes: 21 additions & 2 deletions api/go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,14 @@ go 1.22

require (
cloud.google.com/go/bigtable v1.11.0
cloud.google.com/go/storage v1.39.0
github.com/GoogleCloudPlatform/spark-on-k8s-operator v0.0.0-20221025152940-c261df66a006
github.com/afex/hystrix-go v0.0.0-20180502004556-fa1af6a1f4f5
github.com/antihax/optional v1.0.0
github.com/antonmedv/expr v1.12.5
github.com/bboughton/gcp-helpers v0.1.0
github.com/buger/jsonparser v1.1.1
github.com/caraml-dev/merlin-pyspark-app v0.0.3
github.com/caraml-dev/mlp v1.13.2-rc2
github.com/caraml-dev/mlp v1.13.2
github.com/caraml-dev/protopath v0.1.0
github.com/caraml-dev/universal-prediction-interface v1.0.0
github.com/cenkalti/backoff/v4 v4.2.1
Expand Down Expand Up @@ -246,11 +245,31 @@ require (
)

require (
cloud.google.com/go/storage v1.39.0 // indirect
github.com/avast/retry-go/v4 v4.6.0 // indirect
github.com/aws/aws-sdk-go-v2 v1.30.6-0.20240906182417-827d25db0048 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.4 // indirect
github.com/aws/aws-sdk-go-v2/config v1.17.8 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.12.21 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.12.17 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.17 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.17 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.3.24 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.3.17 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.4 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.3.19 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.19 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.17.17 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.61.2 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.11.23 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.13.6 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.16.19 // indirect
github.com/aws/smithy-go v1.20.4 // indirect
golang.org/x/time v0.5.0 // indirect
k8s.io/klog/v2 v2.120.1 // indirect
k8s.io/kube-openapi v0.0.0-20231113174909-778a5567bc1e // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect

)

replace (
Expand Down
Loading
Loading