Add the Flyte agent to provision and manage K8s (data) service for deep learning (GNN) use cases #3004
983dc2a to 944a500
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

Coverage Diff (master vs. #3004)

| Metric | master | #3004 | +/- |
|---|---|---|---|
| Coverage | 51.08% | 90.46% | +39.38% |
| Files | 201 | 100 | -101 |
| Lines | 21231 | 4920 | -16311 |
| Branches | 2731 | 0 | -2731 |
| Hits | 10846 | 4451 | -6395 |
| Misses | 9787 | 469 | -9318 |
| Partials | 598 | 0 | -598 |
This is amazing!!! I left some minor comments.
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/agent.py (outdated, resolved)
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/agent.py (outdated, resolved)
plugins/flytekit-k8sdataservice/k8s_ops/k8s-service-agent-rolebinding.yaml (outdated, resolved)
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py (outdated, resolved)
plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py (outdated, resolved)
plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py (outdated, resolved)
a0c5d8e to ec6d4c1
@@ -0,0 +1,86 @@
# Example of the role/binding set up for the data service to create/update/delete resources in the sandbox flyte namespace
apiVersion: rbac.authorization.k8s.io/v1
Instead of adding this file to flytekit, could we add this to the agent setup guide? https://docs.flyte.org/en/latest/deployment/agents/databricks.html#deployment-agent-setup-databricks
Since the documentation lives in the flyte repo (https://github.com/flyteorg/flyte/tree/master/docs/deployment/agents), I will remove the setup guide here and open a new branch in the flyte repo.
plugins/flytekit-k8sdataservice/k8s_ops/k8s-service-agent-rolebinding.yaml (outdated, resolved)
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/__init__.py (outdated, resolved)
a2e628f to 43e2733
Code Review Agent Run #a24be6
Actionable Suggestions - 16
Additional Suggestions - 10
plugins/flytekit-k8sdataservice/tests/k8sdataservice/test_agent.py (outdated, resolved)
def test_show_environment():
    env = Environment(retries=2)
    env.show()
The test_show_environment() test case appears to be incomplete as it only calls show() without any assertions to verify the expected behavior. Consider adding assertions to validate the output format and content.
Code suggestion
Check the AI-generated fix before applying
@@ -74,5 +74,10 @@
  def test_show_environment():
      env = Environment(retries=2)
+     from io import StringIO
+     import sys
+     captured_output = StringIO()
+     sys.stdout = captured_output
      env.show()
+     sys.stdout = sys.__stdout__
+     assert "retries" in captured_output.getvalue()
+     assert "2" in captured_output.getvalue()
Code Review Run #a24be6
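For reference, the stdout-capturing idea in the suggestion above can also be written with `contextlib.redirect_stdout`, which restores `sys.stdout` automatically. This is only a sketch: `FakeEnvironment` is a hypothetical stand-in for the `Environment` class under test, and the printed format is an assumption, not the plugin's actual output.

```python
import contextlib
import io


class FakeEnvironment:
    """Hypothetical stand-in for the Environment class under test."""

    def __init__(self, retries: int) -> None:
        self.retries = retries

    def show(self) -> None:
        # Assumed output format; the real class may print something different.
        print(f"retries={self.retries}")


def capture_show(env: FakeEnvironment) -> str:
    """Run env.show() and return whatever it wrote to stdout."""
    buffer = io.StringIO()
    # redirect_stdout puts sys.stdout back on exit, even if show() raises.
    with contextlib.redirect_stdout(buffer):
        env.show()
    return buffer.getvalue()
```

Under pytest, the built-in `capsys` fixture would be another way to get the same text without touching `sys.stdout` by hand.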
tests/flytekit/integration/remote/workflows/basic/attr_access_sd.py (outdated, resolved)
def create(
    self, task_template: TaskTemplate, output_prefix: str, inputs: Optional[LiteralMap] = None, **kwargs
) -> DataServiceMetadata:
The 'create' method has unused parameters and is missing type annotations for 'kwargs'.
Code suggestion
Check the AI-generated fix before applying
- def create(
-     self, task_template: TaskTemplate, output_prefix: str, inputs: Optional[LiteralMap] = None, **kwargs
- ) -> DataServiceMetadata:
+ def create(
+     self, task_template: TaskTemplate, **kwargs: dict[str, Any]
+ ) -> DataServiceMetadata:
Code Review Run #a24be6
def gen_infra_name() -> str:
    random_uuid = uuid.uuid4().hex
    hash_object = hashlib.sha1(random_uuid.encode())
The use of 'sha1' hash function is considered insecure. Consider using a more secure alternative like 'sha256' or 'sha512'.
Code suggestion
Check the AI-generated fix before applying
- hash_object = hashlib.sha1(random_uuid.encode())
+ hash_object = hashlib.sha256(random_uuid.encode())
Code Review Run #a24be6
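In the snippet above the digest appears to serve only as a source of name characters, not as a security control. A minimal sketch of that pattern follows; `gen_short_name`, its `prefix` and `length` parameters, and the output format are all hypothetical, and the rest of the real `gen_infra_name` is not shown in this thread. On Python 3.9+ the `usedforsecurity=False` flag states that intent explicitly.

```python
import hashlib
import uuid


def gen_short_name(prefix: str = "svc", length: int = 8) -> str:
    """Return a name like '<prefix>-<hex chars>' derived from a random UUID."""
    random_uuid = uuid.uuid4().hex
    # usedforsecurity=False (Python 3.9+) marks the digest as a plain identifier.
    digest = hashlib.sha1(random_uuid.encode(), usedforsecurity=False).hexdigest()
    return f"{prefix}-{digest[:length]}"
```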
self.k8s_config = KubeConfig()
self.k8s_config.load_kube_config()
self.apps_v1_api = client.AppsV1Api()
self.core_v1_api = client.CoreV1Api()
self.custom_api = client.CustomObjectsApi()
self.release_name = release_name
self.cleanup_data_service = cleanup_data_service
self.namespace = "flyte"
self.cluster = cluster
Consider moving the Kubernetes client initialization to the __init__ method, since these configurations are used across multiple methods and don't need to be recreated on each poke call.
Code suggestion
Check the AI-generated fix before applying
  super().__init__(name=name, task_type="sensor", **kwargs)
+ self.k8s_config = KubeConfig()
+ self.k8s_config.load_kube_config()
+ self.apps_v1_api = client.AppsV1Api()
+ self.core_v1_api = client.CoreV1Api()
+ self.custom_api = client.CustomObjectsApi()
+ self.namespace = "flyte"
+
@@ -27,12 +34,6 @@
  it for simplicity. This is also why we use the sensor API to keep forward compatibility
  """
- self.k8s_config = KubeConfig()
- self.k8s_config.load_kube_config()
- self.apps_v1_api = client.AppsV1Api()
- self.core_v1_api = client.CoreV1Api()
- self.custom_api = client.CustomObjectsApi()
- self.namespace = "flyte"
  self.release_name = release_name
  self.cleanup_data_service = cleanup_data_service
  self.cluster = cluster
Code Review Run #a24be6
microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = ["flytekit>=1.11.0", "kubernetes>=23.6.0", "flyteidl>=1.11.0"]
Consider pinning exact versions of dependencies instead of using >= to ensure reproducible builds. The current setup allows any version above the minimum, which could lead to compatibility issues.
Code suggestion
Check the AI-generated fix before applying
- plugin_requires = ["flytekit>=1.11.0", "kubernetes>=23.6.0", "flyteidl>=1.11.0"]
+ plugin_requires = ["flytekit>=1.11.0,<2.0.0", "kubernetes>=23.6.0,<24.0.0", "flyteidl>=1.11.0,<2.0.0"]
Code Review Run #a24be6
flytekit/core/local_cache.py (outdated)
input_literal_map,
cache_ignore_input_vars,
),
value.to_flyte_idl().SerializeToString(),
Consider adding error handling around the to_flyte_idl().SerializeToString() call, as serialization operations can potentially fail.
Code suggestion
Check the AI-generated fix before applying
- value.to_flyte_idl().SerializeToString(),
+ try:
+     value.to_flyte_idl().SerializeToString()
+ except Exception as e:
+     logger.error(f"Failed to serialize literal map: {e}")
+     raise
Code Review Run #a24be6
def convert_flyte_to_k8s_fields(resources_dict):
    return {("memory" if "mem" in k else k): v for k, v in resources_dict.items()}
Consider using a more explicit dictionary comprehension for convert_flyte_to_k8s_fields. The current implementation using 'mem' in k could match unintended keys containing 'mem'. Consider using exact key matching.
Code suggestion
Check the AI-generated fix before applying
- return {("memory" if "mem" in k else k): v for k, v in resources_dict.items()}
+ return {("memory" if k == "mem" else k): v for k, v in resources_dict.items()}
Code Review Run #a24be6
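To make the difference between the two variants concrete, here is a small side-by-side sketch. The function names are illustrative, and the `member_count` key is a made-up example of an unintended match rather than a real Flyte resource field.

```python
def convert_substring(resources_dict: dict) -> dict:
    """Original behaviour: any key containing 'mem' is renamed to 'memory'."""
    return {("memory" if "mem" in k else k): v for k, v in resources_dict.items()}


def convert_exact(resources_dict: dict) -> dict:
    """Suggested behaviour: only the exact key 'mem' is renamed."""
    return {("memory" if k == "mem" else k): v for k, v in resources_dict.items()}


# 'member_count' is a hypothetical key used only to show the collision.
resources = {"cpu": "1", "mem": "2Gi", "member_count": 3}

substring_result = convert_substring(resources)  # both 'mem*' keys collapse into 'memory'
exact_result = convert_exact(resources)          # 'member_count' is left untouched
```

With the inputs the plugin is expected to receive (keys such as `cpu` and `mem`), both variants give the same result, which is consistent with the thread's outcome that the flag was not a real issue.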
@patch("flytekitplugins.k8sdataservice.k8s.manager.client.AppsV1Api.create_namespaced_stateful_set")
def test_create_stateful_set_failure(self, mock_create_namespaced_stateful_set):
    mock_create_namespaced_stateful_set.side_effect = ApiException("Create failed")
    stateful_set_object = self.k8s_manager.create_stateful_set_object()
    response = self.k8s_manager.create_stateful_set(stateful_set_object)
    self.assertEqual(response, "failed_stateful_set_name")
The test case test_create_stateful_set_failure could be improved by verifying that the error message from ApiException is properly logged. Consider asserting the logger call with the expected error message.
Code suggestion
Check the AI-generated fix before applying
- @patch("flytekitplugins.k8sdataservice.k8s.manager.client.AppsV1Api.create_namespaced_stateful_set")
- def test_create_stateful_set_failure(self, mock_create_namespaced_stateful_set):
-     mock_create_namespaced_stateful_set.side_effect = ApiException("Create failed")
-     stateful_set_object = self.k8s_manager.create_stateful_set_object()
-     response = self.k8s_manager.create_stateful_set(stateful_set_object)
-     self.assertEqual(response, "failed_stateful_set_name")
+ @patch("flytekitplugins.k8sdataservice.k8s.manager.logger")
+ def test_create_stateful_set_failure(self, mock_create_namespaced_stateful_set):
+     mock_create_namespaced_stateful_set.side_effect = ApiException("Create failed")
+     stateful_set_object = self.k8s_manager.create_stateful_set_object()
+     response = self.k8s_manager.create_stateful_set(stateful_set_object)
+     self.assertEqual(response, "failed_stateful_set_name")
+     mock_logger.error.assert_called_once()
+     logged_message = mock_logger.error.call_args[0][0]
+     self.assertIn("Exception when calling AppsV1Api->create_namespaced_stateful_set: Create failed", logged_message)
Code Review Run #a24be6
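The general pattern behind this suggestion, checking that a failure is both reported to the caller and logged, can be sketched without any Kubernetes dependency. Everything below is a stand-in: `ApiError` replaces the client's `ApiException`, and `create_resource` is a hypothetical function, not the plugin's `create_stateful_set`.

```python
import logging
import unittest
from unittest.mock import patch

logger = logging.getLogger("example.manager")


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's ApiException."""


def create_resource(create_fn) -> str:
    """Call create_fn and return a status string instead of raising on API errors."""
    try:
        create_fn()
        return "created"
    except ApiError as e:
        logger.error(f"Exception when creating resource: {e}")
        return "failed"


class CreateResourceTest(unittest.TestCase):
    def test_failure_is_logged(self):
        def failing_create():
            raise ApiError("Create failed")

        # Patch only logger.error so the call and its message can be inspected.
        with patch.object(logger, "error") as mock_error:
            result = create_resource(failing_create)

        self.assertEqual(result, "failed")
        mock_error.assert_called_once()
        self.assertIn("Create failed", mock_error.call_args[0][0])
```

Note that the mock has to be bound to a name the test can see (`mock_error` here); the bot's suggestion above refers to `mock_logger` without declaring it as a parameter.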
…In internal things removed Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* Fix pydantic default input
* add pydantic integration test
* Use duck typing by Thomas's advice
* lint
---------
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* fix: Open FlyteFile from remote path
* Add integration test
* refactor: Use ctx as param instead of recreation
* refactor: Clean test logic
  1. Remove redundant prints
  2. Use `mock.patch.dict` to setup `os.environ` for the current test fn
     * Avoid contaminating other tests running in the same process
* refactor: Setup local path and downloader in constructor
* refactor: Move SimpleFileTransfer to an utility file
* Remove redundant env var setup (please refer to flyteorg#3001)
* test: Add another ff use case: create ff in one task pod and read it in another task pod.
---------
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
* test: Add integration test for attr access of sd
* Correct file path
* test: Support interaction with minio s3 bucket
  1. Upload a local parquet file to minio s3 bucket
  2. Access StructuredDataset attr from a dataclass
  3. Open StructuredDataset from a remote path
* Delete an unmerged integration test
* Try imagespec with commit sha of corresponding fix
* Remove redundant test
* Remove default_factory and create sd dc from input uri
* refactor: Clean test logic
  1. Remove redundant prints
  2. Use `mock.patch.dict` to setup `os.environ` for the current test fn
     * Avoid contaminating other tests running in the same process
* Remove redundant minio env var setup and add test comments
* Support uploading tmp pqt file
* Update deprecated module
* Remove redundant and unused imports
---------
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
…rg#3043) Signed-off-by: Yee Hing Tong <[email protected]> Signed-off-by: Shuying Liang <[email protected]>
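Two of the commit messages above mention using `mock.patch.dict` to set up `os.environ` for a single test function so that other tests in the same process are not affected. A minimal sketch of that technique is shown here; the variable name `EXAMPLE_ENDPOINT` and the helper `read_endpoint` are made up for illustration and do not come from the flytekit tests.

```python
import os
from unittest import mock


def read_endpoint() -> str:
    """Read a (hypothetical) endpoint setting from the environment."""
    return os.environ.get("EXAMPLE_ENDPOINT", "unset")


# The patch is active only while the decorated function runs;
# os.environ is restored afterwards, even if the test fails.
@mock.patch.dict(os.environ, {"EXAMPLE_ENDPOINT": "http://localhost:9000"})
def test_read_endpoint_with_patched_env() -> None:
    assert read_endpoint() == "http://localhost:9000"
```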
* make _downloader function in FlyteFile/Directory pickleable
* make FlyteFile and Directory pickleable
* remove unnecessary helper functions
* fix lint
* use partials instead of lambda
* fix lint
* remove unneeded helper function
* update FlyteFilePathTransformer.downloader method
* remove downloader staticmethod
* fix lint
---------
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
Signed-off-by: Shuying Liang <[email protected]>
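One of the commits above switches from lambdas to partials to make a downloader pickleable. The reason can be illustrated with the standard library alone; this sketch uses `posixpath.join` as a stand-in for a downloader function and is not the flytekit implementation.

```python
import functools
import pickle
import posixpath

# A partial over an importable function pickles by reference to that function
# plus its bound arguments.
partial_fn = functools.partial(posixpath.join, "/tmp/downloads")
restored = pickle.loads(pickle.dumps(partial_fn))

# A lambda has no importable name, so the standard pickle module rejects it.
lambda_fn = lambda name: posixpath.join("/tmp/downloads", name)
try:
    pickle.dumps(lambda_fn)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

Both callables behave the same when called directly; only the partial survives a pickle round trip.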
43e2733 to 391df53
Code Review Agent Run #b617d8
Actionable Suggestions - 6
Additional Suggestions - 2
with self.assertLogs('flytekit', level='WARNING') as log:
    kube_config.load_kube_config()
self.assertIn("Failed to load in-cluster configuration.", log.output[-1])
Consider adding more specific assertions for the warning message content. The current assertion only checks for a substring which could potentially match unintended messages.
Code suggestion
Check the AI-generated fix before applying
- self.assertIn("Failed to load in-cluster configuration.", log.output[-1])
+ self.assertEqual(f"WARNING:flytekit:Failed to load in-cluster configuration. In-cluster config not found.", log.output[-1])
Code Review Run #b617d8
plugins/flytekit-k8sdataservice/flytekitplugins/k8sdataservice/k8s/manager.py (resolved)
)
return "success"

if status.replicas > 0 and status.available_replicas is not None and status.available_replicas >= 0:
Consider simplifying the condition status.available_replicas is not None and status.available_replicas >= 0 to just status.available_replicas >= 0, since checking for non-None is redundant when comparing with 0.
Code suggestion
Check the AI-generated fix before applying
- if status.replicas > 0 and status.available_replicas is not None and status.available_replicas >= 0:
+ if status.replicas > 0 and status.available_replicas >= 0:
Code Review Run #b617d8
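A short check shows why the None guard in the original condition is not redundant in Python 3. The sketch assumes, as the original guard implies, that `available_replicas` can be `None` while the StatefulSet is still starting; `SimpleNamespace` stands in for the Kubernetes status object and `is_ready` is an illustrative name.

```python
from types import SimpleNamespace


def is_ready(status) -> bool:
    """Guarded condition, mirroring the original line in manager.py."""
    return (
        status.replicas > 0
        and status.available_replicas is not None
        and status.available_replicas >= 0
    )


# Stand-in for a status object whose available_replicas is not populated yet.
pending = SimpleNamespace(replicas=1, available_replicas=None)

guarded_result = is_ready(pending)  # short-circuits before the comparison

# Without the guard, comparing None with an int raises TypeError in Python 3.
try:
    pending.available_replicas >= 0
    unguarded_raised = False
except TypeError:
    unguarded_raised = True
```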
task_metadata = task.TaskMetadata(
    discoverable=True,
    runtime=task.RuntimeMetadata(task.RuntimeMetadata.RuntimeType.FLYTE_SDK, "1.0.0", "python"),
    timeout=timedelta(days=1),
    retries=literals.RetryStrategy(3),
    interruptible=True,
    discovery_version="0.1.1b0",
    deprecated_error_message="This is deprecated!",
    cache_serializable=True,
    pod_template_name="A",
    cache_ignore_input_vars=(),
)
Consider consolidating the task metadata initialization by extracting common values into constants or helper functions. The current implementation has repeated task metadata setup across multiple test cases.
Code suggestion
Check the AI-generated fix before applying
- task_metadata = task.TaskMetadata(
-     discoverable=True,
-     runtime=task.RuntimeMetadata(task.RuntimeMetadata.RuntimeType.FLYTE_SDK, "1.0.0", "python"),
-     timeout=timedelta(days=1),
-     retries=literals.RetryStrategy(3),
-     interruptible=True,
-     discovery_version="0.1.1b0",
-     deprecated_error_message="This is deprecated!",
-     cache_serializable=True,
-     pod_template_name="A",
-     cache_ignore_input_vars=(),
- )
+ task_metadata = create_test_task_metadata()
Code Review Run #b617d8
Name: Optional[str] = None
Requests: Optional[Resources] = None
Limits: Optional[Resources] = None
Port: Optional[int] = None
Image: Optional[str] = None
Command: Optional[List[str]] = None
Replicas: Optional[int] = None
ExistingReleaseName: Optional[str] = None
Cluster: Optional[str] = None
Consider using snake_case for attribute names in the DataServiceConfig class to follow Python naming conventions. Attributes like Name, Requests, Limits, etc. should be lowercase.
Code suggestion
Check the AI-generated fix before applying
- Name: Optional[str] = None
- Requests: Optional[Resources] = None
- Limits: Optional[Resources] = None
- Port: Optional[int] = None
- Image: Optional[str] = None
- Command: Optional[List[str]] = None
- Replicas: Optional[int] = None
- ExistingReleaseName: Optional[str] = None
- Cluster: Optional[str] = None
+ name: Optional[str] = None
+ requests: Optional[Resources] = None
+ limits: Optional[Resources] = None
+ port: Optional[int] = None
+ image: Optional[str] = None
+ command: Optional[List[str]] = None
+ replicas: Optional[int] = None
+ existing_release_name: Optional[str] = None
+ cluster: Optional[str] = None
Code Review Run #b617d8
data_service_name = self.release_name
logger.info(f"Sensor got the release name: {self.release_name}")
try:
    # Delete the Service associated with the graph engine
    self.core_v1_api.delete_namespaced_service(
        name=data_service_name, namespace=self.namespace, body=client.V1DeleteOptions()
    )
    logger.info(f"Deleted Service: {data_service_name}")
except ApiException as e:
    logger.error(f"Error deleting Service: {e}")

try:
    # Delete the StatefulSet associated with the graph engine
    self.apps_v1_api.delete_namespaced_stateful_set(
        name=data_service_name, namespace=self.namespace, body=client.V1DeleteOptions()
    )
    logger.info(f"Deleted StatefulSet: {data_service_name}")
except ApiException as e:
    logger.error(f"Error deleting StatefulSet: {e}")
The error handling for Service and StatefulSet deletion is duplicated. Consider consolidating the error handling logic into a helper method.
Code suggestion
Check the AI-generated fix before applying
- data_service_name = self.release_name
- logger.info(f"Sensor got the release name: {self.release_name}")
- try:
-     # Delete the Service associated with the graph engine
-     self.core_v1_api.delete_namespaced_service(
-         name=data_service_name, namespace=self.namespace, body=client.V1DeleteOptions()
-     )
-     logger.info(f"Deleted Service: {data_service_name}")
- except ApiException as e:
-     logger.error(f"Error deleting Service: {e}")
- try:
-     # Delete the StatefulSet associated with the graph engine
-     self.apps_v1_api.delete_namespaced_stateful_set(
-         name=data_service_name, namespace=self.namespace, body=client.V1DeleteOptions()
-     )
-     logger.info(f"Deleted StatefulSet: {data_service_name}")
- except ApiException as e:
-     logger.error(f"Error deleting StatefulSet: {e}")
+ def delete_resource(resource_type: str, delete_fn):
+     try:
+         delete_fn(
+             name=self.release_name,
+             namespace=self.namespace,
+             body=client.V1DeleteOptions()
+         )
+         logger.info(f"Deleted {resource_type}: {self.release_name}")
+     except ApiException as e:
+         logger.error(f"Error deleting {resource_type}: {e}")
+ logger.info(f"Sensor got the release name: {self.release_name}")
+ delete_resource("Service", self.core_v1_api.delete_namespaced_service)
+ delete_resource("StatefulSet", self.apps_v1_api.delete_namespaced_stateful_set)
Code Review Run #b617d8
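The consolidation idea from this thread can be tried out in isolation with fake delete functions. This is a sketch under stated assumptions: `ApiError` stands in for the Kubernetes client's `ApiException`, the delete callables are fakes, and the boolean return value is an addition for easier testing that the plugin code does not have.

```python
import logging
from typing import Callable

logger = logging.getLogger("example.cleanup")


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's ApiException."""


def delete_resource(kind: str, name: str, namespace: str, delete_fn: Callable[..., object]) -> bool:
    """Delete one resource; log and swallow API errors so later deletions still run."""
    try:
        delete_fn(name=name, namespace=namespace)
        logger.info(f"Deleted {kind}: {name}")
        return True
    except ApiError as e:
        logger.error(f"Error deleting {kind}: {e}")
        return False


def fake_delete_ok(name: str, namespace: str) -> None:
    """Fake API call that succeeds."""


def fake_delete_missing(name: str, namespace: str) -> None:
    """Fake API call that fails the way a missing resource might."""
    raise ApiError(f"{name} not found in {namespace}")


# A failed Service deletion does not prevent the StatefulSet deletion from running.
results = [
    delete_resource("Service", "gnn-data", "flyte", fake_delete_missing),
    delete_resource("StatefulSet", "gnn-data", "flyte", fake_delete_ok),
]
```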
Why are the changes needed?
Graph Neural Networks are critical for understanding complex relationships across LinkedIn's professional networks. However, training these models at scale involves intricate data loading, sampling, and processing across multiple nodes and GPUs. The missing piece is the infrastructure to support how and where to run these Kubernetes data services, making them scalable and reliable along with the training or inference processes.
To simplify this complex orchestration pipeline, we decided to leverage the Flyte agent framework to provision and manage the data services for the GNN use case.
What changes were proposed in this pull request?
This PR adds a Flyte agent that creates, updates, and deletes the K8s StatefulSet and Service.
How was this patch tested?
MPIJobs (for deep learning GNN training) or TFJob (for offline inference)
Setup process
pip install flytekitplugins-k8sdataservice
Screenshots
Check all the applicable boxes
Docs link
Blog from Flyte community sync
Summary by Bito
This PR introduces a new K8s data service plugin for Flyte, specifically designed to support GNN training workloads. The implementation includes a DataServiceAgent for managing K8s resources (StatefulSets and Services), a CleanupSensor for resource cleanup, and comprehensive resource management functionality. The changes are accompanied by extensive unit tests and documentation.
Unit tests added: True
Estimated effort to review (1-5, lower is better): 5