
refactor(sdk): added option for custom metric collector for tune in… #2406

Closed · wants to merge 13 commits
Changes from 11 commits
25 changes: 20 additions & 5 deletions sdk/python/v1beta1/kubeflow/katib/api/katib_client.py
@@ -188,7 +188,10 @@ def tune(
retain_trials: bool = False,
packages_to_install: List[str] = None,
pip_index_url: str = "https://pypi.org/simple",
metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"},
metrics_collector_config: Dict[str, Any] = {
"kind": "StdOut",
"custom_collector": None
},
):
"""Create HyperParameter Tuning Katib Experiment from the objective function.

@@ -253,9 +256,18 @@ def tune(
to the base image packages. These packages are installed before
executing the objective function.
pip_index_url: The PyPI url from which to install Python packages.
metrics_collector_config: Specify the config of metrics collector,
for example, `metrics_collector_config = {"kind": "Push"}`.
Currently, we only support `StdOut` and `Push` metrics collector.
Member: I think we may need to tell users about the supported types of MC. So can you re-add this line?

`metrics_collector_config`: Specify the configuration for the metrics collector with following keys:
- **kind**: Specify the kind of Metrics Collector. Currently supported values are:
- `StdOut`: Collects metrics from standard output.
- `None`: No metrics collection.
Member: This is not supported.

Author (@prakhar479, Sep 2, 2024): Thanks for pointing this out. Can you point me to where I can find all supported metrics collectors? For the current comment I referenced https://github.com/kubeflow/katib/blob/master/pkg/ui/v1beta1/frontend/src/app/models/experiment.k8s.model.ts

Member: Yeah, the UI also needs to be updated with the latest changes (cc @Electronic-Waste). Please refer to the official CRD APIs for the Metrics Collector spec: https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/common/v1beta1/common_types.go#L207-L227

Member: Thanks Andrey, I'll update UI with the latest changes.

- `File`: Writes metrics to a file.
- `TensorFlowEvent`: Collects metrics in TensorFlow Event format.
- `PrometheusMetric`: Exposes metrics in a Prometheus-compatible format.
- `Custom`: For custom metrics collection. Use the "custom_collector" key to specify the collector instance.

- **custom_collector**: If the `kind` is set to `Custom`, you must provide an instance of a custom `V1Container` as the value. For example:
`metrics_collector_config = {"kind" : "Custom", "custom_collector": <Instance of V1Container>}`.
Member: Can we add support for custom_collectors in follow-up PRs when we get user requests? I feel that Data Scientists who will use the tune API don't need such functionality.

Author: Sorry, but can you elaborate on this a bit? I am not quite sure what this intends, since custom metrics collection is already supported by the underlying framework for the tune function.

Member: I meant that we can introduce it once we find use cases where the tune function needs to be used with a Custom metrics collector. For example, I can see the value of the File metrics collector, since users can write metrics into a specific file during their training script, similar to TensorFlow events. However, for the custom metrics collector, I can't see a use case where it would be useful with the tune function.

Any thoughts @shannonbradshaw @prakhar479 @johnugeorge @tenzen-y ?

Raises:
ValueError: Function arguments have incorrect type or value.
@@ -396,7 +408,10 @@ def tune(
# Up to now, we only support parameter `kind`, of which default value
# is `StdOut`, to specify the kind of metrics collector.
experiment.spec.metrics_collector_spec = models.V1beta1MetricsCollectorSpec(
collector=models.V1beta1CollectorSpec(kind=metrics_collector_config["kind"])
collector=models.V1beta1CollectorSpec(
kind=metrics_collector_config["kind"],
custom_collector=metrics_collector_config["custom_collector"],
)
)

# Create Trial specification.
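For illustration, here is a minimal usage sketch of the new option added above. It is not part of this PR: the container name, image, command, and experiment name are placeholders, and only `tune()` arguments that appear in this diff and in the e2e test below are used.

```python
from kubeflow.katib import KatibClient, search
from kubernetes.client import V1Container


def objective(parameters):
    # Toy objective; prints the metric in "name=value" form.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    print(f"result={result}")


# Placeholder sidecar container for the custom metrics collector.
collector = V1Container(
    name="my-metrics-collector",
    image="example.com/my-metrics-collector:latest",
    command=["python", "/app/collector.py"],
)

KatibClient().tune(
    name="tune-with-custom-collector",
    objective=objective,
    parameters={
        "a": search.int(min=10, max=20),
        "b": search.double(min=0.1, max=0.2),
    },
    objective_metric_name="result",
    max_trial_count=4,
    metrics_collector_config={"kind": "Custom", "custom_collector": collector},
)
```

If `metrics_collector_config` is omitted entirely, the default shown in the diff (`{"kind": "StdOut", "custom_collector": None}`) applies.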
9 changes: 9 additions & 0 deletions test/e2e/v1beta1/scripts/gh-actions/Dockerfile.dummy-collector
@@ -0,0 +1,9 @@
FROM python:3.8-slim

WORKDIR /app

COPY test/e2e/v1beta1/scripts/gh-actions/dummy-collector.py .

RUN pip install kubernetes

CMD ["python", "dummy-collector.py"]
Member: Why do we need this container?

Author: This container is meant to serve as a dummy metrics collector container for the e2e test.

3 changes: 3 additions & 0 deletions test/e2e/v1beta1/scripts/gh-actions/build-load.sh
100755 → 100644
@@ -167,6 +167,9 @@ done
if "$TUNE_API"; then
echo -e "\nPulling and building testing image for tune function..."
_build_containers "suggestion-hyperopt" "$CMD_PREFIX/suggestion/hyperopt/$VERSION/Dockerfile"

echo -e "\nBuilding dummy collector image..."
_build_containers "dummy-collector" "test/e2e/v1beta1/scripts/gh-actions/Dockerfile.dummy-collector"
fi

echo -e "\nCleanup Build Cache...\n"
27 changes: 27 additions & 0 deletions test/e2e/v1beta1/scripts/gh-actions/dummy-collector.py
@@ -0,0 +1,27 @@
import argparse
import logging
import time

from kubernetes import client, config

# The default logging config.
logging.basicConfig(level=logging.INFO)

def collect_metrics(metric_name: str):
    config.load_incluster_config()
    v1 = client.CoreV1Api()

    while True:
        dummy_metric_value = 42
        logging.info(f"Collected dummy metric: {metric_name}={dummy_metric_value}")

        time.sleep(10)  # Collect metrics every 10 seconds

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--metric-name", type=str, required=True, help="Name of the metric to collect")
    args = parser.parse_args()

    collect_metrics(args.metric_name)


89 changes: 84 additions & 5 deletions test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py
@@ -3,6 +3,7 @@

from kubeflow.katib import KatibClient, search
from kubernetes import client
from kubernetes.client import V1Container
from verify import verify_experiment_results

# Experiment timeout is 40 min.
@@ -11,8 +12,68 @@
# The default logging config.
logging.basicConfig(level=logging.INFO)

def run_e2e_experiment_create_by_tune_custom_metrics_collector(

Member: I don't think we need another E2E just to test this functionality. We already run E2Es using these YAML files for various metrics collectors: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/metrics-collector
I think for your feature, you just need to add unit tests for the Katib Client to verify that the Experiment has the correct specification: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/katib_client_test.py

Author: Sure! Should I then remove the e2e test with the custom metrics collector from run-e2e-tune-api.py?

katib_client: KatibClient,
exp_name: str,
exp_namespace: str,
):
# Create Katib Experiment and wait until it is finished.
logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

# Use the test case from get-started tutorial.
# https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk
# [1] Create an objective function.
def objective(parameters):
import time
time.sleep(5)
result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
print(f"result={result}")

# [2] Create hyperparameter search space.
parameters = {
"a": search.int(min=10, max=20),
"b": search.double(min=0.1, max=0.2)
}

# [3] Create a dummy metric collector (DOES NOT HAVE AN IMAGE)
metric_collector = V1Container(
name="dummy-collector",
image="dummy-collector:latest",
command=["python", "/app/dummy-collector.py"],
args=["--metric-name=result"],
env=[
client.V1EnvVar(name="EXPERIMENT_NAME", value=exp_name),
client.V1EnvVar(name="EXPERIMENT_NAMESPACE", value=exp_namespace)
]
)

# [4] Create Katib Experiment with 4 Trials and 2 CPUs per Trial.
# And Wait until Experiment reaches Succeeded condition.
katib_client.tune(
name=exp_name,
namespace=exp_namespace,
objective=objective,
parameters=parameters,
objective_metric_name="result",
max_trial_count=4,
resources_per_trial={"cpu": "2"},
metrics_collector_config={
"kind": "Custom",
"custom_collector": metric_collector,
},
)
experiment = katib_client.wait_for_experiment_condition(
exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
)

# Verify the Experiment results.
verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)

# Print the Experiment and Suggestion.
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))

def run_e2e_experiment_create_by_tune(
def run_e2e_experiment_create_by_tune_default_metrics_collector(
katib_client: KatibClient,
exp_name: str,
exp_namespace: str,
@@ -57,7 +118,6 @@ def objective(parameters):
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
@@ -78,11 +138,12 @@ def objective(parameters):
namespace_labels['katib.kubeflow.org/metrics-collector-injection'] = 'enabled'
client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

# Test with run_e2e_experiment_create_by_tune
exp_name = "tune-example"
# Test with run_e2e_experiment_create_by_tune_default_metrics_collector
exp_namespace = args.namespace
try:
run_e2e_experiment_create_by_tune(katib_client, exp_name, exp_namespace)
exp_name = "tune-example-default-metrics-collector"
logging.info(f"Runnning E2E for Experiment created by tune: {exp_namespace}/{exp_name}")
run_e2e_experiment_create_by_tune_default_metrics_collector(katib_client, exp_name, exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
except Exception as e:
@@ -94,3 +155,21 @@ def objective(parameters):
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(exp_name, exp_namespace)


# Test with run_e2e_experiment_create_by_tune_custom_metrics_collector
try:
exp_name = "tune-example-custom-metrics-collector"
logging.info(f"Runnning E2E for Experiment created by tune: {exp_namespace}/{exp_name}")
run_e2e_experiment_create_by_tune_custom_metrics_collector(katib_client, exp_name, exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
except Exception as e:
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
raise e
finally:
# Delete the Experiment.
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(exp_name, exp_namespace)

Member: I think we need to delete the former experiment before we run another experiment. Otherwise, we may run into an "xxx experiment already exists" error.
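
As suggested earlier in the review, unit tests in katib_client_test.py may be a lighter-weight way to cover this feature than an extra E2E. The sketch below is not from this PR: it only exercises the spec-mapping logic the diff adds, via a local helper rather than the real `tune()` call, and it assumes the `V1beta1CollectorSpec` model accepts the `custom_collector` field exactly as used in the diff.

```python
from kubeflow.katib import models
from kubernetes.client import V1Container


def build_collector_spec(metrics_collector_config):
    # Mirrors the mapping tune() performs in this PR; .get() is used here so the
    # helper also tolerates configs that omit the custom_collector key.
    return models.V1beta1MetricsCollectorSpec(
        collector=models.V1beta1CollectorSpec(
            kind=metrics_collector_config["kind"],
            custom_collector=metrics_collector_config.get("custom_collector"),
        )
    )


def test_custom_metrics_collector_spec():
    collector = V1Container(name="dummy-collector", image="dummy-collector:latest")
    spec = build_collector_spec({"kind": "Custom", "custom_collector": collector})
    assert spec.collector.kind == "Custom"
    assert spec.collector.custom_collector is collector


def test_default_metrics_collector_spec():
    spec = build_collector_spec({"kind": "StdOut", "custom_collector": None})
    assert spec.collector.kind == "StdOut"
    assert spec.collector.custom_collector is None
```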