refactor(sdk): added option for custom metric collector for tune in… #2406
Changes from 11 commits
SDK `tune()` signature change:

@@ -188,7 +188,10 @@ def tune(
        retain_trials: bool = False,
        packages_to_install: List[str] = None,
        pip_index_url: str = "https://pypi.org/simple",
-        metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"},
+        metrics_collector_config: Dict[str, Any] = {
+            "kind": "StdOut",
+            "custom_collector": None
+        },
    ):
        """Create HyperParameter Tuning Katib Experiment from the objective function.
@@ -253,7 +256,18 @@ def tune(
            to the base image packages. These packages are installed before
            executing the objective function.
        pip_index_url: The PyPI url from which to install Python packages.
-        metrics_collector_config: Specify the config of metrics collector,
-            for example, `metrics_collector_config = {"kind": "Push"}`.
-            Currently, we only support `StdOut` and `Push` metrics collector.
+        metrics_collector_config: Specify the configuration for the metrics collector with the following keys:
+            - **kind**: The kind of metrics collector. Currently supported values are:
+                - `StdOut`: Collects metrics from standard output.
+                - `None`: No metrics collection.
+                - `File`: Writes metrics to a file.
+                - `TensorFlowEvent`: Collects metrics in TensorFlow Event format.
+                - `PrometheusMetric`: Exposes metrics in a Prometheus-compatible format.
+                - `Custom`: For custom metrics collection. Use the "custom_collector" key to specify the collector instance.
+            - **custom_collector**: If `kind` is set to `Custom`, you must provide an instance of a custom `V1Container` as the value. For example:
+                `metrics_collector_config = {"kind": "Custom", "custom_collector": <Instance of V1Container>}`.

        Raises:
            ValueError: Function arguments have incorrect type or value.

Review thread on the `None` entry:
- Reviewer: This is not supported.
- Author: Thanks for pointing this out. Can you point me to where I can find all supported metrics collectors? For the current list I referenced https://github.com/kubeflow/katib/blob/master/pkg/ui/v1beta1/frontend/src/app/models/experiment.k8s.model.ts
- Reviewer: Yeah, the UI also needs to be updated with the latest changes. cc @Electronic-Waste
- @Electronic-Waste: Thanks Andrey, I'll update the UI with the latest changes.

Review thread on the `custom_collector` entry:
- Reviewer: Can we add support for custom_collectors in follow-up PRs when we get user requests?
- Author: Sorry, but can you elaborate on this a bit? I am not quite sure what this intends, since a custom metrics collector is already supported by the underlying framework for the tune function.
- Reviewer: I meant that until we find use-cases when […]. Any thoughts, @shannonbradshaw @prakhar479 @johnugeorge @tenzen-y?
@@ -396,7 +408,10 @@ def tune(
        # Up to now, we only support parameter `kind`, of which default value
        # is `StdOut`, to specify the kind of metrics collector.
        experiment.spec.metrics_collector_spec = models.V1beta1MetricsCollectorSpec(
-            collector=models.V1beta1CollectorSpec(kind=metrics_collector_config["kind"])
+            collector=models.V1beta1CollectorSpec(
+                kind=metrics_collector_config["kind"],
+                custom_collector=metrics_collector_config["custom_collector"],
+            )
        )

        # Create Trial specification.
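One side note that is not part of the diff: with the lookup above, a caller that still passes the previously documented `metrics_collector_config={"kind": "StdOut"}` without a `custom_collector` key would hit a `KeyError`. A minimal sketch of a more tolerant lookup, reusing the same `models` classes from the diff:

```python
# Sketch only: fall back to None when "custom_collector" is omitted,
# e.g. for the previously documented {"kind": "StdOut"} form.
experiment.spec.metrics_collector_spec = models.V1beta1MetricsCollectorSpec(
    collector=models.V1beta1CollectorSpec(
        kind=metrics_collector_config["kind"],
        custom_collector=metrics_collector_config.get("custom_collector"),
    )
)
```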
New file: Dockerfile for the dummy metrics collector image used in the e2e test.

@@ -0,0 +1,9 @@
FROM python:3.8-slim

WORKDIR /app

COPY test/e2e/v1beta1/scripts/gh-actions/dummy-collector.py .

RUN pip install kubernetes

CMD ["python", "dummy-collector.py"]

Review thread on this file:
- Reviewer: Why do we need this container?
- Author: This container serves as a dummy metrics collector for the e2e test.
New file: test/e2e/v1beta1/scripts/gh-actions/dummy-collector.py

@@ -0,0 +1,27 @@
import argparse
import logging
import time

from kubernetes import client, config

# The default logging config.
logging.basicConfig(level=logging.INFO)


def collect_metrics(metric_name: str):
    config.load_incluster_config()
    v1 = client.CoreV1Api()

    while True:
        dummy_metric_value = 42
        logging.info(f"Collected dummy metric: {metric_name}={dummy_metric_value}")

        time.sleep(10)  # Collect metrics every 10 seconds


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--metric-name", type=str, required=True, help="Name of the metric to collect")
    args = parser.parse_args()

    collect_metrics(args.metric_name)
E2E test script changes:

@@ -3,6 +3,7 @@

 from kubeflow.katib import KatibClient, search
 from kubernetes import client
+from kubernetes.client import V1Container
 from verify import verify_experiment_results

 # Experiment timeout is 40 min.
@@ -11,8 +12,68 @@
 # The default logging config.
 logging.basicConfig(level=logging.INFO)

+def run_e2e_experiment_create_by_tune_custom_metrics_collector(
+    katib_client: KatibClient,
+    exp_name: str,
+    exp_namespace: str,
+):
+    # Create Katib Experiment and wait until it is finished.
+    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))
+
+    # Use the test case from get-started tutorial.
+    # https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk
+    # [1] Create an objective function.
+    def objective(parameters):
+        import time
+        time.sleep(5)
+        result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
+        print(f"result={result}")
+
+    # [2] Create hyperparameter search space.
+    parameters = {
+        "a": search.int(min=10, max=20),
+        "b": search.double(min=0.1, max=0.2)
+    }
+
+    # [3] Create a dummy metric collector (DOES NOT HAVE A IMAGE)
+    metric_collector = V1Container(
+        name="dummy-collector",
+        image="dummy-collector:latest",
+        command=["python", "/app/dummy-collector.py"],
+        args=["--metric-name=result"],
+        env=[
+            client.V1EnvVar(name="EXPERIMENT_NAME", value=exp_name),
+            client.V1EnvVar(name="EXPERIMENT_NAMESPACE", value=exp_namespace)
+        ]
+    )
+
+    # [4] Create Katib Experiment with 4 Trials and 2 CPUs per Trial.
+    # And Wait until Experiment reaches Succeeded condition.
+    katib_client.tune(
+        name=exp_name,
+        namespace=exp_namespace,
+        objective=objective,
+        parameters=parameters,
+        objective_metric_name="result",
+        max_trial_count=4,
+        resources_per_trial={"cpu": "2"},
+        metrics_collector_config={
+            "kind": "Custom",
+            "custom_collector": metric_collector,
+        },
+    )
+    experiment = katib_client.wait_for_experiment_condition(
+        exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
+    )
+
+    # Verify the Experiment results.
+    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
+
+    # Print the Experiment and Suggestion.
+    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
+    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
+
-def run_e2e_experiment_create_by_tune(
+def run_e2e_experiment_create_by_tune_default_metrics_collector(
     katib_client: KatibClient,
     exp_name: str,
     exp_namespace: str,

Review thread on run_e2e_experiment_create_by_tune_custom_metrics_collector:
- Reviewer: I don't think we need another E2E just to test this functionality.
- Author: Sure! Should I then remove the e2e test with the custom metric collector from […]?
- Reviewer: Yes, please. Can you add unit tests for your functionality: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/katib_client_test.py ?
@@ -57,7 +118,6 @@ def objective(parameters):
     logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
     logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))

-

 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     parser.add_argument(
@@ -78,11 +138,12 @@ def objective(parameters):
     namespace_labels['katib.kubeflow.org/metrics-collector-injection'] = 'enabled'
     client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

-    # Test with run_e2e_experiment_create_by_tune
-    exp_name = "tune-example"
+    # Test with run_e2e_experiment_create_by_tune_default_metrics_collector
     exp_namespace = args.namespace
     try:
-        run_e2e_experiment_create_by_tune(katib_client, exp_name, exp_namespace)
+        exp_name = "tune-example-default-metrics-collector"
+        logging.info(f"Runnning E2E for Experiment created by tune: {exp_namespace}/{exp_name}")
+        run_e2e_experiment_create_by_tune_default_metrics_collector(katib_client, exp_name, exp_namespace)
         logging.info("---------------------------------------------------------------")
         logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
     except Exception as e:
@@ -94,3 +155,21 @@ def objective(parameters):
         logging.info("---------------------------------------------------------------")
         logging.info("---------------------------------------------------------------")
         katib_client.delete_experiment(exp_name, exp_namespace)
+
+
+    # Test with run_e2e_experiment_create_by_tune_custom_metrics_collector
+    try:
+        exp_name = "tune-example-custom-metrics-collector"
+        logging.info(f"Runnning E2E for Experiment created by tune: {exp_namespace}/{exp_name}")
+        run_e2e_experiment_create_by_tune_custom_metrics_collector(katib_client, exp_name, exp_namespace)
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
+    except Exception as e:
+        logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
+        raise e
+    finally:
+        # Delete the Experiment.
+        logging.info("---------------------------------------------------------------")
+        logging.info("---------------------------------------------------------------")
+        katib_client.delete_experiment(exp_name, exp_namespace)

Review thread on this block:
- Reviewer: I think we need to delete the former experiment before we run another experiment. Otherwise, we may run into […]
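Following up on the review request above to cover this functionality with unit tests in katib_client_test.py rather than a second E2E, here is a minimal sketch of what such a test could look like. It assumes tune() passes the built Experiment to KatibClient.create_experiment as its first positional argument and that no real kubeconfig is available in the test environment; the names and structure are illustrative only, not the final tests:

```python
from unittest.mock import patch

from kubeflow.katib import KatibClient, search
from kubernetes.client import V1Container


def test_tune_with_custom_metrics_collector():
    # Avoid loading a real kubeconfig when constructing the client.
    with patch("kubernetes.config.load_kube_config", return_value=None):
        katib_client = KatibClient()

    collector = V1Container(name="dummy-collector", image="dummy-collector:latest")

    def objective(parameters):
        print(f"result={parameters['a']}")

    # Capture the Experiment that tune() builds instead of creating it in a cluster.
    with patch.object(katib_client, "create_experiment") as mock_create:
        katib_client.tune(
            name="tune-unit-test",
            objective=objective,
            parameters={"a": search.int(min=1, max=5)},
            objective_metric_name="result",
            metrics_collector_config={"kind": "Custom", "custom_collector": collector},
        )

    experiment = mock_create.call_args[0][0]
    collector_spec = experiment.spec.metrics_collector_spec.collector
    assert collector_spec.kind == "Custom"
    assert collector_spec.custom_collector == collector
```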
Review comment:
- Reviewer: I think we may need to tell users about the supported types of MC. So can you re-add this line?