Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to emit metric to the target AMP workspace #484

Closed
wants to merge 0 commits into from

Conversation

weicongw
Copy link
Contributor

Add support to emit metric to the target Amazon Managed Service for Prometheus workspace
Beta

Issue #, if available:

Description of changes:

  • Add support to emit metric to the target Amazon Managed Service for Prometheus workspace
  • The test support emitting metric from cross account cross region
  • If amp url is not set, the test will not emitting metics
  • Emit NCCL test avg bus bandwith metric
  • Add metadata label to the metric
  • Add/update readme

Test

go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia --efaEnabled=true --feature=multi-node --ampMetricUrl=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/remote_write --ampMetricRoleArn=arn:aws:iam::665181186642:role/amp
...
        [1,0]<stdout>:# Out of bounds values : 0 OK
        [1,0]<stdout>:# Avg bus bandwidth    : 3.68456 
        [1,0]<stdout>:#
        [1,0]<stdout>:
        
    mpi_test.go:145: Emitting nccl test metrics to AMP

Query the metric from AMP

export AMP_QUERY_ENDPOINT=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/query

awscurl -X POST --region us-west-2 \
--service aps "${AMP_QUERY_ENDPOINT}" \
-d 'query=nccl_average_bandwidth_gbps[60m]' \
--header 'Content-Type: application/x-www-form-urlencoded'

{"status":"success","data":{"resultType":"matrix","result":[{"metric":
{"__name__":"nccl_average_bandwidth_gbps","ami_id":"ami-0cd7612ff47454cd6",
"aws_ofi_nccl_version":"1.9.1","efa_count":"1","efa_enabled":"true",
"efa_installer_version":"1.34.0","instance_type":"p4de.24xlarge",
"kubernetes_version":"1.30+","nccl_version":"2.18.5","node_count":"2",
"nvidia_driver_version":"550.90.07","os_type":"Amazon Linux 2"},
"values":[[1726791286.534,"3.62432"],[1726794564.87,"3.68456"]]}]}}

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant