[Proposal] Support GPU #541
@YujiOshima - I have a few questions:
Thank you for your comments. I think there are a couple of areas we can work on: 1. Metrics / health
Unless there are specific requirements, I think instead of developing a local agent we should look at integration with Prometheus. We can then install node_exporter and nvidia_exporter on the hosts. Monitoring and alerting would then follow the standard Prometheus setup (e.g. Grafana, etc.). If we care about cluster autoscaling, then I see an area of integration where alerts or thresholds from Prometheus can trigger events to Infrakit to scale up the cluster. This can mean …
2. Interaction with higher-level orchestrators (e.g. k8s / swarm). There are two directions: …
For the first one, we already have the … For the second case, the outside-facing API is not yet defined. As a concrete example, for k8s, we can create an API that easily maps to their … This is what I meant when I suggested "Infrakit Apps": an application-specific API that can be implemented on top of Infrakit primitives. So this would be the 'scale group' API for applications that sit on top of infrakit in the stack (see the sketch below). It would also interface with the metrics/health work in part 1 so that cluster scale up/down can be triggered by thresholds, etc. I'd like to get this started and hopefully reconcile with the work you've already done in #474 so that they fit cleanly within the overall architecture of infrakit. Is that ok with you?
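For the second direction, here is a rough, hypothetical sketch (in Go, since that is what infrakit is written in) of the shape such a 'scale group' app API could take. None of these names exist in infrakit today; they only illustrate what a k8s or swarm integration might call.

```go
// Hypothetical "scale group" app API layered on top of infrakit group primitives.
// The interface and method names are illustrative only, not part of infrakit.
package app

// ScaleGroup is the outward-facing surface a higher-level orchestrator would use.
type ScaleGroup interface {
	// Size reports the current desired size of the group.
	Size(groupID string) (int, error)

	// Resize changes the desired size, e.g. when a cluster autoscaler wants more nodes.
	Resize(groupID string, newSize int) error

	// Unhealthy lists instances the flavor's health checks consider bad, so the
	// orchestrator can cordon/drain them before infrakit replaces them.
	Unhealthy(groupID string) ([]string, error)
}
```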
Thank you @chungers! Rather than developing an original agent, I agree we should use Prometheus metrics.
Please let me confirm my understanding.
I think maybe the first thing we should do is define "health". What are the metrics you'd care about (e.g. from NVML) that would be appropriate for infrakit? In other words, what metrics, besides the host disappearing altogether, would you want to monitor so that when they move past certain thresholds, infrakit should start spinning up a new host? From the NVML list, I can see that metrics like active compute processes and GPU utilization would be useful.
Once we have defined what "health" is, we can decide how to collect this data and report it to infrakit. Because node_exporter and nvidia_exporter both listen on network ports, we can have a flavor plugin that polls or scrapes the data from the nodes. This is similar to what the Prometheus server does, but if our flavor plugins can scrape the data directly, then the Prometheus server becomes optional. Does this seem reasonable? It's pretty easy to build a collector of Prometheus data and in turn implement our flavor's Health method (see the sketch below). I think we can even build it generically so that it works for whatever Prometheus monitors -- we just need the agents/exporters running on the hosts.
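As a concrete illustration of the "scrape directly from the exporters" idea, here is a minimal sketch in Go. This is not infrakit's actual flavor API; the exporter port (9445) and the metric name (nvidia_gpu_duty_cycle) are assumptions that depend on which NVIDIA exporter is installed, so treat them as placeholders.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// scrapeGauge fetches Prometheus text-format metrics from addr and returns the
// value of the first sample whose metric name starts with metricPrefix.
func scrapeGauge(addr, metricPrefix string) (float64, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://" + addr + "/metrics")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metricPrefix) {
			continue // skip comments and unrelated metrics
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		return strconv.ParseFloat(fields[len(fields)-1], 64)
	}
	return 0, fmt.Errorf("metric %q not found at %s", metricPrefix, addr)
}

// gpuHealthy is roughly what a flavor's Health check could do for one node:
// an unreachable exporter counts as unhealthy; otherwise compare GPU
// utilization against a threshold.
func gpuHealthy(nodeAddr string, maxUtilization float64) (bool, error) {
	util, err := scrapeGauge(nodeAddr+":9445", "nvidia_gpu_duty_cycle")
	if err != nil {
		return false, err
	}
	return util <= maxUtilization, nil
}

func main() {
	healthy, err := gpuHealthy("10.0.0.5", 95)
	fmt.Println(healthy, err)
}
```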
Yes.
Yes - in essence infrakit acts like an autoscaling group. On AWS, yes, you could use their ASG, but we can support specialized use cases such as retaining the EBS volume where you may have checkpointed, restorable containers. This may be especially useful for training that has been running for some time, where you want to be able to resume cleanly. I don't think you can do this easily with ASG. Of course, in on-prem cases there are no autoscaling groups, so we can definitely fill that gap.
We will revisit your infrakit app PR because it is related in terms of how these "apps" fit with the rest of the architecture. I will start a PR soon and have you take a look to see if it makes sense. The tricky part is that we need to have a single endpoint for higher-level systems to access, but in HA mode our daemons run on different hosts. So this API endpoint needs to be able to route traffic to the current leader (see the sketch below). I think the API will be REST instead of JSON-RPC or GRPC. What do you think?
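To make the single-endpoint idea concrete, here is a minimal sketch of a REST front end that handles requests locally when this daemon is the leader and otherwise reverse-proxies to the leader. The isLeader and leaderURL functions are stand-ins for whatever leadership detection infrakit's HA mode actually provides.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newRouter wraps a local handler so that callers only ever need one endpoint:
// non-leaders transparently forward REST calls to the current leader.
func newRouter(isLeader func() bool, leaderURL func() (*url.URL, error), local http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isLeader() {
			local.ServeHTTP(w, r) // we are the leader: serve the request here
			return
		}
		target, err := leaderURL()
		if err != nil {
			http.Error(w, "leader unknown: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
	})
}

func main() {
	// Placeholder wiring: a real daemon would plug in its leadership check and
	// the leader's advertised address.
	isLeader := func() bool { return false }
	leaderURL := func() (*url.URL, error) { return url.Parse("http://leader.example:8080") }
	http.ListenAndServe(":8080", newRouter(isLeader, leaderURL, http.NotFoundHandler()))
}
```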
I would like to make it possible to manage GPU clusters using Infrakit.
As an idea, starting with NVIDIA GPUs only, create a flavor that installs the NVIDIA driver and CUDA.
For example, by combining the cuda flavor and the swarm flavor, you can build a GPU cluster with swarm.
Example of the run options:
infrakit-flavor-cuda --driver=369.95 --cuda=8.0
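To sketch how such a flavor might work (purely illustrative; the function name, package names, and apt commands below are placeholders, not a tested install recipe), the cuda flavor could inject the driver/CUDA installation into the instance's init script, which is what would let it be layered with another flavor such as swarm.

```go
package main

import "fmt"

// appendCudaInstall appends (placeholder) driver/CUDA install commands to an
// instance init script, roughly what a cuda flavor's prepare step could do.
func appendCudaInstall(baseInit, driverVersion, cudaVersion string) string {
	install := fmt.Sprintf(`
# --- added by cuda flavor (sketch) ---
# Install the NVIDIA driver and CUDA toolkit; exact packages and repos vary by distro.
apt-get update
apt-get install -y nvidia-driver-%s cuda-toolkit-%s
`, driverVersion, cudaVersion)
	return baseInit + install
}

func main() {
	// Values mirror the proposed flags: --driver=369.95 --cuda=8.0
	fmt.Println(appendCudaInstall("#!/bin/bash\n", "369.95", "8.0"))
}
```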