[k8s] On-demand single-host TPU support on GKE #3947
base: master
Conversation
…com/gpu and google.com/tpu
Thanks for the fix @landscapepainter! Mostly looks good to me, and I'm testing now 🫡 Figured it is better to send these comments first.
sky/cli.py
Outdated
# TODO(Doyoung): Update the message with the multi-host TPU
# support.
k8s_per_node_acc_message = (
    'Kubernetes per node accelerator availability ')
if kubernetes_utils.multi_host_tpu_exists_in_cluster(
        context):
    k8s_per_node_acc_message += (
        '(Note: Multi-host TPUs are not supported.)')
Suggested change:
  # TODO(Doyoung): Update the message with the multi-host TPU
  # support.
- k8s_per_node_acc_message = (
-     'Kubernetes per node accelerator availability ')
- if kubernetes_utils.multi_host_tpu_exists_in_cluster(
-         context):
-     k8s_per_node_acc_message += (
-         '(Note: Multi-host TPUs are not supported.)')
+ maybe_tpu_multi_host_hint = ''
+ if kubernetes_utils.multi_host_tpu_exists_in_cluster(
+         context):
+     maybe_tpu_multi_host_hint = f'Detected {xxx} node...'
Should we say something like "detected xxx nodes that are using multi-host TPU, skipping them"?
@cblmemo I'm not convinced this is needed on top of the minimal note I already added, for the following reasons:
- There can be multiple multi-host TPUs in a user's GKE cluster. If there are, say, 10 of them, your suggestion would list all of them. I'm not sure that's the best UX, as we are trying to keep things concise. If it's important info, we should add it, but...
- I'm also wondering if this is necessary to begin with, since users of TPUs on GKE would know what a multi-host TPU is and whether one exists in the cluster.
My main concern here is that the message "Multi-host TPUs are not supported." does not convey the meaning of "we excluded some nodes from your cluster", which might confuse users.
One way is to count the number of nodes (or the number of TPUs) and show something like "xxx nodes with multi-host TPU setup are not included / excluded in the resources".
@cblmemo I see. That makes sense, and I agree with the concern. I extended the message so that users are notified that the multi-host TPU nodes in their GKE cluster are excluded from the display:
Kubernetes per node accelerator availability (Note: Multi-host TPUs are detected and excluded from the display as multi-host TPUs are not supported.)
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
gke-mix-tpu-dy-default-pool-ad5bdc4d-9lw4 None 0 0
gke-mix-tpu-dy-default-pool-ad5bdc4d-bs86 None 0 0
gke-mix-tpu-dy-default-pool-ad5bdc4d-cfxn None 0 0
gke-mix-tpu-dy-default-pool-ad5bdc4d-nr5x None 0 0
gke-mix-tpu-dy-default-pool-ad5bdc4d-qgjt None 0 0
gke-mix-tpu-dy-default-pool-ad5bdc4d-rl37 None 0 0
gke-mix-tpu-dy-default-pool-ad5bdc4d-v4ts None 0 0
gke-mix-tpu-dy-default-pool-ad5bdc4d-zp2x None 0 0
gke-tpu-a3716138-984x tpu-v5-lite-podslice 4 0
gke-tpu-c5117ac4-qfzt tpu-v5-lite-podslice 1 1
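Along the lines of counting the excluded nodes, here is a minimal sketch of how the extended hint could be assembled. The node count would come from a hypothetical count_multi_host_tpu_nodes(context) helper; the PR itself only exposes the boolean multi_host_tpu_exists_in_cluster(context).

def build_k8s_acc_message(num_multi_host_nodes: int) -> str:
    # Base header printed above the per-node availability table.
    message = 'Kubernetes per node accelerator availability '
    if num_multi_host_nodes > 0:
        # Tell users how many nodes were hidden, not just that the
        # feature is unsupported.
        message += (f'(Note: {num_multi_host_nodes} multi-host TPU '
                    'node(s) are excluded from the display, as '
                    'multi-host TPUs are not supported.)')
    return message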
Oh, are those TPUs in the default pool multi-host TPUs? They look a little overwhelming to me. Is it possible to only show those excluded nodes with -a? (It would also be helpful if we could add some comments to the docstring of sky show-gpus and here:
Lines 3274 to 3275 in 42c79e1
    yield ('\n\nHint: use -a/--all to see all accelerators '
           '(including non-common ones) and pricing.')
)
@cblmemo The ones in the default pool are not multi-host. Multi-host TPUs do not appear in the list above; those are all either CPU instances or single-host TPUs.
I see. Maybe worth filing an issue to omit them from the output, as this does not look relevant to sky show-gpus.
sky/templates/kubernetes-ray.yml.j2
Outdated
{% if tpu_requested %}
google.com/tpu: {{accelerator_count}}
{% else %}
nvidia.com/gpu: {{accelerator_count}}
{% endif %}
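As a quick sanity check of both branches, the snippet can be rendered standalone with the jinja2 package. This is just a sketch; the variable names match the template, and the surrounding pod spec is omitted.

import jinja2

snippet = (
    '{% if tpu_requested %}\n'
    'google.com/tpu: {{accelerator_count}}\n'
    '{% else %}\n'
    'nvidia.com/gpu: {{accelerator_count}}\n'
    '{% endif %}')
template = jinja2.Template(snippet, trim_blocks=True)
# TPU branch: the pod requests google.com/tpu.
print(template.render(tpu_requested=True, accelerator_count=4))
# GPU branch: the pre-existing nvidia.com/gpu request.
print(template.render(tpu_requested=False, accelerator_count=8))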
bump this again
I tried to launch a task on an existing cluster but failed with the following error. Manually commenting out the …
# Create a cluster
$ sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4
# Launch a task on it, manually override with --gpus
$ sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml
Task from YAML spec: /home/txia/skypilot/examples/tpu/tpuvm_mnist.yaml
Missing runtime_version in accelerator_args, using default (tpu-vm-base)
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
Requested: {1x GCP({'tpu-v5-lite-podslice': 4}, accelerator_args={'runtime_version': 'tpu-vm-base'})}
Existing: 1x Kubernetes(2CPU--8GB--4tpu-v5-lite-podslice, {'tpu-v5-lite-podslice': 4}, accelerator_args={})
To fix: specify a new cluster name, or down the existing cluster first: sky down gke-tpu-4
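Until the optimizer handles this, one workaround is to pin the cloud so the TPU name is not inferred as GCP. A sketch using SkyPilot's Python API, assuming sky.Kubernetes, sky.Resources, and sky.launch behave as on master at this revision:

import sky

# Pin the cloud to Kubernetes so the requested resources match the
# existing GKE-backed cluster instead of being inferred as GCP TPUs
# (which auto-fills accelerator_args={'runtime_version': ...}).
task = sky.Task(run='python -c "import jax; print(jax.devices())"')
task.set_resources(
    sky.Resources(cloud=sky.Kubernetes(),
                  accelerators={'tpu-v5-lite-podslice': 4}))
sky.launch(task, cluster_name='gke-tpu-4')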
When launching with …
(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Is this expected (e.g., are the 4 of them a single basic scheduling unit)? If it is, should we prevent users from specifying fewer TPUs than 4 (or whatever the TPU count in one host is)? Also, the above example failed for me. Could you check this as well?
(tpuvm_mnist, pid=3325) Requirement already satisfied: clu in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (0.0.12)
(tpuvm_mnist, pid=3325) Requirement already satisfied: absl-py in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (2.1.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: etils[epath] in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (1.10.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: flax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.8.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: jax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.4.25)
(tpuvm_mnist, pid=3325) Requirement already satisfied: jaxlib in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.4.25)
(tpuvm_mnist, pid=3325) Requirement already satisfied: ml-collections in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.1.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: numpy in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (2.0.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: packaging in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (24.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: typing-extensions in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (4.12.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: wrapt in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (1.16.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: fsspec in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (2024.10.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: importlib_resources in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (6.4.5)
(tpuvm_mnist, pid=3325) Requirement already satisfied: zipp in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (3.20.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: msgpack in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (1.1.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: optax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.2.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: orbax-checkpoint in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.5.18)
(tpuvm_mnist, pid=3325) Requirement already satisfied: tensorstore in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.1.67)
(tpuvm_mnist, pid=3325) Requirement already satisfied: rich>=11.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (13.9.3)
(tpuvm_mnist, pid=3325) Requirement already satisfied: PyYAML>=5.4.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (6.0.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: ml-dtypes>=0.2.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (0.4.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: opt-einsum in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (3.4.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: scipy>=1.9 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (1.14.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: six in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from ml-collections->clu) (1.16.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: contextlib2 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from ml-collections->clu) (21.6.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: markdown-it-py>=2.2.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from rich>=11.1->flax->clu) (3.0.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from rich>=11.1->flax->clu) (2.18.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: chex>=0.1.86 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from optax->flax->clu) (0.1.86)
(tpuvm_mnist, pid=3325) Requirement already satisfied: nest_asyncio in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from orbax-checkpoint->flax->clu) (1.6.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: protobuf in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from orbax-checkpoint->flax->clu) (3.20.3)
(tpuvm_mnist, pid=3325) Requirement already satisfied: toolz>=0.9.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from chex>=0.1.86->optax->flax->clu) (1.0.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: mdurl~=0.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax->clu) (0.1.2)
(tpuvm_mnist, pid=3325)
(tpuvm_mnist, pid=3325) A module that was compiled using NumPy 1.x cannot be run in
(tpuvm_mnist, pid=3325) NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
(tpuvm_mnist, pid=3325) versions of NumPy, modules must be compiled with NumPy 2.0.
(tpuvm_mnist, pid=3325) Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
(tpuvm_mnist, pid=3325)
(tpuvm_mnist, pid=3325) If you are a user of the module, the easiest solution will be to
(tpuvm_mnist, pid=3325) downgrade to 'numpy<2' or try to upgrade the affected module.
(tpuvm_mnist, pid=3325) We expect that some modules will need time to support NumPy 2.
(tpuvm_mnist, pid=3325)
(tpuvm_mnist, pid=3325) Traceback (most recent call last): File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 25, in <module>
(tpuvm_mnist, pid=3325) import jax
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/__init__.py", line 37, in <module>
(tpuvm_mnist, pid=3325) import jax.core as _core
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/core.py", line 18, in <module>
(tpuvm_mnist, pid=3325) from jax._src.core import (
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/core.py", line 38, in <module>
(tpuvm_mnist, pid=3325) from jax._src import dtypes
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/dtypes.py", line 33, in <module>
(tpuvm_mnist, pid=3325) from jax._src import config
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/config.py", line 27, in <module>
(tpuvm_mnist, pid=3325) from jax._src import lib
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/lib/__init__.py", line 87, in <module>
(tpuvm_mnist, pid=3325) import jaxlib.xla_client as xla_client
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jaxlib/xla_client.py", line 32, in <module>
(tpuvm_mnist, pid=3325) from . import xla_extension as _xla
(tpuvm_mnist, pid=3325) AttributeError: _ARRAY_API not found
(tpuvm_mnist, pid=3325) 2024-10-28 20:47:33.959159: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(tpuvm_mnist, pid=3325) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
(tpuvm_mnist, pid=3325) E0000 00:00:1730148453.973354 3798 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(tpuvm_mnist, pid=3325) E0000 00:00:1730148453.977656 3798 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(tpuvm_mnist, pid=3325) 2024-10-28 20:47:35.939498: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
(tpuvm_mnist, pid=3325) I1028 20:47:41.511022 138348441096832 main.py:51] JAX process: 0 / 1
(tpuvm_mnist, pid=3325) I1028 20:47:41.511178 138348441096832 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
(tpuvm_mnist, pid=3325) I1028 20:47:41.511398 138348441096832 local.py:45] Setting task status: process_index: 0, process_count: 1
(tpuvm_mnist, pid=3325) I1028 20:47:41.511511 138348441096832 local.py:50] Created artifact workdir of type ArtifactType.DIRECTORY and value /tmp/mnist.
(tpuvm_mnist, pid=3325) I1028 20:47:42.905525 138348441096832 dataset_info.py:805] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
(tpuvm_mnist, pid=3325) I1028 20:47:43.162729 138348441096832 dataset_info.py:617] Load dataset info from /tmp/tmpfaoy19getfds
(tpuvm_mnist, pid=3325) I1028 20:47:43.164796 138348441096832 dataset_info.py:709] For 'mnist/3.0.1': fields info.[citation, splits, supervised_keys, module_name] differ on disk and in the code. Keeping the one from code.
(tpuvm_mnist, pid=3325) I1028 20:47:43.165012 138348441096832 dataset_builder.py:644] Generating dataset mnist (/home/sky/tensorflow_datasets/mnist/3.0.1)
(tpuvm_mnist, pid=3325) Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /home/sky/tensorflow_datasets/mnist/3.0.1...
(tpuvm_mnist, pid=3325) I1028 20:47:43.289096 138348441096832 dataset_builder.py:693] Dataset mnist is hosted on GCS. It will automatically be downloaded to your
(tpuvm_mnist, pid=3325) local data directory. If you'd instead prefer to read directly from our public
(tpuvm_mnist, pid=3325) GCS bucket (recommended if you're running on GCP), you can instead pass
(tpuvm_mnist, pid=3325) `try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.
(tpuvm_mnist, pid=3325)
Dl Completed...: 100%|██████████| 5/5 [00:00<00:00, 14.90 file/s]:00<00:00, 4.84 file/s]
(tpuvm_mnist, pid=3325) I1028 20:47:43.678423 138348441096832 dataset_info.py:617] Load dataset info from /home/sky/tensorflow_datasets/mnist/incomplete.1RCFAI_3.0.1/
(tpuvm_mnist, pid=3325) I1028 20:47:43.679890 138348441096832 dataset_info.py:709] For 'mnist/3.0.1': fields info.[citation, splits, supervised_keys, module_name, file_format] differ on disk and in the code. Keeping the one from code.
(tpuvm_mnist, pid=3325) Dataset mnist downloaded and prepared to /home/sky/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
(tpuvm_mnist, pid=3325) I1028 20:47:43.681097 138348441096832 reader.py:261] Creating a tf.data.Dataset reading 1 files located in folders: /home/sky/tensorflow_datasets/mnist/3.0.1.
(tpuvm_mnist, pid=3325) I1028 20:47:44.686084 138348441096832 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train, from /home/sky/tensorflow_datasets/mnist/3.0.1
(tpuvm_mnist, pid=3325) I1028 20:47:44.686940 138348441096832 reader.py:261] Creating a tf.data.Dataset reading 1 files located in folders: /home/sky/tensorflow_datasets/mnist/3.0.1.
(tpuvm_mnist, pid=3325) I1028 20:47:44.888163 138348441096832 logging_logger.py:49] Constructing tf.data.Dataset mnist for split test, from /home/sky/tensorflow_datasets/mnist/3.0.1
(tpuvm_mnist, pid=3325) Traceback (most recent call last):
(tpuvm_mnist, pid=3325) File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 69, in <module>
(tpuvm_mnist, pid=3325) app.run(main)
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/absl/app.py", line 308, in run
(tpuvm_mnist, pid=3325) _run_main(main, args)
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
(tpuvm_mnist, pid=3325) sys.exit(main(argv))
(tpuvm_mnist, pid=3325) File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 64, in main
(tpuvm_mnist, pid=3325) train.train_and_evaluate(FLAGS.config, FLAGS.workdir)
(tpuvm_mnist, pid=3325) File "/home/sky/sky_workdir/flax/examples/mnist/train.py", line 130, in train_and_evaluate
(tpuvm_mnist, pid=3325) train_ds, test_ds = get_datasets()
(tpuvm_mnist, pid=3325) File "/home/sky/sky_workdir/flax/examples/mnist/train.py", line 105, in get_datasets
(tpuvm_mnist, pid=3325) train_ds['image'] = jnp.float32(train_ds['image']) / 255.0
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 152, in __call__
(tpuvm_mnist, pid=3325) return asarray(x, dtype=self.dtype)
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 2233, in asarray
(tpuvm_mnist, pid=3325) return array(a, dtype=dtype, copy=bool(copy), order=order) # type: ignore
(tpuvm_mnist, pid=3325) File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 2174, in array
(tpuvm_mnist, pid=3325) out = np.array(object, dtype=dtype, ndmin=ndmin, copy=False) # type: ignore[arg-type]
(tpuvm_mnist, pid=3325) ValueError: Unable to avoid copy while creating an array as requested.
(tpuvm_mnist, pid=3325) If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
(tpuvm_mnist, pid=3325) For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
ERROR: Job 1 failed with return code list: [1]
✓ Job finished (status: FAILED).
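If the 4 chips really are one scheduling unit, an early validation along these lines could reject impossible counts up front. This is only a sketch; the valid-count set and helper name are hypothetical, not part of the PR.

# Hypothetical check: single-host tpu-v5-lite-podslice node pools expose
# a fixed number of chips per node (e.g. 1, 4, or 8), so reject requests
# that no single node can ever satisfy.
VALID_SINGLE_HOST_TPU_COUNTS = {1, 4, 8}

def check_tpu_count(requested: int) -> None:
    if requested not in VALID_SINGLE_HOST_TPU_COUNTS:
        raise ValueError(
            f'Requested {requested} TPU chips, but single-host GKE node '
            f'pools offer only {sorted(VALID_SINGLE_HOST_TPU_COUNTS)} '
            'chips per node.')

check_tpu_count(4)  # OK
check_tpu_count(2)  # raises ValueError with an actionable message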
Also, when I tried to launch with 2 TPUs, it seems the error does not show the real reason (e.g., that only tpu:4 and tpu:1 are available).
$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.
Hint: sky show-gpus to list available accelerators.
sky check to check the enabled clouds.
Should we show a similar fuzzy result, like this?
$ sky launch --gpus A100:3
No resource satisfying <Cloud>({'A100': 3}) on [Kubernetes, Lambda, GCP, Azure, AWS, RunPod].
Did you mean: ['A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:4', 'A100-80GB:8', 'A100:16', 'A100:4', 'A100:8']
sky.exceptions.ResourcesUnavailableError: Catalog and kubernetes cluster does not contain any instances satisfying the request: 1x <Cloud>({'A100': 3}).
To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:4', 'A100-80GB:8', 'A100:16', 'A100:4', 'A100:8']
Hint: sky show-gpus to list available accelerators.
sky check to check the enabled clouds.
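For the Kubernetes path, a fuzzy hint like the catalog one could be produced with difflib from the standard library. A sketch over whatever accelerator:count combinations the cluster actually offers; the offered list here is illustrative only:

import difflib

def fuzzy_candidates(requested: str, offered: list) -> list:
    # Return the offered accelerator:count strings closest to the
    # request, mirroring the "Did you mean" hint from the catalog path.
    return difflib.get_close_matches(requested, offered, n=5, cutoff=0.5)

offered = ['tpu-v5-lite-podslice:1', 'tpu-v5-lite-podslice:4']
print(fuzzy_candidates('tpu-v5-lite-podslice:2', offered))
# Prints both offerings, since each differs by only a single character.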
@cblmemo Seems like this is consistent behavior for …
@cblmemo …
Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)
(sky-serve) ➜ skypilot git:(new_provision_api) ✗ sky launch @temp/a.yaml --instance-type n2-standard-8
Task from YAML spec: @temp/a.yaml
ValueError: Invalid instance type 'n2-standard-8' for cloud AWS.
(sky-serve) ➜ skypilot git:(new_provision_api) ✗ cat @temp/a.yaml
resources:
  cloud: aws
At least we should show such error information? The current error is a little confusing to me. Also, the current conflict is between two auto-filled clouds. If the user explicitly sets the cloud in the YAML and that causes a conflict, that sounds reasonable to me. But I would be surprised if I didn't set the cloud and two of SkyPilot's automatic cloud inferences caused a conflict.
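For context, the conflict comes from inferring a cloud from the instance-type name. A sketch of that kind of inference (the prefix table is illustrative, not SkyPilot's actual logic), which would let the error name both sources of the cloud choice:

# Illustrative prefix table: lets an error message say "YAML sets
# cloud: aws, but --instance-type n2-standard-8 implies GCP".
_PREFIX_TO_CLOUD = {
    'n1-': 'GCP',
    'n2-': 'GCP',
    'm5.': 'AWS',
    't3.': 'AWS',
}

def infer_cloud_from_instance_type(instance_type: str):
    for prefix, cloud in _PREFIX_TO_CLOUD.items():
        if instance_type.startswith(prefix):
            return cloud
    return None  # Unknown prefix: leave the cloud unconstrained.

assert infer_cloud_from_instance_type('n2-standard-8') == 'GCP'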
@cblmemo Not sure why that error is printing out info regarding TPUs with 4 chips, but it's fixed at 688c0b4.
Also, it does not detect TPUs with 4 chips when a pod with 1 TPU chip is provisioned:
I just tested this out again by specifying …, because SkyPilot knows that … So the fuzzy error message is supposed to appear only when the cloud is not specified but the instance is available in multiple clouds. I guess there isn't anything to file an issue to fix?
@cblmemo @romilbhardwaj This is ready for another round. Thanks!!
I see. LGTM
Oh sorry, I meant launch with … cc @romilbhardwaj for a look here
To reproduce:
$ sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v5-lite-podslice:4
Task from YAML spec: examples/tpu/tpuvm_mnist.yaml
Considered resources (1 node):
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Kubernetes 2CPU--8GB--4tpu-v5-lite-podslice 2 8 tpu-v5-lite-podslice:4 gke_skypilot-375900_us-south1-a_mix-tpu-test-txia 0.00 ✔
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-3ccc-txia'. Proceed? [Y/n]:
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia. View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/provision.log
⚙︎ Running setup on 1 pod.
✓ Setup completed. View logs at: ~/sky_logs/sky-2024-11-09-13-52-46-235927/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
✓ Job finished (status: SUCCEEDED).
Job ID: 1
📋 Useful Commands
├── To cancel the job: sky cancel sky-3ccc-txia 1
├── To stream job logs: sky logs sky-3ccc-txia 1
└── To view job queue: sky queue sky-3ccc-txia
Cluster name: sky-3ccc-txia
├── To log into the head VM: ssh sky-3ccc-txia
├── To submit a job: sky exec sky-3ccc-txia yaml_file
├── To stop the cluster: sky stop sky-3ccc-txia
└── To teardown the cluster: sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.
$ sky launch --gpus tpu-v5-lite-podslice:2 -c sky-3ccc-txia 'conda activate flax; python -c "import jax; print(jax.devices())"'
Task from command: conda activate flax; python -c "import jax; print(jax.devices())"
Running task on cluster sky-3ccc-txia...
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-3ccc-txia. View logs at: ~/sky_logs/sky-2024-11-09-14-05-11-848087/provision.log
⚙︎ Job submitted, ID: 3
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=34717) [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
✓ Job finished (status: SUCCEEDED).
Job ID: 3
📋 Useful Commands
├── To cancel the job: sky cancel sky-3ccc-txia 3
├── To stream job logs: sky logs sky-3ccc-txia 3
└── To view job queue: sky queue sky-3ccc-txia
Cluster name: sky-3ccc-txia
├── To log into the head VM: ssh sky-3ccc-txia
├── To submit a job: sky exec sky-3ccc-txia yaml_file
├── To stop the cluster: sky stop sky-3ccc-txia
└── To teardown the cluster: sky down sky-3ccc-txia
Tip: `sky down` will delete launched TPU(s) too.
Thanks @landscapepainter! Tested it again and it works smoothly. After the above-mentioned issue is resolved, it should be ready to go!
I think our current cloud TPUs also behave in the same way, so allowing …
One of our users requested a feature to use spot TPUs on GKE. This is an initial step toward supporting that request: on-demand single-host TPUs.
This PR does not contain support for:
Tested (run the relevant ones):
bash format.sh
sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice --num-nodes 2 -y
sky launch --cloud kubernetes --cpus=2 -y
sky show-gpus --cloud kubernetes
sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice:4 -y
pytest tests/test_smoke.py (besides the ones also failing on the master branch)
pytest tests/test_smoke.py::test_tpu_pod_slice_gke --kubernetes
conda deactivate; bash -i tests/backward_compatibility_tests.sh
tpu_gke.yaml: