[LAUNCH BLOCKER?] Runner-AOTI does not work with CUDA #709

Closed
mikekgfb opened this issue May 7, 2024 · 6 comments · Fixed by #815

Comments

@mikekgfb
Contributor

mikekgfb commented May 7, 2024

The AOTI runner appears to fail: https://github.com/pytorch/torchchat/actions/runs/8977233102/job/24655581239?pr=707 (I think because it is trying to use CUDA?), while the macOS runner passes:
https://github.com/pytorch/torchchat/actions/runs/8977233063/job/24655581146?pr=707

  [222/222] Linking CXX executable aoti_run
  + printf 'Build finished. Please run: \n./cmake-out/aoti_run model.<pte|so> -z tokenizer.model -l <llama version (2 or 3)> -i <prompt>\n'
  Build finished. Please run: 
  ./cmake-out/aoti_run model.<pte|so> -z tokenizer.model -l <llama version (2 or 3)> -i <prompt>
  + cmake-out/aoti_run exportedModels/stories15M.so -z /root/.torchchat/model-cache/stories15M/tokenizer.model -l 2 -i 'Once upon a time'
  Tokenizer already initialized.
  Error: CUDA error: invalid argument
  terminate called after throwing an instance of 'std::runtime_error'
    what():  create_func_( &container_handle_, num_models, device_str.c_str(), cubin_dir.empty() ? nullptr : cubin_dir.c_str()) API call failed at ../torch/csrc/inductor/aoti_runner/model_container_runner.cpp, line 49
  ./we-run-this.sh: line 34:  1930 Aborted                 (core dumped) cmake-out/aoti_run exportedModels/stories15M.so -z ~/.torchchat/model-cache/stories15M/tokenizer.model -l 2 -i "Once upon a time"
  Traceback (most recent call last):
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 100, in <module>
      main()
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 96, in main
      run_cmd_or_die(f"docker exec -t {container_name} /exec")
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 38, in run_cmd_or_die
      raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
  RuntimeError: Command docker exec -t 98701b3d65bdcf6908e66be9b03165721126de05e708c941f2d6e22963dfaeaf /exec failed with exit code 134
  Error: Process completed with exit code 1.
@mikekgfb mikekgfb changed the title Runner-AOTI does not work with CUDA [LAUNCH BLOCKER?] Runner-AOTI does not work with CUDA May 7, 2024
@mikekgfb
Contributor Author

mikekgfb commented May 7, 2024

@ali-khosh @orionr @malfet Is non-functioning CUDA model execution with native C++ executor a launch blocker?

@ali-khosh
Contributor

Don't know enough about cuda to chime in.

@mikekgfb
Contributor Author

mikekgfb commented May 12, 2024

@malfet offered to look into this; calling this a launch blocker based on discussion and pending the outcome of that investigation.

@mikekgfb
Contributor Author

In addition to whatever libraries, we may also need to do a .to("cuda") of inputs and a .to("cpu") of outputs as part of the runner. I wish we could do this inside the model itself, but the model tracing starts with tensors already on CPU or GPU, so there is no transfer in the execution stream of model.forward() to be included.

Why would it be neat to have the transfers in the model? Because otherwise we need to figure out which device the model was built for (CPU or GPU) before we conditionally move inputs to that device, and move results back from it after the AOTI-compiled model.forward() returns. A rough sketch of what that might look like is below.

cc: @bertmaher as subject matter expert and author of llama2.so
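
For illustration only (not part of any actual patch): a minimal sketch of such a conditional move inside the runner's forward(). The `use_cuda` flag and `model_device` variable are hypothetical; they would have to come from however the runner learns which device the DSO was exported for. The tensor names match the run.cpp code shown in the diff below.

```cpp
// Hypothetical sketch: pick the model's device once, move the inputs to it,
// and bring the result back to CPU for the rest of the CPU-side sampling code.
torch::Device model_device = use_cuda ? torch::Device(torch::kCUDA)
                                      : torch::Device(torch::kCPU);

std::vector<torch::Tensor> inputs{
    token_tensor.to(model_device), pos_tensor.to(model_device)};

torch::Tensor result = transformer->runner->run(inputs)[0]
                           .to(torch::dtype(torch::kFloat32))
                           .to(torch::kCPU);
```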

@bertmaher

Okay, so I'm not sure what the right solution for this project is, but to work on CUDA the runner needs only a small diff (pasted below as text). Basically:

  1. Create an AOTIModelContainerRunnerCuda, and pass it a path to the cubin directory created by AOTI (I've hardcoded my /home/bertrand/local/torchchat, because of laziness)
  2. Move the inputs to the cuda device before calling the runner
```diff
diff --git a/runner/run.cpp b/runner/run.cpp
index e572bfe..2ae91f9 100644
--- a/runner/run.cpp
+++ b/runner/run.cpp
@@ -23,7 +23,7 @@
 #endif

 #ifdef __AOTI_MODEL__
-#include <torch/csrc/inductor/aoti_runner/model_container_runner_cpu.h>
+#include <torch/csrc/inductor/aoti_runner/model_container_runner_cuda.h>
 torch::Device cpu_device(torch::kCPU);

 #else // __ET_MODEL__
@@ -82,7 +82,7 @@ typedef struct {
   RunState state; // buffers for the "wave" of activations in the forward pass

 #ifdef __AOTI_MODEL__
-  torch::inductor::AOTIModelContainerRunnerCpu* runner;
+  torch::inductor::AOTIModelContainerRunner* runner;
 #else // __ET_MODEL__
   Module* runner;
 #endif
@@ -132,9 +132,12 @@ void build_transformer(
   malloc_run_state(&t->state, &t->config);

 #ifdef __AOTI_MODEL__
-  t->runner = new torch::inductor::AOTIModelContainerRunnerCpu(
+  t->runner = new torch::inductor::AOTIModelContainerRunnerCuda(
       /* path to model DSO */ model_path,
-      /* thread pool size  */ 1);
+      /* thread pool size  */ 1,
+      "cuda",
+      "/home/bertrand/local/torchchat"
+  );
 #else //__ET_MODEL__
   t->runner = new Module(
       /* path to PTE model */ model_path,
@@ -186,7 +189,7 @@ float* forward(Transformer* transformer, int token, int pos) {
   torch::Tensor token_tensor =
       torch::from_blob(token_buffer, {1, 1}, torch::kLong);
   torch::Tensor pos_tensor = torch::from_blob(pos_buffer, {1}, torch::kLong);
-  std::vector<torch::Tensor> inputs{token_tensor, pos_tensor};
+  std::vector<torch::Tensor> inputs{token_tensor.to(torch::kCUDA), pos_tensor.to(torch::kCUDA)};

   torch::Tensor result = transformer->runner->run(inputs)[0]
                             .to(torch::dtype(torch::kFloat32))
```

@mikekgfb
Contributor Author

This is also the root cause for #707

malfet added a commit that referenced this issue May 18, 2024
…815)

By wrapping the attempt to load a model in `try {} catch (std::runtime_error) {}` and attempting to create the model on GPU first, since an attempt to load a CPU model on CUDA destroys the CUDA context (bugs/fixes against PyTorch are coming, tracked in pytorch/pytorch#126547).

Also, fix two bugs in the repo:
 - Initialize `Tokenizer::initialized_` to false
 - Change name of the tokenizer file in a workflow from `tokenizer.bin` to `tokenizer.model`


Fixes #709

Test plan:
```
python3 torchchat.py export --checkpoint-path checkpoints/stories15M/model.pth --output-dso-path model_cpu.so --device cpu
python3 torchchat.py export --checkpoint-path checkpoints/stories15M/model.pth --output-dso-path model.so
./cmake-out/aoti_run ./model.so -z checkpoints/stories15M/tokenizer.model
./cmake-out/aoti_run ./model_cpu.so -z checkpoints/stories15M/tokenizer.model
```
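
The gist of that GPU-first fallback, as a hedged sketch (not the exact code merged in #815); the `load_runner` helper name is made up for illustration:

```cpp
// Sketch: try to open the DSO with the CUDA runner first and fall back to the
// CPU runner on std::runtime_error. Per the commit message, loading a CPU
// model via CUDA currently destroys the CUDA context (pytorch/pytorch#126547),
// which the CPU fallback path does not need anyway.
#include <stdexcept>
#include <string>
#include <torch/csrc/inductor/aoti_runner/model_container_runner_cpu.h>
#include <torch/csrc/inductor/aoti_runner/model_container_runner_cuda.h>

torch::inductor::AOTIModelContainerRunner* load_runner(
    const std::string& model_path) {
  try {
    return new torch::inductor::AOTIModelContainerRunnerCuda(
        /* model_so_path */ model_path, /* num_models */ 1);
  } catch (const std::runtime_error&) {
    return new torch::inductor::AOTIModelContainerRunnerCpu(
        /* model_so_path */ model_path, /* num_models */ 1);
  }
}
```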
malfet added commits that referenced this issue Jul 17, 2024