[LAUNCH BLOCKER?] Runner-AOTI does not work with CUDA #709

Closed
mikekgfb opened this issue May 7, 2024 · 6 comments · Fixed by #815

Comments

@mikekgfb
Contributor

mikekgfb commented May 7, 2024

The AOTI runner appears to fail: https://github.com/pytorch/torchchat/actions/runs/8977233102/job/24655581239?pr=707 (I think because it is trying to use CUDA?), while the macOS runner passes:
https://github.com/pytorch/torchchat/actions/runs/8977233063/job/24655581146?pr=707

  [222/222] Linking CXX executable aoti_run
  + printf 'Build finished. Please run: \n./cmake-out/aoti_run model.<pte|so> -z tokenizer.model -l <llama version (2 or 3)> -i <prompt>\n'
  Build finished. Please run: 
  ./cmake-out/aoti_run model.<pte|so> -z tokenizer.model -l <llama version (2 or 3)> -i <prompt>
  + cmake-out/aoti_run exportedModels/stories15M.so -z /root/.torchchat/model-cache/stories15M/tokenizer.model -l 2 -i 'Once upon a time'
  Tokenizer already initialized.
  Error: CUDA error: invalid argument
  terminate called after throwing an instance of 'std::runtime_error'
    what():  create_func_( &container_handle_, num_models, device_str.c_str(), cubin_dir.empty() ? nullptr : cubin_dir.c_str()) API call failed at ../torch/csrc/inductor/aoti_runner/model_container_runner.cpp, line 49
  ./we-run-this.sh: line 34:  1930 Aborted                 (core dumped) cmake-out/aoti_run exportedModels/stories15M.so -z ~/.torchchat/model-cache/stories15M/tokenizer.model -l 2 -i "Once upon a time"
  Traceback (most recent call last):
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 100, in <module>
      main()
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 96, in main
      run_cmd_or_die(f"docker exec -t {container_name} /exec")
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 38, in run_cmd_or_die
      raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
  RuntimeError: Command docker exec -t 98701b3d65bdcf6908e66be9b03165721126de05e708c941f2d6e22963dfaeaf /exec failed with exit code 134
  Error: Process completed with exit code 1.
@mikekgfb mikekgfb changed the title Runner-AOTI does not work with CUDA [LAUNCH BLOCKER?] Runner-AOTI does not work with CUDA May 7, 2024
@mikekgfb
Contributor Author

mikekgfb commented May 7, 2024

@ali-khosh @orionr @malfet Is non-functioning CUDA model execution with native C++ executor a launch blocker?

@ali-khosh
Contributor

Don't know enough about cuda to chime in.

@mikekgfb
Contributor Author

mikekgfb commented May 12, 2024

@malfet offered to look into this; calling this a launch blocker based on discussion and pending the outcome of that investigation.

@mikekgfb
Contributor Author

In addition to whatever libraries, we may also need to do a .to("cuda") of inputs and a .to("cpu") of outputs as part of the runner. I wish we could do this inside the model itself, but the model tracing starts with tensors already on CPU or GPU, so there is no transfer in the execution stream of model.forward() to be included.

Why would it be neat to have the transfers in the model? Because otherwise we need to figure out which device the model was built for (CPU or GPU) before we conditionally move inputs to that device, and move results back from it after the AOTI-compiled model.forward() returns. A rough sketch of what that might look like is below.

cc: @bertmaher as subject matter expert and author of llama2.so
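
For illustration only (not part of any actual patch): a minimal sketch of such a conditional move inside the runner's forward(). The `use_cuda` flag and `model_device` variable are hypothetical; they would have to come from however the runner learns which device the DSO was exported for. The tensor names match the run.cpp code shown in the diff below.

```cpp
// Hypothetical sketch: pick the model's device once, move the inputs to it,
// and bring the result back to CPU for the rest of the CPU-side sampling code.
torch::Device model_device = use_cuda ? torch::Device(torch::kCUDA)
                                      : torch::Device(torch::kCPU);

std::vector<torch::Tensor> inputs{
    token_tensor.to(model_device), pos_tensor.to(model_device)};

torch::Tensor result = transformer->runner->run(inputs)[0]
                           .to(torch::dtype(torch::kFloat32))
                           .to(torch::kCPU);
```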

@bertmaher

Okay, so I'm not sure what the right solution for this project is, but to work on CUDA the runner needs only a small diff (pasted below as text). Basically:

  1. Create an AOTIModelContainerRunnerCuda, and pass it a path to the cubin directory created by AOTI (I've hardcoded my /home/bertrand/local/torchchat, because of laziness)
  2. Move the inputs to the cuda device before calling the runner
```diff
diff --git a/runner/run.cpp b/runner/run.cpp
index e572bfe..2ae91f9 100644
--- a/runner/run.cpp
+++ b/runner/run.cpp
@@ -23,7 +23,7 @@
 #endif

 #ifdef __AOTI_MODEL__
-#include <torch/csrc/inductor/aoti_runner/model_container_runner_cpu.h>
+#include <torch/csrc/inductor/aoti_runner/model_container_runner_cuda.h>
 torch::Device cpu_device(torch::kCPU);

 #else // __ET_MODEL__
@@ -82,7 +82,7 @@ typedef struct {
   RunState state; // buffers for the "wave" of activations in the forward pass

 #ifdef __AOTI_MODEL__
-  torch::inductor::AOTIModelContainerRunnerCpu* runner;
+  torch::inductor::AOTIModelContainerRunner* runner;
 #else // __ET_MODEL__
   Module* runner;
 #endif
@@ -132,9 +132,12 @@ void build_transformer(
   malloc_run_state(&t->state, &t->config);

 #ifdef __AOTI_MODEL__
-  t->runner = new torch::inductor::AOTIModelContainerRunnerCpu(
+  t->runner = new torch::inductor::AOTIModelContainerRunnerCuda(
       /* path to model DSO */ model_path,
-      /* thread pool size  */ 1);
+      /* thread pool size  */ 1,
+      "cuda",
+      "/home/bertrand/local/torchchat"
+  );
 #else //__ET_MODEL__
   t->runner = new Module(
       /* path to PTE model */ model_path,
@@ -186,7 +189,7 @@ float* forward(Transformer* transformer, int token, int pos) {
   torch::Tensor token_tensor =
       torch::from_blob(token_buffer, {1, 1}, torch::kLong);
   torch::Tensor pos_tensor = torch::from_blob(pos_buffer, {1}, torch::kLong);
-  std::vector<torch::Tensor> inputs{token_tensor, pos_tensor};
+  std::vector<torch::Tensor> inputs{token_tensor.to(torch::kCUDA), pos_tensor.to(torch::kCUDA)};

   torch::Tensor result = transformer->runner->run(inputs)[0]
                             .to(torch::dtype(torch::kFloat32))
```

@mikekgfb
Contributor Author

This is also the root cause for #707

malfet added a commit that referenced this issue May 18, 2024
…815)

By wrapping the attempt to load a model in `try {} catch (std::runtime_error) {}` and attempting to create the model on GPU first, since an attempt to load a CPU model on CUDA destroys the CUDA context (bugs/fixes against PyTorch are coming, tracked in pytorch/pytorch#126547).

Also, fix two bugs in the repo:
 - Initialize `Tokenizer::initialized_` to false
 - Change name of the tokenizer file in a workflow from `tokenizer.bin` to `tokenizer.model`


Fixes #709

Test plan:
```
python3 torchchat.py export --checkpoint-path checkpoints/stories15M/model.pth --output-dso-path model_cpu.so --device cpu
python3 torchchat.py export --checkpoint-path checkpoints/stories15M/model.pth --output-dso-path model.so
./cmake-out/aoti_run ./model.so -z checkpoints/stories15M/tokenizer.model
./cmake-out/aoti_run ./model_cpu.so -z checkpoints/stories15M/tokenizer.model
```
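
The gist of that GPU-first fallback, as a hedged sketch (not the exact code merged in #815); the `load_runner` helper name is made up for illustration:

```cpp
// Sketch: try to open the DSO with the CUDA runner first and fall back to the
// CPU runner on std::runtime_error. Per the commit message, loading a CPU
// model via CUDA currently destroys the CUDA context (pytorch/pytorch#126547),
// which the CPU fallback path does not need anyway.
#include <stdexcept>
#include <string>
#include <torch/csrc/inductor/aoti_runner/model_container_runner_cpu.h>
#include <torch/csrc/inductor/aoti_runner/model_container_runner_cuda.h>

torch::inductor::AOTIModelContainerRunner* load_runner(
    const std::string& model_path) {
  try {
    return new torch::inductor::AOTIModelContainerRunnerCuda(
        /* model_so_path */ model_path, /* num_models */ 1);
  } catch (const std::runtime_error&) {
    return new torch::inductor::AOTIModelContainerRunnerCpu(
        /* model_so_path */ model_path, /* num_models */ 1);
  }
}
```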
malfet added commits that referenced this issue Jul 17, 2024