Graceful termination on out of memory #406

Closed
avik-pal opened this issue Dec 20, 2024 · 2 comments

@avik-pal
Collaborator

This particular case should go OOM, but I would expect an error on the Reactant end rather than a crash of my Julia session:

2024-12-20 12:16:23.529722: W external/xla/xla/tsl/framework/bfc_allocator.cc:512] *********___________________________________________________________________________________________
E1220 12:16:23.529862  741040 pjrt_stream_executor_client.cc:3086] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 4306723224 bytes.
terminate called after throwing an instance of 'xla::XlaRuntimeError'
  what():  RESOURCE_EXHAUSTED: Out of memory while trying to allocate 4306723224 bytes.

[741040] signal 6 (-6): Aborted
in expression starting at /mnt/software/lux/Lux.jl/examples/ConvMixer/main.jl:76
unknown function (ip: 0x70dc8f2703f4)
gsignal at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
__verbose_terminate_handler at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
__terminate at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
terminate at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
__cxa_throw at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
_ZN3xla12ValueOrThrowISt6vectorIS1_ISt10unique_ptrINS_10PjRtBufferESt14default_deleteIS3_EESaIS6_EESaIS8_EEEET_N4absl12lts_202308028StatusOrISB_EE at /mnt/.julia/artifacts/a7d008f5ba52e8657b34453e594bdc0e79c1f11d/lib/libReactantExtra.so (unknown line)
XLAExecute at /mnt/.julia/artifacts/a7d008f5ba52e8657b34453e594bdc0e79c1f11d/lib/libReactantExtra.so (unknown line)
macro expansion at /mnt/software/lux/Reactant.jl/src/XLA.jl:338 [inlined]
ExecutableCall at /mnt/software/lux/Reactant.jl/src/XLA.jl:315 [inlined]
macro expansion at /mnt/software/lux/Reactant.jl/src/Compiler.jl:694 [inlined]
Thunk at /mnt/software/lux/Reactant.jl/src/Compiler.jl:805
unknown function (ip: 0x70dafdea39c6)
single_train_step_impl! at /mnt/software/lux/Lux.jl/ext/LuxReactantExt/training.jl:43
single_train_step! at /mnt/software/lux/Lux.jl/src/helpers/training.jl:276
unknown function (ip: 0x70dafe98ec4e)
#main#11 at /mnt/software/lux/Lux.jl/examples/ConvMixer/main.jl:130
main at /mnt/software/lux/Lux.jl/examples/ConvMixer/main.jl:76 [inlined]
command_main at /mnt/.julia/packages/Comonicon/F3QqZ/src/codegen/julia.jl:343
command_main at /mnt/.julia/packages/Comonicon/F3QqZ/src/codegen/julia.jl:90
unknown function (ip: 0x70dafe3c1c6f)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_value at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_stmt_value at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2734
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:875
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:831
#invokelatest#2 at ./essentials.jl:1055
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:831
invokelatest at ./essentials.jl:1052
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:831
#inlineeval#76 at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/eval.jl:271
inlineeval at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/eval.jl:268
#69 at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/eval.jl:181
withpath at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/repl.jl:276
#68 at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/eval.jl:179
hideprompt at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/repl.jl:38
#67 at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/eval.jl:150 [inlined]
with_logstate at ./logging/logging.jl:522
with_logger at ./logging/logging.jl:632 [inlined]
#66 at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/eval.jl:263
unknown function (ip: 0x70dafa9ac42f)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1055 [inlined]
invokelatest at ./essentials.jl:1052
unknown function (ip: 0x70dc022fd822)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/builtins.c:831
#64 at /home/avikpal/.vscode/extensions/julialang.language-julia-1.127.2/scripts/packages/VSCodeServer/src/eval.jl:34
unknown function (ip: 0x70dc023484ff)
jl_apply at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
start_task at /cache/build/tester-amdci5-12/julialang/julia-release-1-dot-11/src/task.c:1202
Allocations: 533298938 (Pool: 533284064; Big: 14874); GC: 287
[1]    741040 IOT instruction (core dumped)  julia --project=examples/ConvMixer --threads=auto --check-bounds=yes
@mofeing
Collaborator

mofeing commented Dec 20, 2024

xref #380

@wsmoses
Member

wsmoses commented Dec 27, 2024

resolved on main

wsmoses closed this as completed on Dec 27, 2024