Segfault in array initialisation #305

Closed
luraess opened this issue Oct 5, 2022 · 25 comments
Labels
bug Something isn't working


@luraess
Contributor

luraess commented Oct 5, 2022

AMDGPU 0.4.3 segfaults (the process aborts with signal 6) upon array initialisation, with either AMDGPU.ones(Float64,2,2) or ROCArray(ones(2,2)). Also, it is unclear to me why it now uses hipStreamSynchronize. (@jpsamaroo or @vchuravy do you have any insights on what's going on here?)

This occurs using Julia 1.8.2 and ROCm 4.3 on a system with Vega20 (gfx906) cards using artifacts.

It looks like JULIA_AMDGPU_DISABLE_ARTIFACTS no longer has any effect.

With AMDGPU 0.4.2 everything works fine, and the env var to disable artifacts also works.

julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Path: /users/lraess/.julia/artifacts/20030eaff1f8f47d3646fc99d415a823516778d7/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /users/lraess/.julia/artifacts/c86785d1da1b021c7790274eb700100581e341a5/tools/lld
ROCm-Device-Libs (ready)
- Path: /users/lraess/.julia/artifacts/2a4556d4ad40fc77472f80ad8a090d3ffea854bb/amdgcn/bitcode
HIP Runtime (ready)
- Path: /users/lraess/.julia/artifacts/5c755583cc986bc9e32bd640cdb0045f19a094bd/hip/lib/libamdhip64.so
rocBLAS (ready)
- Path: /users/lraess/.julia/artifacts/1fd1782850f6482c5bdbeec69cb3d1e43d949a3c/lib/librocblas.so
rocSOLVER (MISSING)
rocALUTION (MISSING)
rocSPARSE (ready)
- Path: /users/lraess/.julia/artifacts/edcc84addc716a5f1f4ee7af62643e616ce23886/rocsparse/lib/librocsparse.so
rocRAND (ready)
- Path: /users/lraess/.julia/artifacts/6f32e80b256ed82bae6982288b859e244a36479d/lib/librocrand.so
rocFFT (MISSING)
MIOpen (MISSING)
HSA Agents (6):
- CPU-XX [AMD EPYC 7742 64-Core Processor]
- CPU-XX [AMD EPYC 7742 64-Core Processor]
- GPU-c616498172e626c4 [gfx906]
- GPU-ebae40e172e620f6 [gfx906]
- GPU-9ab2894172df8896 [gfx906]
- GPU-3f28890172e620f4 [gfx906]

Here are the errors when using AMDGPU.ones(Float64,2,2):

julia> AMDGPU.ones(Float64,2,2)
free(): invalid size

signal (6): Aborted
in expression starting at REPL[3]:1
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
__libc_message at /lib64/libc.so.6 (unknown line)
malloc_printerr at /lib64/libc.so.6 (unknown line)
_int_free at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x155111adbd69)
unknown function (ip: 0x155111adbed8)
unknown function (ip: 0x155111add075)
unknown function (ip: 0x155111a3f65b)
unknown function (ip: 0x155111ad42ee)
unknown function (ip: 0x1551117b6af8)
hipStreamSynchronize at /users/lraess/.julia/artifacts/5c755583cc986bc9e32bd640cdb0045f19a094bd/hip/lib/libamdhip64.so (unknown line)
macro expansion at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/hip/error.jl:149 [inlined]
hipStreamSynchronize at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/hip/libhip.jl:2
wait! at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/sync.jl:22
wait! at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/array.jl:88 [inlined]
#57 at ./tuple.jl:555 [inlined]
BottomRF at ./reduce.jl:81 [inlined]
afoldl at ./operators.jl:549 [inlined]
_foldl_impl at ./tuple.jl:277 [inlined]
foldl_impl at ./reduce.jl:48 [inlined]
mapfoldl_impl at ./reduce.jl:44 [inlined]
#mapfoldl#259 at ./reduce.jl:170 [inlined]
mapfoldl##kw at ./reduce.jl:170 [inlined]
#foldl#260 at ./reduce.jl:193 [inlined]
foldl##kw at ./reduce.jl:193 [inlined]
foreach at ./tuple.jl:555 [inlined]
macro expansion at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/highlevel.jl:372 [inlined]
#gpu_call#57 at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/array.jl:14
gpu_call##kw at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/array.jl:11 [inlined]
#gpu_call#1 at /users/lraess/.julia/packages/GPUArrays/fqD8z/src/device/execution.jl:65 [inlined]
gpu_call at /users/lraess/.julia/packages/GPUArrays/fqD8z/src/device/execution.jl:34 [inlined]
fill! at /users/lraess/.julia/packages/GPUArrays/fqD8z/src/host/construction.jl:14
unknown function (ip: 0x1551310e4e8a)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2549
ones at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/array.jl:397
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/julia.h:1839 [inlined]
do_call at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/interpreter.c:215
eval_stmt_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/interpreter.c:166 [inlined]
eval_body at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/interpreter.c:612
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/interpreter.c:750
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/toplevel.c:906
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/toplevel.c:850
ijl_toplevel_eval_in at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/toplevel.c:965
eval at ./boot.jl:368 [inlined]
eval_user_input at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:151
repl_backend_loop at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:247
start_repl_backend at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:232
#run_repl#47 at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:369
run_repl at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/REPL/src/REPL.jl:355
jfptr_run_repl_64841.clone_1 at /users/lraess/julia_local/julia-1.8.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2549
#967 at ./client.jl:419
jfptr_YY.967_30403.clone_1 at /users/lraess/julia_local/julia-1.8.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/julia.h:1839 [inlined]
jl_f__call_latest at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/builtins.c:774
#invokelatest#2 at ./essentials.jl:729 [inlined]
invokelatest at ./essentials.jl:726 [inlined]
run_main_repl at ./client.jl:404
exec_options at ./client.jl:318
_start at ./client.jl:522
jfptr__start_56736.clone_1 at /users/lraess/julia_local/julia-1.8.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2367 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gf.c:2549
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/julia.h:1839 [inlined]
true_main at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/jlapi.c:575
jl_repl_entrypoint at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/jlapi.c:719
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
Allocations: 40823942 (Pool: 40806986; Big: 16956); GC: 41
/scratch/lraess/dev/ROCm-MPI/scripts/./runme.sh: line 7: 760425 Aborted                 (core dumped) julia --project
srun: error: ault20: task 0: Exited with exit code 134

And here using ROCArray(ones(2,2)):

julia> ROCArray(ones(2,2))
2×2 ROCMatrix{Float64}:
free(): double free detected in tcache 2

signal (6): Aborted
in expression starting at none:0
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
__libc_message at /lib64/libc.so.6 (unknown line)
malloc_printerr at /lib64/libc.so.6 (unknown line)
_int_free at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x155111a3f7f2)
unknown function (ip: 0x155111adbed8)
unknown function (ip: 0x155111add075)
unknown function (ip: 0x155111a3f65b)
unknown function (ip: 0x155111ad42ee)
unknown function (ip: 0x1551117b6af8)
hipStreamSynchronize at /users/lraess/.julia/artifacts/5c755583cc986bc9e32bd640cdb0045f19a094bd/hip/lib/libamdhip64.so (unknown line)
macro expansion at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/hip/error.jl:149 [inlined]
hipStreamSynchronize at /users/lraess/.julia/packages/AMDGPU/8XHd9/src/hip/libhip.jl:2
malloc(): smallbin double linked list corrupted
/scratch/lraess/dev/ROCm-MPI/scripts/./runme.sh: line 7: 761707 Aborted                 (core dumped) julia --project
srun: error: ault20: task 0: Exited with exit code 134
luraess added the bug (Something isn't working) label Oct 5, 2022
@pxl-th
Member

pxl-th commented Oct 5, 2022

Regarding artifacts on 0.4.3, can you post the output of ]st?
I suspect that at least some of the artifacts will be version 5.2.3.

@luraess
Contributor Author

luraess commented Oct 5, 2022

  [21141c5a] AMDGPU v0.4.3
  [4d7a3746] ImplicitGlobalGrid v0.12.0 `https://github.com/luraess/ImplicitGlobalGrid.jl#lr/amdgpu-0.4.x-support`
  [3da0fdf6] MPIPreferences v0.1.5
  [91a5bcdd] Plots v1.35.2

@pxl-th
Member

pxl-th commented Oct 5, 2022

Ah, versions for artifacts are shown only when you are in the AMDGPU.jl project (you can ]dev AMDGPU for example), like so:

pxl-th@Yotun:~/.julia/dev/AMDGPU$ HSA_OVERRIDE_GFX_VERSION=10.3.0 julia -t8 --project=.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-DEV.1437 (2022-09-26)
 _/ |\__'_|_|_|\__'_|  |  Commit 26304f763cf (8 days old master)
|__/                   |

(AMDGPU) pkg> st
Project AMDGPU v0.4.2
Status `~/.julia/dev/AMDGPU/Project.toml`
  [621f4979] AbstractFFTs v1.2.1
  [79e6a3ab] Adapt v3.4.0
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.2
  [f68482b8] Cthulhu v2.7.3
  [e2ba6199] ExprTools v0.1.8
  [0c68f7d7] GPUArrays v8.5.0
  [61eb1bfa] GPUCompiler v0.16.4
  [929cbde3] LLVM v4.14.0
⌃ [1914dd2f] MacroTools v0.5.9
  [21216c6a] Preferences v1.3.0
⌃ [efd6af41] ProfileCanvas v0.1.4
  [ae029012] Requires v1.3.0
  [efcf1570] Setfield v1.1.1
  [276daf66] SpecialFunctions v2.1.7
  [2696aab5] HIP_jll v5.2.3+1
  [d55e3150] LLD_jll v14.0.6+0
  [86de99a1] LLVM_jll v14.0.6+0
  [873c0968] ROCmDeviceLibs_jll v5.2.3+0
  [dd59ff1a] hsa_rocr_jll v5.2.3+0
  [1ef8cab2] rocBLAS_jll v5.2.3+2 `~/.julia/dev/rocBLAS_jll`
  [a6151927] rocRAND_jll v5.2.3+0
  [8c6ce2ba] rocSPARSE_jll v5.2.3+0
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [9a3f8284] Random
  [10745b16] Statistics

As for why JULIA_AMDGPU_DISABLE_ARTIFACTS does not work, I'd try checking that this line is false, but you'd need to ]dev AMDGPU.
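
The JLL versions can probably also be inspected without ]dev-ing the package by querying the resolved manifest through Pkg.dependencies(). The following is only a minimal sketch: the helper names jll_versions and manifest_versions are made up for illustration, and it assumes the active project has AMDGPU in its manifest.

```julia
using Pkg

# Hypothetical helper: keep only the binary-wrapper (JLL) packages from a
# collection of name => version pairs and return them sorted by name.
function jll_versions(pkgs)
    jlls = [(name, ver) for (name, ver) in pkgs if endswith(name, "_jll")]
    return sort(jlls; by = first)
end

# Collect name => version for every package in the active manifest,
# including indirect dependencies such as HIP_jll.
function manifest_versions()
    return [info.name => info.version for (_, info) in Pkg.dependencies()]
end

# Usage (inside the project that depends on AMDGPU):
#   for (name, ver) in jll_versions(manifest_versions())
#       println(rpad(name, 24), ver)
#   end
```

The Pkg REPL command ]st -m prints the same manifest-level information interactively.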

@luraess
Contributor Author

luraess commented Oct 5, 2022

Getting

(AMDGPU) pkg> st
Project AMDGPU v0.4.3
Status `/scratch/lraess/dev/AMDGPU.jl/Project.toml`
  [621f4979] AbstractFFTs v1.2.1
  [79e6a3ab] Adapt v3.4.0
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.2
  [e2ba6199] ExprTools v0.1.8
  [0c68f7d7] GPUArrays v8.5.0
  [61eb1bfa] GPUCompiler v0.16.4
→ [929cbde3] LLVM v4.14.0 `https://github.com/maleadt/LLVM.jl.git#master`
  [1914dd2f] MacroTools v0.5.9
  [21216c6a] Preferences v1.3.0
  [ae029012] Requires v1.3.0
  [efcf1570] Setfield v1.1.1
  [276daf66] SpecialFunctions v2.1.7
 ⌅ [2696aab5] HIP_jll v4.2.0+2
 ⌅ [d55e3150] LLD_jll v12.0.0
 ⌅ [86de99a1] LLVM_jll v12.0.1+4
 ⌅ [873c0968] ROCmDeviceLibs_jll v4.2.0+1
 ⌅ [dd59ff1a] hsa_rocr_jll v4.2.0+2
→⌃ [1ef8cab2] rocBLAS_jll v4.2.0+0
 ⌃ [a6151927] rocRAND_jll v4.2.0+0
→⌃ [8c6ce2ba] rocSPARSE_jll v4.2.0+0
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [10745b16] Statistics
Info Packages marked with → are not downloaded, use `instantiate` to download
Info Packages marked with ⌃ and ⌅ have new versions available, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`

@pxl-th
Member

pxl-th commented Oct 5, 2022

You have artifacts of version 4.2.0, which is for Julia 1.7.
Try updating them to 4.5.2 since you are on 1.8 (make sure they are not 5.2.3 which is for 1.9).
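
The version pairing described above can be written down as a small check. This is a sketch under the assumption that the mapping is exactly the one stated in this thread (4.2.0 for Julia 1.7, 4.5.2 for 1.8, 5.2.3 for 1.9); the function names are hypothetical and not part of AMDGPU.jl.

```julia
# Mapping stated in this thread: ROCm 4.2.0 artifacts target Julia 1.7,
# 4.5.2 targets Julia 1.8, and 5.2.3 targets Julia 1.9.
function expected_rocm_version(v::VersionNumber = VERSION)
    key = (v.major, v.minor)
    key == (1, 7) && return v"4.2.0"
    key == (1, 8) && return v"4.5.2"
    key == (1, 9) && return v"5.2.3"
    return nothing  # other Julia versions are not covered in this thread
end

# Drop prerelease/build metadata so that v"4.5.2+1" compares equal to v"4.5.2".
base_version(v::VersionNumber) = VersionNumber(v.major, v.minor, v.patch)

# Hypothetical helper: names of the packages whose version differs from
# `expected`. Pass only the ROCm JLLs; LLD_jll and LLVM_jll follow LLVM's
# own version numbers and would always be reported.
function mismatched(jlls, expected::VersionNumber)
    return [name for (name, ver) in jlls if base_version(ver) != expected]
end
```

For example, on Julia 1.8 a list containing HIP_jll v5.2.3+1 next to hsa_rocr_jll v4.5.2+1 would report HIP_jll as the mismatch.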

@luraess
Contributor Author

luraess commented Oct 5, 2022

My bad, I did not instantiate the project. Here is the output after I instantiate and update:

(AMDGPU) pkg> st
Project AMDGPU v0.4.3
Status `/scratch/lraess/dev/AMDGPU.jl/Project.toml`
  [621f4979] AbstractFFTs v1.2.1
  [79e6a3ab] Adapt v3.4.0
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.2
  [e2ba6199] ExprTools v0.1.8
  [0c68f7d7] GPUArrays v8.5.0
  [61eb1bfa] GPUCompiler v0.16.4
  [929cbde3] LLVM v4.14.0 `https://github.com/maleadt/LLVM.jl.git#master`
  [1914dd2f] MacroTools v0.5.10
  [21216c6a] Preferences v1.3.0
  [ae029012] Requires v1.3.0
  [efcf1570] Setfield v1.1.1
  [276daf66] SpecialFunctions v2.1.7
  [2696aab5] HIP_jll v5.2.3+1
⌅ [d55e3150] LLD_jll v12.0.0
⌅ [86de99a1] LLVM_jll v13.0.1+3
⌅ [873c0968] ROCmDeviceLibs_jll v4.5.2+0
⌅ [dd59ff1a] hsa_rocr_jll v4.5.2+1
  [1ef8cab2] rocBLAS_jll v5.2.3+1
  [a6151927] rocRAND_jll v5.2.3+0
  [8c6ce2ba] rocSPARSE_jll v5.2.3+0
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [9a3f8284] Random
  [10745b16] Statistics
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`

(AMDGPU) pkg> 

So it seems HSA is hsa_rocr_jll v4.5.2+1.

@pxl-th
Member

pxl-th commented Oct 5, 2022

That's fine, it is still version 4.5.2 (build revision 1).
If you need HIP_jll, rocBLAS_jll, rocRAND_jll, rocSPARSE_jll, set them to version 4.5.2 as well.

And do you still have issues with AMDGPU.ones(Float64, 2, 2)?

@luraess
Contributor Author

luraess commented Oct 5, 2022

And do you still have issues with AMDGPU.ones(Float64, 2, 2)?

Yep, still same segfault as before.

@luraess
Contributor Author

luraess commented Oct 5, 2022

Now, if I manually change this line to const use_artifacts = @load_preference("use_artifacts", false) (default true -> false), it becomes possible to parse the ENV var and not use artifacts. Then it works:

julia> using AMDGPU

julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Path: /apps/ault/spack/opt/spack/linux-centos8-zen/gcc-8.4.1/hsa-rocr-dev-4.3.1-rpmxpbqcx7frbgvaz47bqngp7dnubtpn/lib/libhsa-runtime64.so.1
- Version: 1.1.0
ld.lld (ready)
- Path: /apps/ault/spack/opt/spack/linux-centos8-zen/gcc-8.4.1/llvm-amdgpu-4.2.0-rsmtqpi3nz4w2vj5qnvrghl5uyip5iy4/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /apps/ault/spack/opt/spack/linux-centos8-zen/gcc-8.4.1/llvm-amdgpu-4.3.1-hc5o4e3ffd6uxgm3wwmjf3mzndgwf6sn/amdgcn/bitcode
HIP Runtime (ready)
- Path: /apps/ault/spack/opt/spack/linux-centos8-zen/gcc-8.4.1/hip-4.3.1-izzfehtmyzlv2akxeoul4axzpyuqhpyx/lib/libamdhip64.so
rocBLAS (MISSING)
rocSOLVER (MISSING)
rocALUTION (MISSING)
rocSPARSE (MISSING)
rocRAND (MISSING)
rocFFT (MISSING)
MIOpen (MISSING)
HSA Agents (6):
- CPU-XX [AMD EPYC 7742 64-Core Processor]
- CPU-XX [AMD EPYC 7742 64-Core Processor]
- GPU-c616498172e626c4 [Vega 20 WKS GL-XE [Radeon Pro VII] (gfx906)]
- GPU-ebae40e172e620f6 [Vega 20 WKS GL-XE [Radeon Pro VII] (gfx906)]
- GPU-9ab2894172df8896 [Vega 20 WKS GL-XE [Radeon Pro VII] (gfx906)]
- GPU-3f28890172e620f4 [Vega 20 WKS GL-XE [Radeon Pro VII] (gfx906)]

julia> AMDGPU.ones(2,2)
2×2 ROCMatrix{Float32}:
 1.0  1.0
 1.0  1.0

@pxl-th
Member

pxl-th commented Oct 5, 2022

And do you still have issues with AMDGPU.ones(Float64, 2, 2)?

Yep, still same segfault as before.

Hm... try making HIP_jll of 4.5.2 version. Since we call hipStreamSynchronize, which comes from HIP_jll, it needs to be 4.5.2 as well.

@luraess
Contributor Author

luraess commented Oct 5, 2022

try making HIP_jll of 4.5.2 version

How can you achieve this?

Also, was there a recent change to use HIP for stream sync instead of HSA?

@pxl-th
Member

pxl-th commented Oct 5, 2022

try making HIP_jll of 4.5.2 version

How can you achieve this?

Also, was there a recent change to use HIP for stream sync instead of HSA?

Use the ]add HIP_jll@4.5.2 command.
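
The same request can presumably be made through the Pkg API for all four ROCm JLLs mentioned earlier in this thread. This is a sketch, not a verified recipe: the helper names are hypothetical, and it assumes a 4.5.2 release exists in the registry for each listed package.

```julia
using Pkg

# Build one PackageSpec per JLL, all requesting the same version.
# The default names are the ROCm JLLs listed in this thread.
function rocm_specs(version::AbstractString;
                    names = ["HIP_jll", "rocBLAS_jll", "rocRAND_jll", "rocSPARSE_jll"])
    return [Pkg.PackageSpec(name = name, version = version) for name in names]
end

# Hypothetical wrapper: resolves and downloads, so it needs registry access.
# It is only defined here, not called.
add_rocm_jlls(version::AbstractString = "4.5.2") = Pkg.add(rocm_specs(version))
```

Pkg.add with a version behaves like ]add name@version; Pkg.pin could additionally be used to hold the versions in place across later updates.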

@luraess
Contributor Author

luraess commented Oct 5, 2022

Moreover, why does one need to tweak artifact-usage parsing manually? Shouldn't this happen automatically? Or was there a design decision to rely only on artifacts from now on?

@pxl-th
Member

pxl-th commented Oct 5, 2022

Also, was there a recent change to use HIP for stream sync instead of HSA?

Both HSA and HIP are still used, although I think that the HIP sync is not always needed.
But it is there as a safeguard... at least for now.

@luraess
Contributor Author

luraess commented Oct 5, 2022

After adding HIP_jll@4.5.2 it now works when using artifacts:

julia> using AMDGPU
[ Info: Precompiling AMDGPU [21141c5a-9bdb-4563-92ae-f87d6854732e]

julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Path: /users/lraess/.julia/artifacts/20030eaff1f8f47d3646fc99d415a823516778d7/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /users/lraess/.julia/artifacts/c86785d1da1b021c7790274eb700100581e341a5/tools/lld
ROCm-Device-Libs (ready)
- Path: /users/lraess/.julia/artifacts/2a4556d4ad40fc77472f80ad8a090d3ffea854bb/amdgcn/bitcode
HIP Runtime (ready)
- Path: /users/lraess/.julia/artifacts/75c9db09e48dc45cbebd0ba1127243517985c8d3/hip/lib/libamdhip64.so
rocBLAS (MISSING)
rocSOLVER (MISSING)
rocALUTION (MISSING)
rocSPARSE (MISSING)
rocRAND (MISSING)
rocFFT (MISSING)
MIOpen (MISSING)
HSA Agents (6):
- CPU-XX [AMD EPYC 7742 64-Core Processor]
- CPU-XX [AMD EPYC 7742 64-Core Processor]
- GPU-c616498172e626c4 [gfx906]
- GPU-ebae40e172e620f6 [gfx906]
- GPU-9ab2894172df8896 [gfx906]
- GPU-3f28890172e620f4 [gfx906]

julia> AMDGPU.ones(2,2)
2×2 ROCMatrix{Float32}:
 1.0  1.0
 1.0  1.0

(AMDGPU) pkg> st
Project AMDGPU v0.4.3
Status `/scratch/lraess/dev/AMDGPU.jl/Project.toml`
  [621f4979] AbstractFFTs v1.2.1
  [79e6a3ab] Adapt v3.4.0
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.2
  [e2ba6199] ExprTools v0.1.8
  [0c68f7d7] GPUArrays v8.5.0
  [61eb1bfa] GPUCompiler v0.16.4
  [929cbde3] LLVM v4.14.0 `https://github.com/maleadt/LLVM.jl.git#master`
  [1914dd2f] MacroTools v0.5.10
  [21216c6a] Preferences v1.3.0
  [ae029012] Requires v1.3.0
  [efcf1570] Setfield v1.1.1
  [276daf66] SpecialFunctions v2.1.7
⌃ [2696aab5] HIP_jll v4.5.2+2
⌅ [d55e3150] LLD_jll v12.0.0
⌅ [86de99a1] LLVM_jll v13.0.1+3
⌅ [873c0968] ROCmDeviceLibs_jll v4.5.2+0
⌅ [dd59ff1a] hsa_rocr_jll v4.5.2+1
  [1ef8cab2] rocBLAS_jll v5.2.3+1
  [a6151927] rocRAND_jll v5.2.3+0
  [8c6ce2ba] rocSPARSE_jll v5.2.3+0
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [9a3f8284] Random
  [10745b16] Statistics
Info Packages marked with ⌃ and ⌅ have new versions available, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`

(AMDGPU) pkg> 

@luraess
Contributor Author

luraess commented Oct 5, 2022

So what is unclear to me is:

  • how can "compatible" artifacts be downloaded without the need to manually dev and engineer the pkg
  • why is the artifact-parsing not parsing the ENV var

@pxl-th
Member

pxl-th commented Oct 5, 2022

Moreover, why does one manually need to tweak artifact parsing? This should happen automatically?

If by artifact parsing you mean the stuff that happens in deps/bindeps.jl, I think it is because AMDGPU.jl allows using system-wide installations instead of artifacts.

Or was there a design decision to rely only on artifacts from now on?

I think mixing system-wide installations with artifacts is not allowed anymore as it may cause issues. But both cases are still supported.

There are now more and more artifacts available for ROCm-related libraries, like rocBLAS, rocSPARSE and, most recently, MIOpen.

@luraess
Contributor Author

luraess commented Oct 5, 2022

I think it is because AMDGPU.jl allows to not use artifacts but system-wide installations.

Yes, exactly. But setting JULIA_AMDGPU_DISABLE_ARTIFACTS=1 did not select the system-wide install. I needed to set the default to false in the line you pointed out in deps/bindeps.jl in order to not use artifacts.

@pxl-th
Member

pxl-th commented Oct 5, 2022

So what is unclear to me is:

  • how can "compatible" artifacts be downloaded without need to manually dev and engineer the pkg

It should automatically download the correct versions, but two things may prevent this. First, AMDGPU is shipping Manifest.toml files, which have hardcoded versions.

Second (this was probably introduced by me 😅), artifacts of version 5.2.3 will sometimes be installed, because some of them have no compat bound restricting them to Julia 1.9. I'm planning on fixing that.
But that should not happen if AMDGPU does ship Manifest.toml.

  • why is the artifact-parsing not parsing the ENV var

I think that may be a bug you've bumped into in deps/bindeps.jl.

@luraess
Contributor Author

luraess commented Oct 5, 2022

Thanks a lot for your insights and for your help 🙏!

  • It would be great to ship the appropriate artifact versions in the Manifest, and to get the Julia 1.9 compat fix added.

  • I'll see if I can find what goes wrong in deps/bindeps.jl

@pxl-th
Member

pxl-th commented Oct 5, 2022

No problem!

  • I'll see if I can find what goes wrong in deps/bindeps.jl

I think the easiest fix would be to get rid of @set_preferences! and @load_preference and read the env variable directly.
But it may be worth looking into why preferences are not working...

@jpsamaroo
Member

Preferences are used so that artifact usage can be configured globally or per environment, without having to use an env. var. With AMDGPU 0.4.3, because we removed the build step, the env. var is now read during precompilation, which only happens some of the time. If the user forgets to set the env. var consistently, this can cause confusing behavior, and it cannot be switched on and off as easily as with AMDGPU.enable_artifacts!(::Bool).

The ROCm dependency version issues are known, as @pxl-th points out. We might be able to use Pkg hooks to manually select the right set of packages, as in https://github.com/JuliaBinaryWrappers/LLD_jll.jl/blob/main/.pkg/select_artifacts.jl; someone just needs to wire this up.
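
The precompile-time read described above can be illustrated with a toy module. This is not AMDGPU.jl's code: the variable DEMO_DISABLE_ARTIFACTS and the module FrozenDemo are made up, and the module body running in a script stands in for precompilation of a real package.

```julia
# Illustration only: DEMO_DISABLE_ARTIFACTS is a made-up variable name.
ENV["DEMO_DISABLE_ARTIFACTS"] = "0"

module FrozenDemo
    # Evaluated once, when the module body runs. In a real package this is
    # precompile time, and the result is stored in the compile cache.
    const USE_ARTIFACTS = get(ENV, "DEMO_DISABLE_ARTIFACTS", "0") != "1"

    # Evaluated on every call, so it always reflects the current environment.
    use_artifacts_now() = get(ENV, "DEMO_DISABLE_ARTIFACTS", "0") != "1"
end

# The user changes the variable after the module was "compiled".
ENV["DEMO_DISABLE_ARTIFACTS"] = "1"

println(FrozenDemo.USE_ARTIFACTS)        # true: the stale value from load time
println(FrozenDemo.use_artifacts_now())  # false: the current value
```

The constant keeps whatever value the environment had when the module was compiled, which matches the "sometimes happens and sometimes not" behavior: the variable only takes effect when something triggers recompilation.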

@luraess
Contributor Author

luraess commented Oct 5, 2022

Thanks @jpsamaroo for the comments. Is the preference mechanism really needed, especially if it is not reliable? We could just replace it with

const use_artifacts = haskey(ENV, "JULIA_AMDGPU_DISABLE_ARTIFACTS") ? !parse(Bool, ENV["JULIA_AMDGPU_DISABLE_ARTIFACTS"]) : true

The ROCm dependency version issue would be nice to solve to allow smooth support for various still "recent" GPUs ;-)
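
The environment-variable fallback proposed above throws if the variable holds something parse(Bool, ...) rejects (for example "yes"). A slightly more defensive variant could look like the sketch below; the function name is hypothetical, and falling back to "use artifacts" on unparsable input is an assumption, not a decision made in this thread.

```julia
# Hypothetical helper: decide whether artifacts should be used, based on
# JULIA_AMDGPU_DISABLE_ARTIFACTS. `env` can be ENV or any dictionary,
# which makes the function easy to test.
function use_artifacts_from_env(env = ENV; key = "JULIA_AMDGPU_DISABLE_ARTIFACTS")
    raw = get(env, key, nothing)
    raw === nothing && return true       # variable not set: keep the default
    disabled = tryparse(Bool, raw)       # accepts "true"/"false" and "1"/"0"
    disabled === nothing && return true  # unparsable value: assumed default
    return !disabled
end
```

Note that reading the variable inside a function only helps if the function is called at run time; assigning its result to a top-level const would still freeze the value at precompile time.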

@luraess
Contributor Author

luraess commented Oct 5, 2022

@jpsamaroo shall one stick to using Preferences for this, or just parse the ENV var in the classical way as suggested above?

@pxl-th
Member

pxl-th commented Jul 7, 2023

We now do runtime discovery of the deps.
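
For readers wondering what runtime discovery can look like in general, here is a rough sketch using the Libdl standard library. It is not AMDGPU.jl's actual implementation: the helper names are hypothetical, and the ROCM_PATH variable and the /opt/rocm default are assumed conventions.

```julia
using Libdl

# Hypothetical helper, not AMDGPU.jl's actual implementation.
# Looks for a shared library by name in `dirs` first and then on the system
# search path. Returns the name/path that could be opened, or `nothing`.
function find_rocm_library(name::AbstractString, dirs::Vector{String} = String[])
    found = Libdl.find_library([name], dirs)  # "" when nothing could be loaded
    return isempty(found) ? nothing : found
end

# Candidate directories; ROCM_PATH and /opt/rocm are assumed conventions.
function rocm_lib_dirs(env = ENV)
    root = get(env, "ROCM_PATH", "/opt/rocm")
    return [joinpath(root, "lib"), joinpath(root, "lib64")]
end

# Usage, evaluated at run time (for example from a package's __init__):
#   libhsa = find_rocm_library("libhsa-runtime64", rocm_lib_dirs())
#   libhsa === nothing && @warn "HSA runtime not found"
```

Because the lookup runs on every session start rather than at precompile time, changes to the environment take effect without recompiling the package.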

pxl-th closed this as completed Jul 7, 2023
3 participants