Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Something invalidating Enzyme.__init__ causes a heavy package load time penalty #776

Open
KristofferC opened this issue Apr 24, 2023 · 19 comments · May be fixed by #1305
Open

Something invalidating Enzyme.__init__ causes a heavy package load time penalty #776

KristofferC opened this issue Apr 24, 2023 · 19 comments · May be fixed by #1305

Comments

@KristofferC
Copy link

When loading OmniPackage.jl(https://github.com/JuliaComputing/OmniPackage.jl/) a big part of its load time (around 10s) is spent compiling and running the __init__ method in Enzyme, profile available at https://kristofferc.github.io/tracy-traces/omnipackage/.

Ideally, Enzyme would somehow protect itself against this, by either making the code in __init__ less vulnerable to invalidations or some other way (maybe using a frozen world age).

cc @vchuravy

@wsmoses
Copy link
Member

wsmoses commented Jun 3, 2023

This should now be resolved by #881

@wsmoses wsmoses closed this as completed Jun 3, 2023
@vchuravy
Copy link
Member

vchuravy commented Jun 4, 2023

I am not sure that fixed all instances. Need to confirm with an actual profile

@KristofferC
Copy link
Author

I just want to point out that this is still happening and causes ~18s load time increase to any package that recursively ends up getting Enzyme as a dependency.

@wsmoses
Copy link
Member

wsmoses commented Nov 16, 2023

hm @KristofferC do you have any information regarding what could be causing this?

@KristofferC
Copy link
Author

  • Someone adds a method that invalidates some function called in Enzyme.__init__.
  • Enzyme.__init__ in invalidated and has to be recompiled at load time.
  • Compiling Enzyme.__init__ takes about 18 seconds.

@wsmoses
Copy link
Member

wsmoses commented Nov 16, 2023

is there a tool or something to profile what methods/invalidations are the source?

@KristofferC
Copy link
Author

One would usually use SnoopCompile. I can try (later) to see if I can find what actually ends up invalidating it.

@wsmoses
Copy link
Member

wsmoses commented Nov 16, 2023

I will note that the Enzyme Blas init loader presently is expected to have a long load time and I'm not sure of how to fix that.

Specifically here:

function __init__()

We need the runtime address of blas and julia runtime libraries to be able to identify what function is being ccall'ed (in IR it is an integer). This is definitionally runtime information so it couldn't be preserved. If you have ideas for a design that avoid the expense, I'm all ears.

If it's somewhere else causing the issue definitely interested in what's causing it so we can fix it.

@KristofferC
Copy link
Author

KristofferC commented Nov 17, 2023

julia> @time using Enzyme
  0.383814 seconds (566.30 k allocations: 33.282 MiB, 21.24% compilation time: 27% of which was recompilation)

But:

julia> @time using Tracker
1.450969 seconds (1.83 M allocations: 109.970 MiB, 7.33% gc time, 31.56% compilation time: 26% of which was recompilation)

julia> @time using Enzyme
 21.592203 seconds (43.97 M allocations: 2.517 GiB, 1.84% gc time, 98.84% compilation time: 100% of which was recompilation)

is some sort of MWE. Will narrow it down more.

@KristofferC
Copy link
Author

Okej, Tracker committed the mortal sin of defining show on a Type. With FluxML/Tracker.jl#156 this is improved to

julia> @time using Tracker
  1.463012 seconds (1.90 M allocations: 113.004 MiB, 4.50% gc time, 25.19% compilation time: 16% of which was recompilation)

julia> @time using Enzyme
  0.590007 seconds (589.92 k allocations: 33.254 MiB, 14.00% gc time, 56.16% compilation time: 99% of which was recompilation)

Anyway, as a question, is there anyway this work can be moved from __init__ to until when it is actually needed? So you pay the cost if you use it but otherwise not?

@staticfloat
Copy link

It seems to me that that __init__() is just setting up ptr_map which is only used by restore_lookup, which could just do an initialization check, and if ptr_map is not initialized, it runs that setup code.

@KristofferC
Copy link
Author

From some testing, it seems the __init__ that takes a long time to compile is

Enzyme.jl/src/compiler.jl

Lines 2425 to 2454 in 019258a

function __init__()
API.EnzymeSetHandler(@cfunction(julia_error, LLVM.API.LLVMValueRef, (Cstring, LLVM.API.LLVMValueRef, API.ErrorType, Ptr{Cvoid}, LLVM.API.LLVMValueRef, LLVM.API.LLVMBuilderRef)))
API.EnzymeSetSanitizeDerivatives(@cfunction(julia_sanitize, LLVM.API.LLVMValueRef, (LLVM.API.LLVMValueRef, LLVM.API.LLVMValueRef, LLVM.API.LLVMBuilderRef, LLVM.API.LLVMValueRef)));
if API.EnzymeHasCustomInactiveSupport()
API.EnzymeSetRuntimeInactiveError(@cfunction(emit_inacterror, Cvoid, (LLVM.API.LLVMBuilderRef, LLVM.API.LLVMValueRef, LLVM.API.LLVMValueRef)))
end
if API.EnzymeHasCustomAllocatorSupport()
API.EnzymeSetDefaultTapeType(@cfunction(
julia_default_tape_type, LLVM.API.LLVMTypeRef, (LLVM.API.LLVMContextRef,)))
API.EnzymeSetCustomAllocator(@cfunction(
julia_allocator, LLVM.API.LLVMValueRef,
(LLVM.API.LLVMBuilderRef, LLVM.API.LLVMTypeRef, LLVM.API.LLVMValueRef, LLVM.API.LLVMValueRef, UInt8, Ptr{LLVM.API.LLVMValueRef})))
API.EnzymeSetCustomDeallocator(@cfunction(
julia_deallocator, LLVM.API.LLVMValueRef, (LLVM.API.LLVMBuilderRef, LLVM.API.LLVMValueRef)))
API.EnzymeSetPostCacheStore(@cfunction(
julia_post_cache_store, Ptr{LLVM.API.LLVMValueRef},
(LLVM.API.LLVMValueRef, LLVM.API.LLVMBuilderRef, Ptr{UInt64})))
API.EnzymeSetCustomZero(@cfunction(
zero_allocation, Cvoid,
(LLVM.API.LLVMBuilderRef, LLVM.API.LLVMTypeRef, LLVM.API.LLVMValueRef, UInt8)))
API.EnzymeSetFixupReturn(@cfunction(
fixup_return, LLVM.API.LLVMValueRef,
(LLVM.API.LLVMBuilderRef, LLVM.API.LLVMValueRef)))
end
API.EnzymeSetUndefinedValueForType(@cfunction(
julia_undef_value_for_type, LLVM.API.LLVMValueRef, (LLVM.API.LLVMTypeRef,UInt8)))
register_alloc_rules()
register_llvm_rules()
end

@wsmoses
Copy link
Member

wsmoses commented Nov 19, 2023

Okay if that init takes longer to compile I am a bit more unclear how to resolve it.

So some context, this code interfaces between Enzyme proper (aka Enzyme_jll) and Enzyme.jl to registister all of the julia specific handling. As a result this actually is most of Enzyme.jl's code, so this is highly desirable to be precompiled.

this is done by taking many functions in Julia code, making a cfunction of it, then passing the Julia cfunction runtime pointer into Enzyme_jll. While the actual call to pass the pointer (and the pointer itself) are runtime specific values, the actual compilation of those cfunctions on those specified types is fixed, and should be in theory able to be precompiled (but I presume not in practice here).

Those cfunctions are wrapped in a macro (see here:

macro augfunc(f)
) which creates an anonymous wrapper function to make it easier, which perhaps makes things more difficult.

Regardless, I'm not sure what the state of julia internals for precompilation are here, but that's what I can see from our end.

@KristofferC
Copy link
Author

Running the register_llvm_rules() in a module where optimizations are turned off more or less fixes the bad behavior:

KristofferC@4246ddb#diff-21145f7299f4174cfb41f7c0c845a0ee798fc065bae0a352171552dc4b810508R1033

However, I have no idea what impact this has on runtime performance so that would have to be investigated. Might not be a good idea at all.

@wsmoses
Copy link
Member

wsmoses commented Nov 20, 2023 via email

@KristofferC
Copy link
Author

Is there any other way we can cut down on load time here?

I guess these are the possibilities:

  • Move things from load time to "on demand time" (compute things when they are actually needed).
  • Make the code in Enzyme easier to compile / move more of the Enzyme __init__ to precompilation.
  • Improve the Julia compiler.

@vchuravy
Copy link
Member

Yeah disabling optimizations seems like a bad idea.

I think disabling optimizations for the registration functions should be fine.
But I wonder if it impacts the callbacks, if it is only the trampoline that should be fine.

@KristofferC
Copy link
Author

But I wonder if it impacts the callbacks, if it is only the trampoline that should be fine.

That's what I was worried about as well. I think the compilation time was in fact partly due to inferring the callbacks so my change making things faster probably means the callbacks also stopped getting inferred...

@vchuravy
Copy link
Member

Concretly we are spending ~30s on compiling init

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants