Multithreaded simulations freeze sometimes #683

Open
efaulhaber opened this issue Jun 29, 2021 · 13 comments
Labels
bug, upstream

Comments

@efaulhaber
Member

I've had multithreaded simulations freeze multiple times lately. Interrupting the simulation works immediately, though.
The interruption error always looks like this:

ERROR: LoadError: InterruptException:
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base .\task.jl:710
  [2] wait
    @ .\task.jl:769 [inlined]
  [3] yield()
    @ Base .\task.jl:662
  [4] wait
    @ C:\Users\Erik\.julia\packages\ThreadingUtilities\pkz6e\src\threadtasks.jl:62 [inlined]
  [5] wait
    @ C:\Users\Erik\.julia\packages\ThreadingUtilities\pkz6e\src\threadtasks.jl:57 [inlined]
  [6] macro expansion
    @ C:\Users\Erik\.julia\packages\Polyester\0DPCU\src\batch.jl:89 [inlined]
  [7] _batch_no_reserve
    @ C:\Users\Erik\.julia\packages\Polyester\0DPCU\src\batch.jl:53 [inlined]
  [8] batch
    @ C:\Users\Erik\.julia\packages\Polyester\0DPCU\src\batch.jl:195 [inlined]
  [9] macro expansion
    @ C:\Users\Erik\.julia\packages\Polyester\0DPCU\src\closure.jl:164 [inlined]
 [10] macro expansion
    @ c:\Users\Erik\git\Trixi.jl\src\auxiliary\auxiliary.jl:181 [inlined]
 [11] calc_volume_integral!(du::StrideArraysCore.PtrArray{Tuple{Static.StaticInt{5}, Static.StaticInt{4}, Static.StaticInt{4}, Static.StaticInt{4}, Int64}, (true, true, true, true, true), Float64, 5, 1, 0, (1, 2, 3, 4, 5), Tuple{Static.StaticInt{8}, Static.StaticInt{40}, Static.StaticInt{160}, Static.StaticInt{640}, Static.StaticInt{2560}}, NTuple{5, Static.StaticInt{1}}}, u::StrideArraysCore.PtrArray{Tuple{Static.StaticInt{5}, Static.StaticInt{4}, Static.StaticInt{4}, Static.StaticInt{4}, Int64}, (true, true, true, true, true), Float64, 5, 1, 0, (1, 2, 3, 4, 5), Tuple{Static.StaticInt{8}, Static.StaticInt{40}, Static.StaticInt{160}, Static.StaticInt{640}, Static.StaticInt{2560}}, NTuple{5, Static.StaticInt{1}}}, mesh::P4estMesh{3, Float64, Ptr{P4est.LibP4est.p8est}, 5, 4}, nonconservative_terms::Val{false}, equations::CompressibleEulerEquations3D{Float64}, volume_integral::VolumeIntegralShockCapturingHG{typeof(flux_chandrashekar), FluxLaxFriedrichs{typeof(max_abs_speed_naive)}, IndicatorHennemannGassner{Float64, typeof(density_pressure), NamedTuple{(:alpha, :alpha_tmp, :indicator_threaded, :modal_threaded, :modal_tmp1_threaded, :modal_tmp2_threaded), Tuple{Vector{Float64}, Vector{Float64}, Vector{Array{Float64, 3}}, Vector{Array{Float64, 3}}, Vector{Array{Float64, 3}}, Vector{Array{Float64, 3}}}}}}, dg::DGSEM{LobattoLegendreBasis{Float64, 4, SVector{4, Float64}, 
Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, Trixi.LobattoLegendreMortarL2{Float64, 4, Matrix{Float64}, Matrix{Float64}}, SurfaceIntegralWeakForm{FluxLaxFriedrichs{typeof(max_abs_speed_naive)}}, VolumeIntegralShockCapturingHG{typeof(flux_chandrashekar), FluxLaxFriedrichs{typeof(max_abs_speed_naive)}, IndicatorHennemannGassner{Float64, typeof(density_pressure), NamedTuple{(:alpha, :alpha_tmp, :indicator_threaded, :modal_threaded, :modal_tmp1_threaded, :modal_tmp2_threaded), Tuple{Vector{Float64}, Vector{Float64}, Vector{Array{Float64, 3}}, Vector{Array{Float64, 3}}, Vector{Array{Float64, 3}}, Vector{Array{Float64, 3}}}}}}}, cache::NamedTuple{(:elements, :interfaces, :boundaries, :mortars, :element_ids_dg, :element_ids_dgfv, :fstar1_threaded, :fstar2_threaded, :fstar3_threaded, :fstar_threaded, :fstar_tmp_threaded, :u_threaded), Tuple{Trixi.P4estElementContainer{3, Float64, Float64, 4, 5, 6}, Trixi.P4estInterfaceContainer{3, Float64, 5}, Trixi.P4estBoundaryContainer{3, Float64, 4}, Trixi.P4estMortarContainer{3, Float64, 4, 6}, Vector{Int64}, Vector{Int64}, Vector{Array{Float64, 4}}, Vector{Array{Float64, 4}}, Vector{Array{Float64, 4}}, Vector{Array{Float64, 4}}, Vector{Array{Float64, 3}}, Vector{Array{Float64, 3}}}})
    @ Trixi c:\Users\Erik\git\Trixi.jl\src\solvers\dgsem_tree\dg_3d.jl:434

It's always the @threaded block of whichever volume integral is in use.

Has this happened to anyone else before? Is there a way to debug stuff like this?

@efaulhaber added the bug label Jun 29, 2021
@jlchan
Contributor

jlchan commented Jun 29, 2021

I have the same issue intermittently (most recently related to JuliaLinearAlgebra/Octavian.jl#103).

I've also noticed this with @batch in my other solver codes. Sometimes a "reset" helps:
https://github.com/jlchan/ESDG.jl/blob/318bb4b6739f1393da92800b74291024917d684f/src/misc_utils.jl#L14-L22

Polyester.reset_workers!()
ThreadingUtilities.reinitialize_tasks!()
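One way to make the reset less manual is to wrap the solver call so that the reset runs whenever the call errors or is interrupted. The following is only a sketch under assumptions: `with_reset` is a hypothetical helper, not part of Polyester or ThreadingUtilities, and whether the reset actually restores threading after a failure is not confirmed in this thread.

```julia
# Hypothetical helper (not part of Polyester or ThreadingUtilities):
# run `f`; if it throws (including an InterruptException from Ctrl-C),
# call `reset` and then rethrow the original error.
function with_reset(f, reset)
    try
        return f()
    catch
        reset()
        rethrow()
    end
end

# Assumed usage with the two calls quoted above (requires both packages):
# reset_threading() = (Polyester.reset_workers!(); ThreadingUtilities.reinitialize_tasks!())
# with_reset(reset_threading) do
#     run_simulation()  # placeholder for the actual solver call
# end
```

The helper only uses Base, so the reset function is passed in rather than hard-coded; that keeps the sketch usable with whatever reset calls turn out to be appropriate.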

@efaulhaber
Member Author

How do I do a reset in a frozen simulation?

@jlchan
Contributor

jlchan commented Jun 29, 2021

If the simulation hangs, I have to interrupt it (Ctrl-C or Command-C). Sometimes the simulation freezes completely, in which case restarting the REPL is the only fix I've found.

@efaulhaber
Member Author

After interrupting, I can just start the simulation again (not sure if it then hangs somewhere else).
Also, you mentioned in JuliaLinearAlgebra/Octavian.jl#103 that this only happens to you when running code from VSCode. It has happened to me in a normal REPL as well.

@jlchan
Contributor

jlchan commented Jun 29, 2021

Yeah, the Octavian issue is specific to using Polyester, Octavian, and VSCode. I've had @batch freeze when running some of my other solvers through a regular REPL session too.

@jlchan
Contributor

jlchan commented Jun 29, 2021

I've found that the hanging is sometimes related to error handling, e.g., complex square root argument errors.

@efaulhaber
Member Author

Yes! This seems to be exactly what's happening here. My simulations only freeze at the exact time step where, in the runs that don't freeze, sqrt throws a DomainError.
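If the hang is triggered by an exception thrown inside a threaded loop, one possible workaround is to avoid throwing inside the loop at all: record the failure, finish the loop, and raise the error afterwards from the calling task. The sketch below uses only Base's `Threads.@threads` rather than Polyester's `@batch` or Trixi's `@threaded`, and `checked_sqrt!` is a hypothetical name chosen for illustration.

```julia
using Base.Threads

# Hypothetical helper: compute sqrt elementwise in a threaded loop without
# throwing inside the loop. A negative input is recorded and reported only
# after all iterations have finished.
function checked_sqrt!(out, values)
    failed = Atomic{Int}(0)  # index of one failing entry, 0 if none
    @threads for i in eachindex(values)
        v = values[i]
        if v < 0
            atomic_cas!(failed, 0, i)  # keep whichever failing index is recorded first
            out[i] = NaN
        else
            out[i] = sqrt(v)
        end
    end
    # Throw from the calling task, outside the threaded region.
    if failed[] != 0
        throw(DomainError(values[failed[]],
                          "negative sqrt argument at index $(failed[])"))
    end
    return out
end
```

The same pattern should carry over to other threading macros, since the loop body itself never throws, but that has not been verified against the freeze described here.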

@jlchan
Contributor

jlchan commented Jun 29, 2021

Perhaps we should raise an issue in Polyester.jl, especially if we can put together an MWE?

@efaulhaber
Member Author

Alright, I found the problem and created two issues. See JuliaSIMD/Polyester.jl#30 and JuliaSIMD/Polyester.jl#31.

@ranocha
Member

ranocha commented Jul 13, 2021

Does this still happen for you? Or is it resolved (maybe by #708) on main?

@efaulhaber
Member Author

It now shows the error instead of freezing, but after that it only runs serially for the rest of the session.

@ranocha
Member

ranocha commented Jul 14, 2021

Thanks for testing it again.

@efaulhaber
Member Author

I take it back, it's still freezing sometimes.


3 participants