Set publishNotReadyAddresses=true. Wait for DNS at startup #25
Changes:
See PR for:
What works:
How to test (clone, checkout first):
(only PR-4163 works with this build because of new config options for wait_for_dns).
Nodes are joined:
And logs are cleaner:
TODO:
New issue during updates: a slow task that waits 20 seconds sounds bad, and it looks like it is waiting to acquire a global lock:
when=2023-11-09T17:52:32.787892+00:00 level=warning what=long_task_progress pid=<0.552.0> at=cets_long:monitor_loop/5:106 time_ms=15019 caller_pid=<0.337.0> task=cets_wait_for_ready current_stacktrace="[{gen,do_call,4,[{file,\"gen.erl\"},{line,240}]},{gen_server,call,3,[{file,\"gen_server.erl\"},{line,397}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{mongoose_cluster_id,init_cache,1,[{file,\"/home/circleci/project/src/mongoose_cluster_id.erl\"},{line,108}]},{mongoose_cluster_id,start,0,[{file,\"/home/circleci/project/src/mongoose_cluster_id.erl\"},{line,28}]},{ejabberd_app,do_start,0,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,76}]},{ejabberd_app,start,2,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,44}]},{application_master,start_it_old,4,[{file,\"application_master.erl\"},{line,293}]}]"
when=2023-11-09T17:52:32.883835+00:00 level=error what=join_retry reason=lock_aborted pid=<0.584.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:52:32.952959+00:00 level=error what=join_retry reason=lock_aborted pid=<0.584.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:52:33.056021+00:00 level=error what=join_retry reason=lock_aborted pid=<0.584.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:52:33.178240+00:00 level=warning what=long_task_progress pid=<0.618.0> at=cets_long:monitor_loop/5:106 long_task_name=join local_pid=<0.550.0> time_ms=15018 caller_pid=<0.584.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id current_stacktrace="[{global,random_sleep,1,[{file,\"global.erl\"},{line,2925}]},{global,set_lock,4,[{file,\"global.erl\"},{line,435}]},{global,trans,4,[{file,\"global.erl\"},{line,474}]},{cets_join,join_loop,6,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,87}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{cets_join,join,5,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,51}]},{cets_discovery,do_join,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_discovery.erl\"},{line,394}]},{cets_discovery,'-try_joining/1-lc$^1/1-1-',5,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_discovery.erl\"},{line,301}]}]"
when=2023-11-09T17:52:37.791668+00:00 level=warning what=long_task_progress pid=<0.552.0> at=cets_long:monitor_loop/5:106 time_ms=20023 caller_pid=<0.337.0> task=cets_wait_for_ready current_stacktrace="[{gen,do_call,4,[{file,\"gen.erl\"},{line,240}]},{gen_server,call,3,[{file,\"gen_server.erl\"},{line,397}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{mongoose_cluster_id,init_cache,1,[{file,\"/home/circleci/project/src/mongoose_cluster_id.erl\"},{line,108}]},{mongoose_cluster_id,start,0,[{file,\"/home/circleci/project/src/mongoose_cluster_id.erl\"},{line,28}]},{ejabberd_app,do_start,0,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,76}]},{ejabberd_app,start,2,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,44}]},{application_master,start_it_old,4,[{file,\"application_master.erl\"},{line,293}]}]"
There is also this:
when=2023-11-09T17:52:28.079856+00:00 level=warning what=long_task_progress pid=<0.16038.0> at=cets_long:monitor_loop/5:106 local_pid=<0.548.0> time_ms=10014 caller_pid=<0.16037.0> remote_pid=<30979.550.0> remote_node=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local table=cets_cluster_id current_stacktrace="[{erpc,call,5,[{file,\"erpc.erl\"},{line,146}]},{rpc,call,5,[{file,\"rpc.erl\"},{line,401}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,213}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,check_could_reach_each_other,3,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,join2,4,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,109}]},{cets_join,'-handle_throw/1-fun-0-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,253}]}]"
when=2023-11-09T17:52:30.139278+00:00 level=warning what=long_task_progress pid=<0.16106.0> at=cets_long:monitor_loop/5:106 node2=mongooseim@mongooseim-3.mongooseim.default.svc.cluster.local node1=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local time_ms=5004 caller_pid=<0.16037.0> task=ping_node current_stacktrace="[{erpc,call,5,[{file,\"erpc.erl\"},{line,146}]},{rpc,call,5,[{file,\"rpc.erl\"},{line,401}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,213}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,check_could_reach_each_other,3,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,join2,4,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,109}]},{cets_join,'-handle_throw/1-fun-0-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,253}]}]"
when=2023-11-09T17:52:32.176731+00:00 level=error what=check_could_reach_each_other_failed pid=<0.16037.0> at=cets_join:check_could_reach_each_other/3:225 node_pairs_not_connected=[{'[email protected]','[email protected]',pang},{'[email protected]','[email protected]',pang},{'[email protected]','[email protected]',pang}] local_pid=<0.548.0> remote_pid=<30979.550.0> remote_node=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local table=cets_cluster_id
I would assume that check_could_reach_each_other tries to make an RPC call to a node that was just killed during the update, and then blocks while waiting for a new connection to be established.
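
If that assumption holds, one way to keep a reachability check from stalling on a node that has just gone away is to bound the wait on the caller's side. The following is only a minimal sketch and is not taken from cets: the module name bounded_ping and both functions are hypothetical, and the only OTP calls used are net_adm:ping/1, spawn_monitor/1 and exit/2.

```erlang
-module(bounded_ping).
-export([ping/2, run_bounded/3]).

%% Hypothetical helper, not part of cets: ping Node, but report pang
%% if no answer arrives within TimeoutMs milliseconds.
-spec ping(node(), non_neg_integer()) -> pong | pang.
ping(Node, TimeoutMs) ->
    run_bounded(fun() -> net_adm:ping(Node) end, TimeoutMs, pang).

%% Run Fun in a monitored helper process and wait at most TimeoutMs
%% for its result. Return Default on timeout or if the helper crashes.
-spec run_bounded(fun(() -> T), non_neg_integer(), T) -> T.
run_bounded(Fun, TimeoutMs, Default) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MonRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} ->
            erlang:demonitor(MonRef, [flush]),
            Result;
        {'DOWN', MonRef, process, Pid, _Reason} ->
            %% The helper died without delivering a result.
            Default
    after TimeoutMs ->
        %% Stop waiting: kill the helper and wait until it is gone.
        exit(Pid, kill),
        receive {'DOWN', MonRef, process, Pid, _} -> ok end,
        %% Drop a result that may have raced with the timeout.
        receive {Ref, _Late} -> ok after 0 -> ok end,
        Default
    end.
```

With this sketch, bounded_ping:ping(Node, 2000) returns pang after at most about two seconds, even if the connection attempt itself is still pending. Killing the helper only unblocks the caller; it does not cancel the connection setup that net_kernel has already started.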