
Set publishNotReadyAddresses=true. Wait for DNS at startup #25

Open · wants to merge 2 commits into master
Conversation

arcusfelis (Contributor) commented Nov 9, 2023

Changes:

  • Set publishNotReadyAddresses=true (see the sketch after this list).
  • Wait for the node name to be resolvable via DNS at startup, controlled by a config option.
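
For reference, publishNotReadyAddresses is a standard field on the Kubernetes Service spec: on a headless Service it makes DNS publish records for pods before they pass readiness probes, so the nodes can resolve each other while they are still starting up. A minimal sketch of such a Service (the name, labels and port below are illustrative, not the chart's actual template):

```yaml
# Illustrative headless Service; name, labels and port are assumptions, not the chart's template.
apiVersion: v1
kind: Service
metadata:
  name: mongooseim
spec:
  clusterIP: None                  # headless: DNS returns pod addresses directly
  publishNotReadyAddresses: true   # publish DNS records even for not-yet-ready pods
  selector:
    app: mongooseim
  ports:
    - name: epmd
      port: 4369                   # Erlang Port Mapper Daemon, used for distribution
```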

See PR for:

What works:

  • Initial startup works fine: the nxdomain error is gone, and the "requested to disconnect" error is gone.

How to test (clone the repo and check out this branch first):

```
helm install test-mim MongooseIM --set replicaCount=10 --set image.tag=PR-4163 --set persistentDatabase=rdbms --set rdbms.username=ejabberd --set rdbms.database=ejabberd --set volatileDatabase=cets
```

(Only the PR-4163 image tag works with this chart change, because it includes the new wait_for_dns config option.)
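
Equivalently, the same overrides can be kept in a values file and passed with -f instead of the --set flags; the keys below simply mirror the flags from the command above (assuming the chart reads them under these exact names):

```yaml
# test-values.yaml - mirrors the --set flags from the helm install command above
replicaCount: 10
image:
  tag: PR-4163
persistentDatabase: rdbms
rdbms:
  username: ejabberd
  database: ejabberd
volatileDatabase: cets
```

Then install with: helm install test-mim MongooseIM -f test-values.yaml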
Check that the nodes are joined:

```
kubectl exec -it mongooseim-0 -- mongooseimctl cets systemInfo
```

And the logs are cleaner:

```
kubectl logs mongooseim-0
```

TODO:

  • We still see this issue in the logs:
when=2023-11-09T17:48:58.555563+00:00 level=error what=join_retry reason=lock_aborted pid=<0.17788.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<32213.552.0> remote_node=mongooseim@mongooseim-3.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:48:58.632915+00:00 level=error what=join_retry reason=lock_aborted pid=<0.17788.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<32213.552.0> remote_node=mongooseim@mongooseim-3.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:48:58.743198+00:00 level=error what=join_retry reason=lock_aborted pid=<0.17788.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<32213.552.0> remote_node=mongooseim@mongooseim-3.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:48:58.988527+00:00 level=error what=join_retry reason=lock_aborted pid=<0.17788.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<32213.552.0> remote_node=mongooseim@mongooseim-3.mongooseim.default.svc.cluster.local table=cets_cluster_id
  • More testing is needed for cluster updates.
  • Actual tests in CETS for the DNS fixes (i.e. verify that wait_for_dns works in automated tests, not just in real life with manual testing).

New issue during updates (a slow task with a 20-second wait sounds bad; it looks like it is waiting to acquire a global lock):

when=2023-11-09T17:52:32.787892+00:00 level=warning what=long_task_progress pid=<0.552.0> at=cets_long:monitor_loop/5:106 time_ms=15019 caller_pid=<0.337.0> task=cets_wait_for_ready current_stacktrace="[{gen,do_call,4,[{file,\"gen.erl\"},{line,240}]},{gen_server,call,3,[{file,\"gen_server.erl\"},{line,397}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{mongoose_cluster_id,init_cache,1,[{file,\"/home/circleci/project/src/mongoose_clust er_id.erl\"},{line,108}]},{mongoose_cluster_id,start,0,[{file,\"/home/circleci/project/src/mongoose_cluster_id.erl\"},{line,28}]},{ejabberd_app,do_start,0,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,76}]},{ejabberd_app,start,2,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,44}]},{application_master,start_it_old,4,[{file,\"application_master.erl\"},{line,293}]}]"
when=2023-11-09T17:52:32.883835+00:00 level=error what=join_retry reason=lock_aborted pid=<0.584.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:52:32.952959+00:00 level=error what=join_retry reason=lock_aborted pid=<0.584.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:52:33.056021+00:00 level=error what=join_retry reason=lock_aborted pid=<0.584.0> at=cets_join:join_loop/6:90 local_pid=<0.550.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-11-09T17:52:33.178240+00:00 level=warning what=long_task_progress pid=<0.618.0> at=cets_long:monitor_loop/5:106 long_task_name=join local_pid=<0.550.0> time_ms=15018 caller_pid=<0.584.0> remote_pid=<30975.548.0> remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local table=cets_cluster_id current_stacktrace="[{global,random_sleep,1,[{file,\"global.erl\"},{line,2925}]},{global,set_lock,4,[{file,\"global.erl\"},{line,435}]},{global,trans,4,[{file,\"global.erl\"},{line,474}]},{ce ts_join,join_loop,6,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,87}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{cets_join,join,5,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,51}]},{cets_discovery,do_join,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_discovery.erl\"},{line,394}]},{cets_discovery,'-try_joining/1-lc$^1/1-1-',5,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_discovery.erl\"},{line,301}]}]"
when=2023-11-09T17:52:37.791668+00:00 level=warning what=long_task_progress pid=<0.552.0> at=cets_long:monitor_loop/5:106 time_ms=20023 caller_pid=<0.337.0> task=cets_wait_for_ready current_stacktrace="[{gen,do_call,4,[{file,\"gen.erl\"},{line,240}]},{gen_server,call,3,[{file,\"gen_server.erl\"},{line,397}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{mongoose_cluster_id,init_cache,1,[{file,\"/home/circleci/project/src/mongoose_cluster_id.erl\"},{line,108}]},{mongoose_cluster_id,start,0,[{file,\"/home/circleci/project/src/mongoose_cluster_id.erl\"},{line,28}]},{ejabberd_app,do_start,0,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,76}]},{ejabberd_app,start,2,[{file,\"/home/circleci/project/src/ejabberd_app.erl\"},{line,44}]},{application_master,start_it_old,4,[{file,\"application_master.erl\"},{line,293}]}]"

There is also this:

when=2023-11-09T17:52:28.079856+00:00 level=warning what=long_task_progress pid=<0.16038.0> at=cets_long:monitor_loop/5:106 local_pid=<0.548.0> time_ms=10014 caller_pid=<0.16037.0> remote_pid=<30979.550.0> remote_node=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local table=cets_cluster_id current_stacktrace="[{erpc,call,5,[{file,\"erpc.erl\"},{line,146}]},{rpc,call,5,[{file,\"rpc.erl\"},{line,401}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,213}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,check_could_reach_each_other,3,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,join2,4,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,109}]},{cets_join,'-handle_throw/1-fun-0-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,253}]}]"
when=2023-11-09T17:52:30.139278+00:00 level=warning what=long_task_progress pid=<0.16106.0> at=cets_long:monitor_loop/5:106 node2=mongooseim@mongooseim-3.mongooseim.default.svc.cluster.local node1=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local time_ms=5004 caller_pid=<0.16037.0> task=ping_node current_stacktrace="[{erpc,call,5,[{file,\"erpc.erl\"},{line,146}]},{rpc,call,5,[{file,\"rpc.erl\"},{line,401}]},{cets_long,run_tracked,2,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_long.erl\"},{line,54}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,213}]},{cets_join,'-check_could_reach_each_other/3-lc$^4/1-2-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,check_could_reach_each_other,3,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,218}]},{cets_join,join2,4,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,109}]},{cets_join,'-handle_throw/1-fun-0-',1,[{file,\"/home/circleci/project/_build/default/lib/cets/src/cets_join.erl\"},{line,253}]}]"
when=2023-11-09T17:52:32.176731+00:00 level=error what=check_could_reach_each_other_failed pid=<0.16037.0> at=cets_join:check_could_reach_each_other/3:225 node_pairs_not_connected=[{'[email protected]','[email protected]',pang},{'[email protected]','[email protected]',pang},{'[email protected]','[email protected]',pang}] local_pid=<0.548.0> remote_pid=<30979.550.0> remote_node=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local table=cets_cluster_id
I would assume check_could_reach_each_other tries to make an RPC call to a node that was just killed during the update and gets blocked waiting for a new connection to be established.
