
[BUG] segfault in freeswitch/load_balancer module after connection loss #3468

Closed
spacetourist opened this issue Sep 12, 2024 · 3 comments


@spacetourist
Contributor

OpenSIPS version you are running

version: opensips 3.2.10 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on 15:42:44 Dec 20 2022 with gcc 4.8.5

Describe the bug

In my setup the load_balancer module reads a list of FreeSWITCH destinations from the database; each instance sends a HEARTBEAT event every second, according to this config:

modparam("freeswitch", "event_heartbeat_interval", 1)
modparam("load_balancer", "db_table", "load_balancer_fr")
modparam("load_balancer", "probing_interval", 1)
modparam("load_balancer", "fetch_freeswitch_stats", 1)
modparam("load_balancer", "initial_freeswitch_load", 5000)

The initial fault is that when the connection to the FS server is severed abruptly, we do not send any TCP keep-alives or heartbeats and we do not reconnect when the server comes back online. In this state the data is stale but the destination is still active, so once the server is back online calls will be delivered to it; without the heartbeat data the destination's load is not counted, resulting in too many calls being delivered to that instance.

The segfault occurs when using opensips-cli to attempt to restore the connection. Issuing mi lb_reload does nothing, as OpenSIPS continues to believe the connection is OK. In an effort to clear this I remove the DB record for the impacted destination and run mi lb_reload, then reinstate the DB record and issue mi lb_reload again. This causes OpenSIPS to crash with the following output:

2024-09-10T09:12:42.007871+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:handle_reconnects: failed to connect to FS sock '192.168.151.229:8021'
2024-09-10T09:12:42.008202+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:io_watch_del: [FS Manager] trying to delete already erased entry 0 in the hash(0, 0, (nil)) )
2024-09-10T09:12:42.008454+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:destroy_fs_evs: del failed for sock 0
2024-09-10T09:12:42.008652+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:destroy_fs_evs: disconnect error 1 on FS sock 192.168.151.229:8021
2024-09-10T09:12:42.008950+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:handle_io: failed to destroy FS evs!
2024-09-10T09:12:42.009161+01:00 FR-P-SIPSBC-1 opensips[40744]: CRITICAL:core:sig_usr: segfault in process pid: 40744, id: 1
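Spelled out, the sequence that leads to the crash is roughly the following (a sketch: the SQL assumes the stock load_balancer column names, the DB name is illustrative, and the values match the row shown earlier):

# the module still reports the destination as usable
opensips-cli -x mi lb_list

# drop the impacted destination and reload
mysql opensips -e "DELETE FROM load_balancer_fr WHERE dst_uri LIKE '%192.168.151.229%';"
opensips-cli -x mi lb_reload

# reinstate the row and reload again -> OpenSIPS segfaults here
mysql opensips -e "INSERT INTO load_balancer_fr (group_id, dst_uri, resources, probe_mode, description) VALUES (1, 'sip:192.168.151.229:5060', 'channels=fs://:ClueCon@192.168.151.229:8021', 2, 'FS node 1');"
opensips-cli -x mi lb_reload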

To Reproduce

This can be reproduced simply by:

  • Set up OpenSIPS load_balancer with a FreeSWITCH destination and a 1s HEARTBEAT interval
  • Use tcpdump to verify that heartbeats arrive every second (see the capture command after this list)
  • Power off the FS server and then start it back up
  • Use tcpdump to verify that heartbeats do not resume when the server comes back online
  • Remove the destination from the load_balancer table and issue mi lb_reload
  • Add the destination back in and issue mi lb_reload
  • Observe segfault
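For the heartbeat checks, a capture along these lines is sufficient (8021 being the ESL port seen in the log above):

tcpdump -i any -nn 'tcp port 8021'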

Analysis

When a remote FS server restarts gracefully, OpenSIPS receives a TCP FIN and issues SYN packets every second until the server comes back online, at which point the ESL connection is automatically re-established.

When the FS server is instead halted abruptly (power off), it never sends the FIN and OpenSIPS never starts its SYN polling, so when the FS server comes back online it is not detected and OpenSIPS does not reconnect. In this state the load_balancer module is left with stale data for the impacted host. As the load balancer has not been told to disable the destination, calls continue to be allocated to that instance throughout this period. While the server is offline these fail over gracefully to another instance, but once the server is back online it starts to receive calls while never sending heartbeat data, causing the instance to become overloaded (I have multiple OpenSIPS instances feeding into the same pool of FreeSWITCH servers).
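While in this state it is easy to confirm that OpenSIPS is still holding the dead, half-open ESL socket, e.g. with something like:

ss -tnp 'dport = :8021'

The connection keeps showing as ESTABLISHED towards 192.168.151.229:8021 even though the server was power-cycled, which matches the module never noticing the loss.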

Expected behavior

The lb_reload operation should handle the reconnection process gracefully; stale connections should be detected and replaced without crashing.

The system should detect a stale destination which has failed to send an ESL HEARTBEAT and either disable the host or attempt reconnection until it returns.

Additionally, HEARTBEAT arrival is not currently tracked or exposed via opensips-cli, so I have no way to implement effective monitoring of this scenario; exposing this data would be really helpful.
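As a stop-gap I can poll the module state with something like the command below, but as far as I can tell its output only reflects the current load counters, not when the last HEARTBEAT arrived, so a silent stall like the one above stays invisible:

opensips-cli -x mi lb_list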

OS/environment information

  • Operating System: AlmaLinux 9
  • OpenSIPS installation: Manual packages
  • other relevant information:
@spacetourist spacetourist changed the title [BUG] segault in freeswitch/load_balancer module after connection loss [BUG] segfault in freeswitch/load_balancer module after connection loss Sep 12, 2024
@bogdan-iancu
Member

@spacetourist, the 3.2 version is not maintained anymore (see here). If you are considering moving to 3.4, please also upload a full backtrace of the crash.
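Something along these lines is enough, assuming core dumps are enabled on the host and the debug symbols are installed (the binary and core paths may differ on your install):

gdb /usr/sbin/opensips /path/to/corefile
(gdb) bt full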


Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Oct 10, 2024

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
