In my setup the load balancer module has a list of FreeSWITCH destinations in the database. Each instance sends a HEARTBEAT every second according to this config:
The initial fault is that when the connection to the FS server is severed abruptly, OpenSIPS sends no TCP keepalives or heartbeats and does not reconnect when the server comes back online. In this state the data is stale but the destination is still active, so once the server is back online calls are delivered to it. Without the heartbeat data the destination's load is not counted, resulting in too many calls being delivered to the instance.
The segfault occurs when using opensips-cli to attempt to restore the connection. Issuing mi lb_reload does nothing, as OpenSIPS continues to believe the connection is OK. In an effort to clear this, I remove the DB record for the impacted destination and issue mi lb_reload, then reinstate the DB record and issue mi lb_reload again. This causes OpenSIPS to crash with the following output:
2024-09-10T09:12:42.007871+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:handle_reconnects: failed to connect to FS sock '192.168.151.229:8021'
2024-09-10T09:12:42.008202+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:io_watch_del: [FS Manager] trying to delete already erased entry 0 in the hash(0, 0, (nil)) )
2024-09-10T09:12:42.008454+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:destroy_fs_evs: del failed for sock 0
2024-09-10T09:12:42.008652+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:destroy_fs_evs: disconnect error 1 on FS sock 192.168.151.229:8021
2024-09-10T09:12:42.008950+01:00 FR-P-SIPSBC-1 opensips[40744]: ERROR:freeswitch:handle_io: failed to destroy FS evs!
2024-09-10T09:12:42.009161+01:00 FR-P-SIPSBC-1 opensips[40744]: CRITICAL:core:sig_usr: segfault in process pid: 40744, id: 1
To Reproduce
This can be reproduced simply by:
Set up OpenSIPS load_balancer with a FreeSWITCH destination and a 1s HEARTBEAT
Use tcpdump to verify that heartbeats arrive every second
Power off the FS server and then start it back up
Use tcpdump to verify that heartbeats do not resume when the server comes back online
Remove the destination from the load_balancer table and issue mi lb_reload
Add the destination back in and issue mi lb_reload
Observe segfault
Analysis
When a remote FS server restarts gracefully, OpenSIPS receives a TCP FIN and issues SYN packets every second until the server comes back online, at which point the ESL connection is automatically re-established.
When the FS server is instead halted abruptly (power off), it does not send a FIN and OpenSIPS does not start its SYN polling, so the server's return is never detected and OpenSIPS does not reconnect. In this state the load_balancer module is left with stale data for the impacted host. As the load balancer has not been told to disable the destination, calls continue to be allocated to the instance throughout this period. While the server is offline these fail over gracefully to another instance. Once the server is back online, however, it starts to receive calls while OpenSIPS never receives its heartbeat data, causing the instance to become overloaded (I have multiple OpenSIPS instances feeding into the same pool of FreeSWITCH servers).
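For illustration, here is a minimal sketch of how TCP keepalives let a client notice a peer that vanished without sending a FIN. It is plain Python, not OpenSIPS code, and the helper name `enable_keepalive` and its default timings are my own assumptions. With these values the kernel would report the connection as dead roughly `idle + interval * count` seconds after the last traffic.

```python
import socket

def enable_keepalive(sock, idle=5, interval=1, count=3):
    """Ask the kernel to probe an idle TCP connection so a silently dead peer is noticed."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below use Linux names; skip any the platform lacks.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the connection is dropped
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    # Non-zero means keepalive probing is enabled on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

Once the probes go unanswered, the next read or write on the socket fails, which would give the client a trigger to start the same reconnect polling it already performs after a FIN.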
Expected behavior
The lb_reload operation should handle the reconnection process gracefully: stale connections should be detected and replaced without crashing.
The system should detect a stale destination which has failed to send an ESL HEARTBEAT and either disable the host or attempt reconnection until it returns.
Additionally, HEARTBEAT arrival is not currently tracked or exposed via opensips-cli, so I have no way to implement effective monitoring of this scenario. Exposing this data would be really helpful.
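To make the request concrete, the behaviour I have in mind could look something like the sketch below. It is Python for brevity and not a proposal for the actual module code. The class `HeartbeatWatchdog`, its methods, and the `max_missed` threshold are all hypothetical names of my own. It records the arrival time of each HEARTBEAT per destination, reports the age of the last one (the value I would like to see exposed for monitoring), and lists destinations that have missed too many intervals so they can be disabled or reconnected.

```python
import time

class HeartbeatWatchdog:
    """Tracks the last HEARTBEAT per destination and flags those that have gone quiet."""

    def __init__(self, interval=1.0, max_missed=3, clock=time.monotonic):
        # A destination is considered stale after max_missed heartbeat intervals of silence.
        self.timeout = interval * max_missed
        self.clock = clock
        self.last_seen = {}

    def on_heartbeat(self, dest):
        """Record that a HEARTBEAT event just arrived from dest."""
        self.last_seen[dest] = self.clock()

    def age(self, dest):
        """Seconds since the last heartbeat from dest, or None if none was ever seen."""
        seen = self.last_seen.get(dest)
        return None if seen is None else self.clock() - seen

    def stale(self):
        """Destinations whose last heartbeat is older than the timeout, sorted by name."""
        now = self.clock()
        return sorted(d for d, t in self.last_seen.items() if now - t > self.timeout)
```

A periodic timer could call `stale()` and, for each destination returned, either mark it disabled in the load balancer or force the ESL socket to be torn down and reconnected.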
spacetourist changed the title from "[BUG] segault in freeswitch/load_balancer module after connection loss" to "[BUG] segfault in freeswitch/load_balancer module after connection loss" on Sep 12, 2024
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.