Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Active dialogs skyrocket #3321

Open
Lt-Flash opened this issue Feb 29, 2024 · 27 comments
Open

[BUG] Active dialogs skyrocket #3321

Lt-Flash opened this issue Feb 29, 2024 · 27 comments

Comments

@Lt-Flash
Copy link

OpenSIPS version you are running

version: opensips 3.3.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 39c4e245f
main.c compiled on 20:50:23 Feb 15 2024 with gcc 9

Describe the bug

I'm using a dialog module with just a memory storage, no MySQL or any other storage is connected. Also I'm using a full-sharing cluster of three nodes. Initially dialogs were showing statistics correctly but then suddenly it started to show a huge number of active dialogs. In opensips-cli -x mi dlg_list it only shows real dialog number, but in opensips-cli -x mi get_statistics all it show following (related to dialog module):

    "dialog:active_dialogs": 1.844674e+19,
    "dialog:early_dialogs": 12,
    "dialog:processed_dialogs": 64701,
    "dialog:expired_dialogs": 0,
    "dialog:failed_dialogs": 36074,
    "dialog:create_sent": 9115,
    "dialog:update_sent": 10587,
    "dialog:delete_sent": 9121,
    "dialog:create_recv": 19457,
    "dialog:update_recv": 22791,
    "dialog:delete_recv": 19483,

Here's a graph built by Grafana using Prometheus module for OpenSIPS:

image

To Reproduce

Just start OpenSIPS and wait for some time, usually within several hours it starts to display this number.

Expected behavior

Active dialogs statistic value should be showing correct number of active dialogs.

Relevant System Logs

OS/environment information

  • Operating System: Ubuntu 20.04.6
  • OpenSIPS installation: built from source by myself
  • other relevant information:

Additional context

@Lt-Flash
Copy link
Author

Lt-Flash commented Mar 1, 2024

Yesterday I've updated OpenSIPS to 3.4.4 and today the same issue has happened again.

OpenSIPS version:

version: opensips 3.4.4 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 0a7c2f2ed
main.c compiled on 09:31:12 Feb 29 2024 with gcc 9

Graph:
image

opensips-cli -x mi get_statistics dialog::

{
    "dialog:active_dialogs": 1.844674e+19,
    "dialog:early_dialogs": 28,
    "dialog:processed_dialogs": 75001,
    "dialog:expired_dialogs": 0,
    "dialog:failed_dialogs": 40892,
    "dialog:create_sent": 11056,
    "dialog:update_sent": 12668,
    "dialog:delete_sent": 11052,
    "dialog:create_recv": 23045,
    "dialog:update_recv": 27022,
    "dialog:delete_recv": 23037
}

@bogdan-iancu
Copy link
Member

It looks like an underflow here, of the dialog counter. What are the clustering settings for the dialog module?

@Lt-Flash
Copy link
Author

Lt-Flash commented Mar 2, 2024

Hi Bogdan,
Here's a full config of that server, that's a full-sharing cluster of three nodes. Interesting that I have another cluster of three nodes in front of this one and that issue doesn't occur there. Also I have a report in our local Telegram group from someone who has the same issue and they're neither using autoscaler nor cluster at all. Thanks!

####### Global Parameters #########
log_level=3
xlog_level=3
stderror_enabled=no
syslog_enabled=yes
syslog_facility=LOG_LOCAL0

auto_scaling_profile = PROFILE_TCP
     scale up to 32 on 60% for 4 cycles within 5
     scale down to 2 on 10% for 10 cycles

auto_scaling_profile = PROFILE_UDP
     scale up to 32 on 60% for 4 cycles within 5
     scale down to 2 on 10% for 10 cycles

udp_workers=2 use_auto_scaling_profile PROFILE_UDP
tcp_workers=2 use_auto_scaling_profile PROFILE_TCP

#timer_workers=4

auto_aliases=no

socket=bin:10.X.X.X:3857
socket=udp:10.X.X.X:5060 tag vip
socket=hep_udp:10.X.X.X

mhomed = no

dns=no
rev_dns=no
dns_use_search_list=no

server_header="Server: Server"
user_agent_header="User-Agent: UA"

####### Modules Section ########

#set module path
mpath="/opt/opensips/lib64/opensips/modules/"

# MAX ForWarD module
loadmodule "maxfwd.so"

# SIP MSG OPerationS module
loadmodule "sipmsgops.so"

# FIFO Management Interface
loadmodule "mi_fifo.so"
modparam("mi_fifo", "fifo_name", "/opt/opensips/var/opensips_fifo")
modparam("mi_fifo", "fifo_mode", 0666)

# SIGNALING module
loadmodule "signaling.so"

# StateLess module
loadmodule "sl.so"

# Transaction Module
loadmodule "tm.so"
modparam("tm", "fr_timeout", 30)
modparam("tm", "fr_inv_timeout", 90)
modparam("tm", "restart_fr_on_each_reply", 1)
modparam("tm", "onreply_avp_mode", 1)

# TextOPS module
loadmodule "textops.so"

# Record Route Module
loadmodule "rr.so"
modparam("rr", "append_fromtag", 1)
modparam("rr", "enable_double_rr", 0)
modparam("rr", "enable_socket_mismatch_warning", 1)

# USRLoc module
loadmodule "usrloc.so"
modparam("usrloc", "nat_bflag", "NAT")
modparam("usrloc", "working_mode_preset", "full-sharing-cluster")
modparam("usrloc", "use_domain", 0)
modparam("usrloc", "location_cluster", 1)
#modparam("usrloc", "contact_refresh_timer", true)

# Clusterer Module
loadmodule "clusterer.so"
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_info", "cluster_id=1,node_id=1,url=bin:10.X.X.1:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=2,url=bin:10.X.X.2:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=3,url=bin:10.X.X.3:3857,flags=seed")
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "ping_interval", 1)
modparam("clusterer", "ping_timeout", 500)
modparam("clusterer", "node_timeout", 10)
modparam("clusterer", "sharing_tag", "vip1/1=active")

# Dialog module
loadmodule "dialog.so"
modparam("dialog", "dialog_replication_cluster", 1)
modparam("dialog", "profile_replication_cluster", 1)
modparam("dialog", "cluster_auto_sync", 1)
modparam("dialog", "reinvite_ping_interval", 30)

# AVPOPS module
loadmodule "avpops.so"

# Math
loadmodule "mathops.so"
modparam ("mathops", "decimal_digits", 3)

# UAC
loadmodule "uac.so"

# AUTH module
loadmodule "auth.so"
modparam("auth", "username_spec", "$var(username)")
modparam("auth", "password_spec", "$var(password)")
modparam("auth", "calculate_ha1", 0)

# Local cache
loadmodule "cachedb_local.so"
modparam("cachedb_local", "cache_clean_period", 10)

# Redis CacheDB
loadmodule "cachedb_redis.so"
modparam("cachedb_redis", "cachedb_url","redis:OS://:CONNECT_STRING/")

# Exec module
loadmodule "exec.so"

# Event Route module
loadmodule "event_route.so"

# Event Routing module
loadmodule "event_routing.so"

# Event Stream
loadmodule "event_stream.so"

# Registrar module
loadmodule "registrar.so"
modparam("registrar", "received_avp", "$avp(received)")
modparam("registrar", "min_expires", 60)
modparam("registrar", "default_expires", 120)
modparam("registrar", "max_expires", 3600)
modparam("registrar", "max_contacts", 1)
modparam("registrar", "attr_avp", "$avp(event_attr)")
modparam("registrar", "pn_enable", true)
modparam("registrar", "pn_providers", "apns, fcm, webpush")
modparam("registrar", "pn_enable_purr", true)

# REST API client
loadmodule "rest_client.so"
modparam("rest_client", "connection_timeout", 1)
modparam("rest_client", "ssl_verifypeer", 0)
modparam("rest_client", "ssl_verifyhost", 0)
modparam("rest_client", "curl_http_version", 4)

# JSON module
loadmodule "json.so"

# HTTPd Server
loadmodule "httpd.so"
modparam("httpd", "ip", "10.X.X.X")
modparam("httpd", "port", XXXX)

# HTTP Management Interface
loadmodule "mi_http.so"
modparam("mi_http", "root", "os_mi_http")

# Prometheus module
loadmodule "prometheus.so"
modparam("prometheus", "statistics", "all")

# Proto UDP
loadmodule "proto_udp.so"

# Proto BIN
loadmodule "proto_bin.so"

# Proto HEP
loadmodule "proto_hep.so"
modparam("proto_hep", "hep_capture_id", 10XXX)
modparam("proto_hep", "hep_id", "[hid]10.X.X.X:9060;transport=udp;version=3")

# Tracer
loadmodule "tracer.so"
modparam("tracer", "trace_on", 1)
modparam("tracer", "trace_id", "[tid]uri=hep:hid")

@rybushkindmitry
Copy link

Hi!

I have similar problems with active dialogs metric.

My setting dialog module

loadmodule "dialog.so"
modparam("dialog", "dlg_match_mode", 1)
modparam("dialog", "default_timeout", 3600) # 1 hours timeout
modparam("dialog", "db_mode", 1)
modparam("dialog", "db_url", "{{ connect_mysql }}")
modparam("dialog", "table_name", "{{ dialog_table }}")
modparam("dialog", "dialog_replication_cluster", {{ cluster_id }})
modparam("dialog", "cachedb_url", "{{ connect_local_dialog }}")
modparam("dialog", "profiles_with_value", "some list of shared profiles;")

version: opensips 3.3.7 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: unknown
main.c compiled on with gcc 10

after update to
version: opensips 3.4.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 3a891b9
main.c compiled on with gcc 10

and settings dialog module:
loadmodule "dialog.so"
modparam("dialog", "dlg_match_mode", 1)
modparam("dialog", "default_timeout", 3600) # 1 hours timeout
modparam("dialog", "db_mode", 1)
modparam("dialog", "db_url", "{{ connect_mysql }}")
modparam("dialog", "table_name", "{{ dialog_table }}")
#modparam("dialog", "dialog_replication_cluster", {{ cluster_id }})
modparam("dialog", "cachedb_url", "{{ connect_local_dialog }}")
modparam("dialog", "profiles_with_value", "some list of non shared profiles;")

problem is gone

@Lt-Flash
Copy link
Author

Lt-Flash commented Mar 4, 2024

Yep, but this doesn't work for me in OpenSIPS 3.4.4 :(

@Lt-Flash
Copy link
Author

Lt-Flash commented Mar 7, 2024

Any debug we can do for you, @bogdan-iancu ?

Copy link

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Mar 22, 2024
@Lt-Flash
Copy link
Author

Noticed that from time to time dialogs return back to normal and then hit the overflow counter again:

image

@github-actions github-actions bot removed the stale label Mar 26, 2024
Copy link

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Apr 11, 2024
@bogdan-iancu
Copy link
Member

@Lt-Flash , officially 3.3 is EOL, only 3.2 and 3.4 are still maintained. Why moving to 3.4 is such a no-go for you ?

@stale stale bot removed the stale label Apr 11, 2024
@Lt-Flash
Copy link
Author

Bump. Still an issue, can't monitor OpenSIPS properly. Please help someone!

@Lt-Flash
Copy link
Author

@Lt-Flash , officially 3.3 is EOL, only 3.2 and 3.4 are still maintained. Why moving to 3.4 is such a no-go for you ?

Sorry, mate, somehow missed this reply! I have actually moved to 3.4 and the issue still happens, this is from 3.4:

image

Copy link

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Apr 27, 2024
@Lt-Flash
Copy link
Author

Still doesn't work properly in 3.3.9 and 3.4.4.

@stale stale bot removed the stale label Apr 27, 2024
@bogdan-iancu
Copy link
Member

Could you try a hard reset of your cluster, meaning stopping all opensips nodes in the cluster and while down, purge the dialog tables of all nodes? Restart with basically 0 ongoing calls and see if the issue happens again

@Lt-Flash
Copy link
Author

Hi Bogdan,
I'm not using any DB for dialog tables or registrar or anything at all on these servers, it's a full-sharing cluster with in-memory only storage, no CacheDB except for REDIS, please have a look at my config in the original message.

I tried to do a full restart and that didn't help. Also, I'm already on a 3.4.5 version since yesterday and still getting the same overflow issues:

image

@Lt-Flash
Copy link
Author

Lt-Flash commented May 3, 2024

Update here - I've disabled dialog replication within the cluster and now dialogs do not skyrocket. I can see active dialogs per node, not the total amount, but there are no issues with that. Looks like another bug in cluster sync.

Copy link

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label May 19, 2024
@Lt-Flash
Copy link
Author

Confirmed that if I disable dialog sync within the cluster then the issue goes away so this is definitely something to do with the clusterer module.

@stale stale bot removed the stale label May 19, 2024
Copy link

github-actions bot commented Jun 4, 2024

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Jun 4, 2024
@Lt-Flash
Copy link
Author

Lt-Flash commented Jun 4, 2024

Still an issue with clusterer enabled

@stale stale bot removed the stale label Jun 4, 2024
Copy link

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Jun 20, 2024
@Lt-Flash
Copy link
Author

Still an issue

@stale stale bot removed the stale label Jun 21, 2024
Copy link

github-actions bot commented Jul 6, 2024

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Jul 6, 2024
@Lt-Flash
Copy link
Author

Still an issue with the latest OpenSIPS 3.4.7 build from GIT.

@github-actions github-actions bot removed the stale label Aug 19, 2024
Copy link

github-actions bot commented Sep 3, 2024

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Sep 3, 2024
@stale stale bot removed the stale label Sep 17, 2024
@Lt-Flash
Copy link
Author

Still an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants