Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The Videbridge needs a long time to fix connection problems to the XMPP server. #2052

Open
freym opened this issue Sep 25, 2023 · 8 comments

Comments

@freym
Copy link

freym commented Sep 25, 2023

This Issue tracker is only for reporting bugs and tracking code related issues.

Description

We discovered that the videobridge takes a long time to recover and reconnect if there is an short error in the connection to the XMPP Server (= restart of Prosody, changes in our Network configuration).

Current behavior

JVB 2023-09-25 14:43:46.120 WARNUNG: [136] [hostname=REDACTED id=public-jitsi] MucClient$PingFailedListenerImpl.pingFailed#864: Ping failed, the XMPP connection needs to reconnect.
JVB 2023-09-25 14:43:46.120 WARNUNG: [136] [hostname=REDACTED id=public-jitsi] MucClient$PingFailedListenerImpl.pingFailed#869: XMPP connection still connected, will trigger a disconnect.
JVB 2023-09-25 14:43:50.665 SCHWERWIEGEND: [33] [hostname=REDACTED id=public-jitsi] MucClient$MucWrapper.setPresenceExtensions#802: Failed to send stanza:
org.jivesoftware.smack.SmackException$NotConnectedException: The connection XMPPTCPConnection[[email protected]/un6dsi1n475V] (0) is no longer connected. done=true smResumptionPossible=false
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.throwNotConnectedExceptionIfDoneAndResumptionNotPossible(XMPPTCPConnection.java:1242)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.throwNotConnectedExceptionIfAppropriate(XMPPTCPConnection.java:351)
at org.jivesoftware.smack.AbstractXMPPConnection.sendStanza(AbstractXMPPConnection.java:858)
at org.jitsi.xmpp.mucclient.MucClient$MucWrapper.setPresenceExtensions(MucClient.java:798)
at org.jitsi.xmpp.mucclient.MucClient.lambda$setPresenceExtensions$5(MucClient.java:470)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtensions(MucClient.java:470)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtension(MucClient.java:454)
at org.jitsi.xmpp.mucclient.MucClientManager.lambda$setPresenceExtension$0(MucClientManager.java:164)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClientManager.setPresenceExtension(MucClientManager.java:163)
at org.jitsi.videobridge.xmpp.XmppConnection.setPresenceExtension(XmppConnection.kt:85)
at org.jitsi.videobridge.stats.MucStatsTransport.publishStatistics(MucStatsTransport.java:66)
at org.jitsi.videobridge.stats.StatsCollector$TransportPeriodicRunnable.doRun(StatsCollector.kt:148)
at org.jitsi.utils.concurrent.PeriodicRunnableWithObject.run(PeriodicRunnableWithObject.java:87)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)
JVB 2023-09-25 14:43:51.122 INFORMATION: [137] org.jivesoftware.smack.AbstractXMPPConnection.waitForClosingStreamTagFromServer: Exception while waiting for closing stream element from the server XMPPTCPConnection[[email protected]/un6dsi1n475V] (0)
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 5000ms (~5s). While waiting for closing stream tag from the server [XMPPTCPConnection[[email protected]/un6dsi1n475V] (0)]
at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:99)
at org.jivesoftware.smack.AbstractXMPPConnection.waitForConditionOrThrowConnectionException(AbstractXMPPConnection.java:731)
at org.jivesoftware.smack.AbstractXMPPConnection.waitForClosingStreamTagFromServer(AbstractXMPPConnection.java:1017)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.shutdown(XMPPTCPConnection.java:513)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.shutdown(XMPPTCPConnection.java:493)
at org.jivesoftware.smack.AbstractXMPPConnection.disconnect(AbstractXMPPConnection.java:965)
at org.jivesoftware.smack.AbstractXMPPConnection.disconnect(AbstractXMPPConnection.java:937)
at org.jitsi.xmpp.mucclient.MucClient$PingFailedListenerImpl.pingFailed(MucClient.java:874)
at org.jivesoftware.smackx.ping.PingManager$8.processException(PingManager.java:465)
at org.jivesoftware.smackx.ping.PingManager$8.processException(PingManager.java:448)
at org.jivesoftware.smack.SmackFuture$2.run(SmackFuture.java:191)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)

Now every five seconds the following exception is printed to the logs:

JVB 2023-09-25 14:43:55.665 SCHWERWIEGEND: [33] [hostname=REDACTED id=public-jitsi] MucClient$MucWrapper.setPresenceExtensions#802: Failed to send stanza:
org.jivesoftware.smack.SmackException$NotConnectedException: The connection XMPPTCPConnection[[email protected]/un6dsi1n475V] (0) is no longer connected. done=true smResumptionPossible=false
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.throwNotConnectedExceptionIfDoneAndResumptionNotPossible(XMPPTCPConnection.java:1242)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.throwNotConnectedExceptionIfAppropriate(XMPPTCPConnection.java:351)
at org.jivesoftware.smack.AbstractXMPPConnection.sendStanza(AbstractXMPPConnection.java:858)
at org.jitsi.xmpp.mucclient.MucClient$MucWrapper.setPresenceExtensions(MucClient.java:798)
at org.jitsi.xmpp.mucclient.MucClient.lambda$setPresenceExtensions$5(MucClient.java:470)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtensions(MucClient.java:470)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtension(MucClient.java:454)
at org.jitsi.xmpp.mucclient.MucClientManager.lambda$setPresenceExtension$0(MucClientManager.java:164)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClientManager.setPresenceExtension(MucClientManager.java:163)
at org.jitsi.videobridge.xmpp.XmppConnection.setPresenceExtension(XmppConnection.kt:85)
at org.jitsi.videobridge.stats.MucStatsTransport.publishStatistics(MucStatsTransport.java:66)
at org.jitsi.videobridge.stats.StatsCollector$TransportPeriodicRunnable.doRun(StatsCollector.kt:148)
at org.jitsi.utils.concurrent.PeriodicRunnableWithObject.run(PeriodicRunnableWithObject.java:87)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)

After ~ 15 minutes the connection is closed and the videobridge successfully establishes a connection. The videobridge immediately establishes a connection when we manually restart the service. It ist also possible to connect to the XMPP Server with telnet after the error happens.

JVB 2023-09-25 14:57:13.185 INFORMATION: [136] [hostname=REDACTED id=public-jitsi] MucClient$2.connectionClosed#334: Closed.
JVB 2023-09-25 14:57:15.664 WARNUNG: [33] [hostname=REDACTED id=public-jitsi] MucClient.setPresenceExtensions#467: Cannot set presence extension: not connected.
JVB 2023-09-25 14:57:16.072 WARNUNG: [26] [hostname=REDACTED id=public-jitsi] MucClient.lambda$getConnectAndLoginCallable$9#640: Error connecting:
org.jivesoftware.smack.SmackException$SmackWrappedException: java.net.SocketException: Die Verbindung wurde vom Kommunikationspartner zurückgesetzt
at org.jivesoftware.smack.AbstractXMPPConnection.setCurrentConnectionExceptionAndNotify(AbstractXMPPConnection.java:687)
at org.jivesoftware.smack.AbstractXMPPConnection.notifyConnectionError(AbstractXMPPConnection.java:987)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.access$5400(XMPPTCPConnection.java:131)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.writePackets(XMPPTCPConnection.java:1419)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.access$4200(XMPPTCPConnection.java:1172)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter$1.run(XMPPTCPConnection.java:1222)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.SocketException: Die Verbindung wurde vom Kommunikationspartner zurückgesetzt
at java.base/sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:425)
at java.base/sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:445)
at java.base/sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:831)
at java.base/java.net.Socket$SocketOutputStream.write(Socket.java:1035)
at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:234)
at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:304)
at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
at java.base/sun.nio.cs.StreamEncoder.write(StreamEncoder.java:132)
at java.base/sun.nio.cs.StreamEncoder.write(StreamEncoder.java:142)
at java.base/java.io.Writer.write(Writer.java:249)
at java.base/java.io.OutputStreamWriter.append(OutputStreamWriter.java:237)
at org.jivesoftware.smack.util.XmlStringBuilder.lambda$write$0(XmlStringBuilder.java:675)
at org.jivesoftware.smack.util.XmlStringBuilder.appendXmlTo(XmlStringBuilder.java:717)
at org.jivesoftware.smack.util.XmlStringBuilder.appendXmlTo(XmlStringBuilder.java:707)
at org.jivesoftware.smack.util.XmlStringBuilder.write(XmlStringBuilder.java:673)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.writePackets(XMPPTCPConnection.java:1356)
... 3 more

Expected Behavior

The videobridge should be more should be more tolerant of network errors and recover faster.

Environment details

The videobridge runs on a bare metal server (OS is Debian)
Prosody, Jicofo, Web are running on Kubernetes

Videobridge: 2.3-44-g8983b11f-1
Prosody: stable-8719
Jicofo: stable-8719

@damencho
Copy link
Member

What is the value of net.ipv4.tcp_retries2 in sysconf?

@freym
Copy link
Author

freym commented Sep 25, 2023

$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

@damencho
Copy link
Member

This means that the kernel will detect broken TCP connections after the 13-15 min mark.
You can try setting it to 7, this will give you roughly 25 to 50 seconds.

@bgrozev
Copy link
Member

bgrozev commented Sep 26, 2023

@freym Can you clarify steps to reproduce? I get a "closed on error" log immediately after restarting prosody (and then a successful reconnect).

@freym
Copy link
Author

freym commented Sep 27, 2023

It happened when we restarted Prosody once, but we now think the problem is related to our network setup and Kubernetes. However, reducing net.ipv4.tcp_retries2 to 7 has alleviated the problem somewhat.

@jbg
Copy link

jbg commented Sep 27, 2023

In k8s it could also happen if the prosody deployment is restarted and the new pod(s) take a long time to go Ready (eg due to readiness or startup probes); if you're using a service DNS name to reach them from JVB then it won't resolve to the new pod IP(s) until they're Ready.

Can also be related to a too-long TTL being served on those service A records (configured in CoreDNS config on most k8s deployments) so JVB still has the old IP for a while.

@bgrozev
Copy link
Member

bgrozev commented Sep 27, 2023

The "Ping failed, the XMPP connection needs to reconnect." message you see early on should trigger a reconnect. I'm surprised that in your case it doesn't log anything about trying to reconnect.

@bgrozev
Copy link
Member

bgrozev commented Sep 27, 2023

I just tested the case where the server stops responding. As expected the XMPP pings timeout in 30-60 seconds and they trigger a re-connect. I suspect it has something to do with your setup (feel free to reopen if you have more info or more questions).

@bgrozev bgrozev closed this as completed Sep 27, 2023
@bgrozev bgrozev reopened this Nov 1, 2023
bgrozev added a commit to bgrozev/jicoco that referenced this issue Nov 6, 2023
Attempts to address jitsi/jitsi-videobridge#2052 by not sending presence
unavailable when disconnecting due to a connection failure. If the TCP
connection is alive then the server will notice that it's closed. If it
isn't alive then it won't matter anyway.
bgrozev added a commit to jitsi/jicoco that referenced this issue Nov 7, 2023
Attempts to address jitsi/jitsi-videobridge#2052 by not sending presence
unavailable when disconnecting due to a connection failure. If the TCP
connection is alive then the server will notice that it's closed. If it
isn't alive then it won't matter anyway.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants