Crashing host process on socket timeout #1690

kenkendk · 2024-12-09T08:56:47Z

FTP Server OS: Linux

FTP Server Type: Unknown

Client Computer OS: Linux

FluentFTP Version: 50.1.0

Framework: .NET8

During a timeout it looks like the shutdown process causes an unhandled exception.

I think the cause is this line:
https://github.com/robinrodricks/FluentFTP/blob/master/FluentFTP/Streams/FtpSocketStream.cs#L560

In here it will attempt to close the connection due to a timeout, but if the socket is already faulty, the close call will fail as well.
Since this happens in the token callback, the uncaught exception has nowhere to go.

Logs :

2024-11-28T10:01:04.428515378Z Unhandled exception. System.TimeoutException: Timed out trying to read data from the socket stream!
2024-11-28T10:01:04.428549934Z    at FluentFTP.FtpSocketStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken token)
2024-11-28T10:01:04.428554434Z    at FluentFTP.FtpSocketStream.ReadLineAsync(Encoding encoding, CancellationToken token)
2024-11-28T10:01:04.428565138Z    at FluentFTP.Client.BaseClient.BaseFtpClient.FluentFTP.IInternalFtpClient.GetReplyInternal(CancellationToken token, String command, Boolean exhaustNoop, Int32 timeOut, Boolean useSema)
2024-11-28T10:01:04.428569601Z    at FluentFTP.Client.BaseClient.BaseFtpClient.FluentFTP.IInternalFtpClient.GetReplyInternal(CancellationToken token, String command)
2024-11-28T10:01:04.428572767Z    at FluentFTP.Client.BaseClient.BaseFtpClient.FluentFTP.IInternalFtpClient.CloseDataStreamInternal(FtpDataStream stream, CancellationToken token)
2024-11-28T10:01:04.428576045Z    at FluentFTP.Client.BaseClient.BaseFtpClient.FluentFTP.IInternalFtpClient.CloseDataStreamInternal(FtpDataStream stream, CancellationToken token)
2024-11-28T10:01:04.428579008Z    at FluentFTP.FtpDataStream.CloseAsync(CancellationToken token)
2024-11-28T10:01:04.428581767Z    at FluentFTP.FtpSocketStream.<>c__DisplayClass74_0.<<ReadAsync>b__0>d.MoveNext()
2024-11-28T10:01:04.428585082Z --- End of stack trace from previous location ---
2024-11-28T10:01:04.428587693Z    at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
2024-11-28T10:01:04.428590767Z    at System.Threading.QueueUserWorkItemCallback.Execute()
2024-11-28T10:01:04.428593527Z    at System.Threading.ThreadPoolWorkQueue.Dispatch()
2024-11-28T10:01:04.428596693Z    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

The text was updated successfully, but these errors were encountered:

kenkendk · 2024-12-09T09:06:08Z

I will see if the upgrade to 52.0.0 fares better. I can see that the close sequence was updated in the new version.
Note that the ReadAsync method that takes the Span<byte> input is still using the previous logic with the close happening in the CancellationToken.Register callback, where the byte[] method is updated.

I think there is still some logic related to handling the close, because a cancelled (outer) token will cause a different cancellation error than what is intended (will throw from within the close call, not the if statement later).
And in the case the socket is faulty, it will also throw a different exception, hiding the original error.

But at least the call chain is now preserved, and the exception should bubble to the caller.

FanDjango · 2024-12-10T10:14:02Z

Yes, please check this with V52.

As we should be ending up here:

https://github.com/FanDjango/FluentFTP/blob/0087db909a7e1ebac4b43e30cc8bbbb1f334ed1c/FluentFTP/Streams/FtpSocketStream.cs?plain=1#L1822C1-L1840C1

and I would be interested in how that "fares".

kenkendk · 2024-12-22T22:28:48Z

It appears to still happen with 52.0.0, but I have not collected enough details yet:
duplicati/duplicati#5730 (comment)

FanDjango · 2024-12-22T22:30:35Z

Ok, this one will be difficult. Stack trace might be helpful as a starter. I wonder about the exact order of events and methods called on this timeout. Oh, and we do now have some good logging in there also, so perhaps if you can get logs?

kenkendk · 2024-12-31T15:25:17Z

@FanDjango , could it be something with multithreading? Duplicati will have multiple instances of AsyncFtpClient created assuming they are separate instances. If there is some cross-talk between the instances, that could explain perhaps one connection closing another instance?

There is a full stack trace in the issue I linked above, and I cannot see any retry happening inside FluentFTP.
But the error message would make a lot of sense if the connection was somehow closed prior to making the call.

I will try to see if the current "LogToConsole" is sufficient to get deeper into the problem, and add some additional logging options in case it is not.

FanDjango · 2024-12-31T20:16:26Z

You have reported a number of isssues. I need to be sure that we can keep them separate.

#1686 This is classified as a bug, it is pending someone working on it, there is a workaround

#1694 This is classified as a bug, has been fixed in master, pending release, there is a workaround

#1690 (This one) This is not classified yet. Deity only knows what is going on here.

This one:

could it be something with multithreading?

I don't really think we still have holes in thread safety. We did have some in the FluentFTP.GnuTLS component, but only in the beginning and they were discovered right away. And without any logs I cannot even discern if you are using SslStream or GnuTLSStream.

...and I cannot see any retry happening inside FluentFTP.

#1694 was the issue which relates to "retry" activity.

But this issue, for me to try to help: I would need MORE information, especially logging. This would show up the sequence of events, across all client sessions. Note that FluentFTP logs of your clients should have a logger that prefixes the client-identificator in front of the log lines, so that in a multi threaded FTP environment, one can correlate log entries to their client.

So I must reiterate:

#1690 (comment)

During a timeout it looks like the shutdown process causes an unhandled exception.

I will try to recreate this somehow in my own environment.

My analysis of the stack trace you provided (at the top of this) and looking at the code:

(Presumably) a transfer ends. Upload or Download (it is important) is not known to me. A log would tell.
FtpDataStream is closed, higher logic has decided to do this. Why? When we decide a transfer has ended, or it needs to end, we close the data stream.
This data stream close ENTAILS a GetReply read on the socket of the control connection.
And you are right: If this control connection is also closed/broken, we get a failure on that read.

That part of the code you could call "faulty" and this needs to be fixed/addressed (I will do so, actually, but not this year :-) ).

But it would also be very very nice to find out what is going wrong in the first place - unless it is a total eclipse of the net - both connections breaking down simultaneously, maybe.

And that will be the first scenario I will test in order to reproduce this. Not this year, though.

Until then, I hope you can find out more, but I am sure we can make progress in this in the end.

FanDjango · 2024-12-31T20:30:24Z

Specifically, I can see the problem in CloseDataStream.cs:

 			try {
				if (IsConnected) {
					// Because the data connection was closed, if the command that required the data
					// connection was not successful then there will be no reply from the server,
					// however if the command was successful the server will send a reply .
					if (stream.CommandStatus.Type == FtpResponseType.PositivePreliminary) {
						if (!(reply = ((IInternalFtpClient)this).GetReplyInternal(LastCommandExecuted)).Success) {
							throw new FtpCommandException(reply);
						}
					}
				}
			}
			finally {
				// if this is a clone of the original control connection we should Dispose() the entire client
				if (IsClone) {
					((IInternalFtpClient)this).DisconnectInternal();
					((IInternalFtpClient)this).DisposeInternal();
				}
			}

The call to IsConnected needs to be enhanced or probably better, the exception needs to be handled.

Lemme look at this some more in 2025, ok?

kenkendk · 2025-01-01T12:47:25Z

Thanks for taking the time to sort out the mixed report threads.
For this one, my workaround is to simply handle unobserved exceptions, but good if it can get fixed.

Fix #1690

FanDjango · 2025-01-02T11:22:21Z

Fixed in master CloseDataStream.cs. Will be released eventually.

FanDjango added the Investigating label Dec 9, 2024

FanDjango added response requested and removed Investigating labels Dec 10, 2024

kenkendk mentioned this issue Jan 1, 2025

auto-reconnect not connecting when it is needed for capabilites access #1698

Open

FanDjango closed this as completed in 85cf7ac Jan 2, 2025

FanDjango added a commit that referenced this issue Jan 2, 2025

Merge pull request #1699 from FanDjango/master

de101a7

Fix #1690

FanDjango added the fixed label Jan 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crashing host process on socket timeout #1690

Crashing host process on socket timeout #1690

kenkendk commented Dec 9, 2024

kenkendk commented Dec 9, 2024

FanDjango commented Dec 10, 2024

kenkendk commented Dec 22, 2024

FanDjango commented Dec 22, 2024 •

edited

Loading

kenkendk commented Dec 31, 2024

FanDjango commented Dec 31, 2024 •

edited

Loading

FanDjango commented Dec 31, 2024

kenkendk commented Jan 1, 2025

FanDjango commented Jan 2, 2025

Crashing host process on socket timeout #1690

Crashing host process on socket timeout #1690

Comments

kenkendk commented Dec 9, 2024

kenkendk commented Dec 9, 2024

FanDjango commented Dec 10, 2024

kenkendk commented Dec 22, 2024

FanDjango commented Dec 22, 2024 • edited Loading

kenkendk commented Dec 31, 2024

FanDjango commented Dec 31, 2024 • edited Loading

FanDjango commented Dec 31, 2024

kenkendk commented Jan 1, 2025

FanDjango commented Jan 2, 2025

FanDjango commented Dec 22, 2024 •

edited

Loading

FanDjango commented Dec 31, 2024 •

edited

Loading