test_request_too_large[SSL] gets stuck in an endless loop trying to connect to scylla #261
Comments
I can see it locally (but not 100% of the time). Trying to bisect whether a specific scylla commit introduced it.
reproducer:
reproduced on:
It seems the issue can happen regardless of the scylla version.
I can't repro this, either locally (enterprise) or in docker (2023.1). Do we have CI runs where this happened?
We are seeing it on master and, as I wrote above, with enough runs and parallelism of the same test it reproduces locally for me. It happened on:
It's a bit hard to see it on CI, since it causes the whole test run to time out, but here you can see the node log of it:
Well, the log indicates that the CQL connection gets a
Why the next connection attempt fails is a mystery. I am leaning toward the client side being the issue, but as I still cannot repro this in either the relocated or the local env, I really don't know. I am also perplexed by the:
I've run it locally a few times, and my conclusion is that it's a driver bug.
failed one:
working one:
It looks like there are cases where the driver doesn't reconnect after the connection is closed by scylla (because of the too-big request). @roydahan @mykaul, can you move this one to scylladb/python-driver?
After debugging it more on the driver end, it seems to happen as follows:
Also, it seems like we are doing a query inside
`set_keyspace_blocking` is called from places that hold a lock, and if the connection is closed from the server side, it might hang forever. Use the `connect_timeout` on it to make sure it won't hang forever. Ref: https://github.com/scylladb/scylladb/issues/15661
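To illustrate why a blocking wait under a lock needs a timeout, here is a minimal sketch. It is not the driver's code: `FakeConnection`, `SketchTimeout`, `deliver_response` and `wait_for_response` are hypothetical names made up for this example, and only the general idea (bound the wait with `connect_timeout` so the lock is always released) comes from the description above.

```python
import threading


class SketchTimeout(Exception):
    """Hypothetical error for this sketch: no response arrived in time."""


class FakeConnection:
    """Stand-in for a driver connection (assumption, not the real class)."""

    def __init__(self, connect_timeout):
        self.connect_timeout = connect_timeout
        self.lock = threading.Lock()
        self.keyspace = None
        self._response = threading.Event()

    def deliver_response(self):
        # Simulates the server answering the keyspace request.
        self._response.set()

    def wait_for_response(self, timeout=None):
        # With timeout=None this blocks forever if the server closed
        # the connection and no response will ever arrive.
        if not self._response.wait(timeout):
            raise SketchTimeout("no response within %s seconds" % timeout)

    def set_keyspace_blocking(self, keyspace):
        # The lock is held while waiting, so an unbounded wait would also
        # block every other thread that needs this lock.
        with self.lock:
            self.wait_for_response(timeout=self.connect_timeout)
            self.keyspace = keyspace


if __name__ == "__main__":
    conn = FakeConnection(connect_timeout=0.1)
    try:
        conn.set_keyspace_blocking("ks")  # server never answers
    except SketchTimeout as exc:
        print("gave up:", exc)
    print("lock still held:", conn.lock.locked())
```

With the timeout in place, the caller gets an error it can handle (for example by reconnecting) and the lock is released on the way out, instead of the whole thing waiting forever.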
Aha, very tricky.
Any idea why the connection is closed by the server? Of course, the driver should be prepared for it.
Well, the test sends a huge request on purpose, but the connection getting stuck here isn't the one that should be closed because of the too-big request. So it's not yet clear to me why we are seeing it only in the SSL case, BTW. The hunch is that we recently started using one ssl_context, so it might be affecting it too. Anyhow, just putting a timeout on the stuck request fixes the situation, and each request should have some timeout, especially if used within a lock. Just need to see it doesn't break all other cases.
And the plot thickens a bit: we recently updated the ssl version we are using in this test from and with so there is something related to the TLS protocol that causes this change of behavior...
@elcallio any idea what's going on?
Is there a driver version where things work, and one where it does not? Again, I cannot repro this at all.
Looking at recent seastar changes, this change might be affecting the connection-closing logic?
Cc @xemul |
It happens with the latest driver version, 3.26.3, and latest scylla master. It happens about 1 in 10 runs, like this:
you can play with the
I can easily reproduce it locally, multiple times, like that.
I think that change is safe, unless some code in scylla decided to actually turn off wait-for-eof...
Running the same, iterations of 10, 20, more... No result whatsoever. Same driver, same scylla. Weird. Most likely a network setup difference...
Maybe different openssl versions
I did. :-) Inside and outside the dtest container.
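For comparing OpenSSL builds between environments (host vs. container, as discussed above), something like the following could be run in each place with the same Python interpreter the driver uses. This is only a sketch: `openssl_summary` is a made-up helper name, and it reports only what Python's standard `ssl` module is linked against, not what scylla itself uses.

```python
import ssl


def openssl_summary():
    """Collect the OpenSSL details visible to this Python interpreter."""
    return {
        "version": ssl.OPENSSL_VERSION,            # human-readable version string
        "version_info": ssl.OPENSSL_VERSION_INFO,  # tuple of integers
        "has_tlsv1_2": ssl.HAS_TLSv1_2,            # TLS 1.2 support compiled in
        "has_tlsv1_3": ssl.HAS_TLSv1_3,            # TLS 1.3 support compiled in
    }


if __name__ == "__main__":
    for key, value in openssl_summary().items():
        print(key, "=", value)
```

If the output differs between the machine where it reproduces and the one where it doesn't, that would at least narrow down the TLS side of things.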
https://github.com/scylladb/scylla-dtest/pull/3670 is hiding this issue for now, as far as I can see. TODO: run the test with tcpdump, so we have a working network dump and a failing network dump to compare.
and now with tcpdumps:
the failed dump:
two examples of the test passing:
I figured out I had captured only one direction of the communication, so here's one that captures both directions (similar directory structure):
It can also be downloaded over https:
Did you have a chance to look at those tcp dumps?
@fruch - did you manage to decode those?
They are irrelevant: this test intentionally creates huge requests, so those are expected. It's not expected for the client to get stuck because of those.
This test is invalidated in newer scylla.
test goes into a loop like the following: