Update from 1.19.2 to 1.26.1 leads to SIGSEGV #4497
Comments
Hi @odoucet. Thanks for reporting this.
We were able to reproduce it on a new instance that started crashing a few hours after the migration.
We rolled back to 1.26.0, and I will update this ticket if the bug stops happening.
Same bug on 1.26.0:
We need 1.26.0 because of Prometheus 3 compatibility.
If it helps, the bug is only happening on replica servers so far.
yeah, I think it's helpful. if you can, please paste the relevant output.
Here is the output less than one second before the crash. What's strange is "role:master". I guess the node does not yet know there is another master, and crashes when it connects to it.
Dragonfly command line used:
Yes, I agree, that's strange. It aligns with the stack trace we see, though. The crash occurs within the Flush code, which runs when the replica attempts to resynchronize with the master. Is the network connection between your nodes unreliable, cross-region, or over the public internet?
Not at all; they are on a private, local, high-speed network (< 5 ms latency).
so there are a few issues here:
While we are trying to solve (1), I suggest that you run your servers with an additional argument. Also, if you can, please attach the INFO log from your master.
@odoucet
Hi @odoucet, do you know if the replica received any incoming traffic?
You can see in "info all" that only auth/info commands were executed (my tests), and two "ping" commands (from the dragonfly operator, I guess).
I am able to reproduce this issue, with a big caveat. So the instructions to reproduce are:
I remember that we had an issue a while back that could cause the same key to be saved multiple times into an RDB file. I don't know if it was in 1.19 or not. But in any case this is not our case, as the corresponding warning does not show up here. I think this is a good direction though (I've been looking into it since this morning). We need to figure out in which case loading an RDB could keep an entry in the pending bump-up set.
To clarify, this happens only with this code path: dragonfly/src/server/db_slice.cc, line 484 (at 69ef997).
So in the scenario I think about, we call that code path during the RDB load.
it's a good direction @chakaz. i know that folks that have multiple snapshots on their disk succeed in loading files from multiple file sets, i.e. dragonfly can load a dfs file from another snapshot set. this may happen when a process had different thread configurations and different arities of files were created, if I remember correctly.
why do we BumpUp during the rdb load? |
(I think we should not bump up during the load) |
We bump up just because we use the standard API.
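To make the scenario under discussion concrete, here is a minimal, self-contained C++ sketch of the suspected failure mode. It is not Dragonfly's actual code: `ToyTable`, `pending_bump_`, and `LoadEntry` are hypothetical stand-ins for a lookup API that remembers entries for a later bump-up and a load path that replaces an existing entry (for example, when a snapshot contains the same key twice).

```cpp
// Hypothetical illustration only -- not Dragonfly's implementation.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

class ToyTable {
 public:
  // Standard lookup path: remembers the entry so it can be bumped up later.
  const std::string* Find(const std::string& key) {
    auto it = table_.find(key);
    if (it == table_.end()) return nullptr;
    pending_bump_.push_back(&it->second);  // deferred BumpUp bookkeeping
    return &it->second;
  }

  // Load path: on a duplicate key, the old entry is erased and replaced.
  void LoadEntry(const std::string& key, std::string value) {
    table_.erase(key);  // frees the old node; any remembered pointer dangles
    table_.emplace(key, std::move(value));
  }

  // Later, the deferred bump-ups run and dereference the stored pointers.
  void RunPendingBumps() {
    for (const std::string* v : pending_bump_)
      std::cout << "bump " << *v << '\n';  // UB if the entry was replaced
    pending_bump_.clear();
  }

 private:
  std::unordered_map<std::string, std::string> table_;
  std::vector<const std::string*> pending_bump_;
};

int main() {
  ToyTable t;
  t.LoadEntry("key", "v1");
  t.Find("key");             // the loader's lookup queues a bump-up for "v1"
  t.LoadEntry("key", "v2");  // duplicate key in the snapshot replaces "v1"
  t.RunPendingBumps();       // dereferences a dangling pointer -> crash/UB
  return 0;
}
```

Under this reading, the suggestion above (not registering bump-ups during the load, or draining the pending set before the table is mutated or flushed) would close the window in which a stale pointer can survive.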
@odoucet can you share the command you use to set up the replication? |
@chakaz they use the dragonfly operator, and it does not use a plain command to set up the replication.
The bug actually triggers between two 1.26.1 instances (master/replica), but the dfs file was generated on 1.19. On the replica:
On the master, here are the events:
Ah, interesting! Part of the update from 1.19 to 1.26 was to set this flag explicitly. I think it was not set before.
Can I add these arguments on all instances (I'm using the dragonfly operator)? Can you tell me what they do? The Dragonfly documentation is a bit lacking (or I'm not reading the correct page :p).
they just control the verbosity level of relevant modules in dragonfly. you should see more logs. Yes, you can add them to all instances. |
Describe the bug
We tried upgrading a dragonfly cluster (two members) from v1.19.2 to v1.26.1
We restarted the replica instance, and here is the log after reboot:
This happens in a loop.
If we delete the snapshot, a full sync happens and the member starts correctly:
On another cluster, even deleting the snapshot and doing a full resync failed (100% failure after 10+ restarts):
On the master instance, at the same moment:
In this case we were forced to start over and set up a new empty cluster.
To Reproduce
This was reproduced on FOUR different Dragonfly clusters, with the same update (same source/destination versions).
Every time but one, deleting the dfs file made it work.
Expected behavior
A working update, or at least a full resync from the master without a crash.
Environment (please complete the following information):