
Update from 1.19.2 to 1.26.1 leads to SIGSEGV #4497

Open
odoucet opened this issue Jan 22, 2025 · 25 comments
Labels: bug (Something isn't working), urgent (Important issue that needs to be fixed asap)

odoucet commented Jan 22, 2025

Describe the bug
We tried upgrading a Dragonfly cluster (two members) from v1.19.2 to v1.26.1.
We restarted the replica instance, and here is the log after the restart:

2025-01-22T15:15:21+01:00 I20250122 14:15:21.933689     1 init.cc:78] dragonfly running in opt mode.
2025-01-22T15:15:21+01:00 I20250122 14:15:21.933818     1 dfly_main.cc:691] Starting dragonfly df-v1.26.1-c2b95415241debb99bf969ebbc8465ce0bfe206f
2025-01-22T15:15:21+01:00 I20250122 14:15:21.934052     1 dfly_main.cc:735] maxmemory has not been specified. Deciding myself....
2025-01-22T15:15:21+01:00 I20250122 14:15:21.934062     1 dfly_main.cc:744] Found 3.00GiB available memory. Setting maxmemory to 2.40GiB
2025-01-22T15:15:21+01:00 W20250122 14:15:21.934125     1 dfly_main.cc:368] Weird error 1 switching to epoll
2025-01-22T15:15:22+01:00 I20250122 14:15:22.014271     1 proactor_pool.cc:147] Running 1 io threads
2025-01-22T15:15:22+01:00 I20250122 14:15:22.017419     1 dfly_main.cc:272] Listening on admin socket any:9999
2025-01-22T15:15:22+01:00 I20250122 14:15:22.018528     1 server_family.cc:835] Host OS: Linux 5.14.0-480.el9.x86_64 x86_64 with 1 threads
2025-01-22T15:15:22+01:00 I20250122 14:15:22.022305     1 snapshot_storage.cc:185] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
2025-01-22T15:15:22+01:00 I20250122 14:15:22.025554     1 server_family.cc:1114] Loading /dragonfly/snapshots/dump-summary.dfs
2025-01-22T15:15:22+01:00 I20250122 14:15:22.034510     6 listener_interface.cc:101] sock[5] AcceptServer - listening on port 9999
2025-01-22T15:15:22+01:00 I20250122 14:15:22.034538     6 listener_interface.cc:101] sock[6] AcceptServer - listening on port 6379
2025-01-22T15:15:22+01:00 I20250122 14:15:22.034548     6 listener_interface.cc:101] sock[7] AcceptServer - listening on port 11211
2025-01-22T15:15:23+01:00 I20250122 14:15:23.489850     6 server_family.cc:1154] Load finished, num keys read: 323
2025-01-22T15:15:34+01:00 I20250122 14:15:34.925325     6 server_family.cc:2769] Replicating 10.233.130.142:9999
2025-01-22T15:15:34+01:00 F20250122 14:15:34.925524     6 db_slice.cc:783] Check failed: fetched_items_.empty() 
2025-01-22T15:15:34+01:00 *** Check failure stack trace: ***
2025-01-22T15:15:34+01:00     @     0x55c033cde923  google::LogMessage::SendToLog()
2025-01-22T15:15:34+01:00     @     0x55c033cd70e7  google::LogMessage::Flush()
2025-01-22T15:15:34+01:00     @     0x55c033cd8a6f  google::LogMessageFatal::~LogMessageFatal()
2025-01-22T15:15:34+01:00     @     0x55c03351770f  dfly::DbSlice::FlushDbIndexes()
2025-01-22T15:15:34+01:00     @     0x55c0335178a2  dfly::DbSlice::FlushDb()
2025-01-22T15:15:34+01:00     @     0x55c033296f2a  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12ServerFamily8DrakarysEPNS3_11TransactionEtEUlS6_PNS3_11EngineShardEE_NS5_14RunnableResultEJS6_S8_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
2025-01-22T15:15:34+01:00     @     0x55c0335577b9  dfly::Transaction::RunCallback()
2025-01-22T15:15:34+01:00     @     0x55c03355a59b  dfly::Transaction::RunInShard()
2025-01-22T15:15:34+01:00     @     0x55c033498790  dfly::EngineShard::PollExecution()
2025-01-22T15:15:34+01:00     @     0x55d7633db3a1  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11DispatchHopEvEUlvE1_E9_M_invokeERKSt9_Any_data
2025-01-22T15:15:34+01:00     @     0x55d76394cf55  util::fb2::FiberQueue::Run()
2025-01-22T15:15:34+01:00     @     0x55d76342d730  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
2025-01-22T15:15:34+01:00     @     0x55d76396c25f  make_fcontext
2025-01-22T15:15:34+01:00     *** SIGABRT received at time=1737556484 on cpu 0 ***
2025-01-22T15:15:34+01:00     PC: @     0x7f1faca299fc  (unknown)  pthread_kill
2025-01-22T15:15:34+01:00     [failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7f1fac9bb898 while already in AbslFailureSignalHandler()
2025-01-22T15:15:34+01:00     *** SIGSEGV received at time=1737556484 on cpu 0 ***
2025-01-22T15:15:34+01:00     PC: @     0x7f1fac9bb898  (unknown)  abort

This happens in a loop.

If we delete the snapshot, a full sync happens and the member starts correctly:

I20250122 14:18:26.534972     1 dfly_main.cc:691] Starting dragonfly df-v1.26.1-c2b95415241debb99bf969ebbc8465ce0bfe206f
I20250122 14:18:26.535194     1 dfly_main.cc:735] maxmemory has not been specified. Deciding myself....
I20250122 14:18:26.535209     1 dfly_main.cc:744] Found 3.00GiB available memory. Setting maxmemory to 2.40GiB
W20250122 14:18:26.535251     1 dfly_main.cc:368] Weird error 1 switching to epoll
I20250122 14:18:26.612828     1 proactor_pool.cc:147] Running 1 io threads
I20250122 14:18:26.615834     1 dfly_main.cc:272] Listening on admin socket any:9999
I20250122 14:18:26.616796     1 server_family.cc:835] Host OS: Linux 5.14.0-480.el9.x86_64 x86_64 with 1 threads
I20250122 14:18:26.620615     1 snapshot_storage.cc:185] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20250122 14:18:26.623203     1 server_family.cc:950] Load snapshot: No snapshot found
I20250122 14:18:26.630414     6 listener_interface.cc:101] sock[5] AcceptServer - listening on port 9999
I20250122 14:18:26.630442     6 listener_interface.cc:101] sock[6] AcceptServer - listening on port 6379
I20250122 14:18:26.630453     6 listener_interface.cc:101] sock[7] AcceptServer - listening on port 11211
I20250122 14:18:46.526578     6 server_family.cc:2769] Replicating 10.233.130.142:9999
I20250122 14:18:47.943701     6 replica.cc:569] Started full sync with 10.233.130.142:9999
I20250122 14:18:49.127570     6 replica.cc:589] full sync finished in 2.6 s
I20250122 14:18:49.127673     6 replica.cc:679] Transitioned into stable sync

On another cluster, even deleting the snapshot did not help: the full resync failed every time (100% failure after 10+ restarts):

I20250122 14:36:19.137816     1 server_family.cc:835] Host OS: Linux 5.14.0-480.el9.x86_64 x86_64 with 1 threads
I20250122 14:36:19.149236     1 snapshot_storage.cc:185] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20250122 14:36:19.152247     1 server_family.cc:950] Load snapshot: No snapshot found
I20250122 14:36:19.160511     6 listener_interface.cc:101] sock[5] AcceptServer - listening on port 9999
I20250122 14:36:19.160552     6 listener_interface.cc:101] sock[6] AcceptServer - listening on port 6379
I20250122 14:36:19.160574     6 listener_interface.cc:101] sock[7] AcceptServer - listening on port 11211
I20250122 14:36:37.548327     6 server_family.cc:2769] Replicating 10.233.131.239:9999
I20250122 14:36:37.560622     6 replica.cc:569] Started full sync with 10.233.131.239:9999
I20250122 14:36:37.593991     6 server_family.cc:2769] Replicating 10.233.131.239:9999
W20250122 14:36:37.594259     6 replica.cc:246] Error syncing with 10.233.131.239:9999 generic:125 Operation canceled
I20250122 14:36:37.660379     6 replica.cc:569] Started full sync with 10.233.131.239:9999
I20250122 14:36:39.770289     6 server_family.cc:2769] Replicating 10.233.131.239:9999
W20250122 14:36:39.770624     6 replica.cc:246] Error syncing with 10.233.131.239:9999 generic:125 Operation canceled
F20250122 14:36:39.775619     6 db_slice.cc:783] Check failed: fetched_items_.empty() 
*** Check failure stack trace: ***
    @     0x55681cb35923  google::LogMessage::SendToLog()
    @     0x55681cb2e0e7  google::LogMessage::Flush()
    @     0x55681cb2fa6f  google::LogMessageFatal::~LogMessageFatal()
    @     0x55681c36e70f  dfly::DbSlice::FlushDbIndexes()
    @     0x55681c36e8a2  dfly::DbSlice::FlushDb()
    @     0x55681c0edf2a  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12ServerFamily8DrakarysEPNS3_11TransactionEtEUlS6_PNS3_11EngineShardEE_NS5_14RunnableResultEJS6_S8_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x55681c3ae7b9  dfly::Transaction::RunCallback()
    @     0x55681c3b159b  dfly::Transaction::RunInShard()
    @     0x55681c2ef790  dfly::EngineShard::PollExecution()
    @     0x55681c3aa3a1  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11DispatchHopEvEUlvE1_E9_M_invokeERKSt9_Any_data
    @     0x55681c91bf55  util::fb2::FiberQueue::Run()
    @     0x55681c3fc730  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x55681c93b25f  make_fcontext
*** SIGABRT received at time=1737556599 on cpu 0 ***
PC: @     0x7f4768cc29fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7f4768c54898 while already in AbslFailureSignalHandler()

On the master instance, at the same moment:

I20250122 14:36:37.550398     6 dflycmd.cc:651] Registered replica 10.233.130.163:6379
I20250122 14:36:37.559907     6 dflycmd.cc:345] Started sync with replica 10.233.130.163:6379
I20250122 14:36:37.648327     6 dflycmd.cc:110] Disconnecting from replica 10.233.130.163:6379
I20250122 14:36:37.648401     6 rdb_save.cc:1247] Error writing to rdb sink Broken pipe
I20250122 14:36:37.648496     6 dflycmd.cc:641] Replication error: Operation canceled: Context cancelled
I20250122 14:36:37.650457     6 dflycmd.cc:651] Registered replica 10.233.130.163:6379
I20250122 14:36:37.659725     6 dflycmd.cc:345] Started sync with replica 10.233.130.163:6379
I20250122 14:36:39.784649     6 rdb_save.cc:1247] Error writing to rdb sink Broken pipe
I20250122 14:36:39.784729     6 dflycmd.cc:641] Replication error: Broken pipe
I20250122 14:36:39.784785     6 dflycmd.cc:110] Disconnecting from replica 10.233.130.163:6379

In this case we were forced to start over and set up a new empty cluster.

To Reproduce
This was reproduced on FOUR different Dragonfly clusters, with the same update (same source/destination versions).
Every time but one, deleting the .dfs file made it work.

Expected behavior
A working update, or at least a full resync from the master without a crash.

Environment (please complete the following information):

  • Dragonfly official image docker.dragonflydb.io/dragonflydb/dragonfly
  • Run in Kubernetes 1.30.4
odoucet added the bug label Jan 22, 2025
adiholden (Collaborator) commented

Hi @odoucet. Thanks for reporting this.

odoucet (Author) commented Jan 22, 2025

We were able to reproduce it on a new instance that started crashing a few hours after the migration.
The stack trace is the same:

F20250122 16:47:58.648206     7 db_slice.cc:783] Check failed: fetched_items_.empty()
*** Check failure stack trace: ***
    @     0x56088fa69923  google::LogMessage::SendToLog()
    @     0x56088fa620e7  google::LogMessage::Flush()
    @     0x56088fa63a6f  google::LogMessageFatal::~LogMessageFatal()
    @     0x56088f2a270f  dfly::DbSlice::FlushDbIndexes()
    @     0x56088f2a28a2  dfly::DbSlice::FlushDb()
    @     0x56088f021f2a  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12ServerFamily8DrakarysEPNS3_11TransactionEtEUlS6_PNS3_11EngineShardEE_NS5_14RunnableResultEJS6_S8_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x56088f2e27b9  dfly::Transaction::RunCallback()
    @     0x56088f2e559b  dfly::Transaction::RunInShard()
    @     0x56088f223790  dfly::EngineShard::PollExecution()
    @     0x56088f2de3a1  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11DispatchHopEvEUlvE1_E9_M_invokeERKSt9_Any_data
    @     0x56088f84ff55  util::fb2::FiberQueue::Run()
    @     0x56088f330730  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x56088f86f25f  make_fcontext
*** SIGABRT received at time=1737564478 on cpu 1 ***
PC: @     0x7fcdf39739fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7fcdf3905898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1737564478 on cpu 1 ***
PC: @     0x7fcdf3905898  (unknown)  abort

We rolled back to 1.26.0 and I will update this ticket if the bug stops happening.

romange added the urgent label Jan 22, 2025
odoucet (Author) commented Jan 22, 2025

Same bug on 1.26.0:

I20250122 19:23:05.273779     6 server_family.cc:1154] Load finished, num keys read: 193674
I20250122 19:23:05.452224     7 version_monitor.cc:174] Your current version '1.26.0' is not the latest version. A newer version '1.26.1' is now available. Please consider an update.
I20250122 19:23:24.684263     7 server_family.cc:2765] Replicating 10.233.137.104:9999
F20250122 19:23:24.684446     7 db_slice.cc:783] Check failed: fetched_items_.empty() 
*** Check failure stack trace: ***
    @     0x55a599c55923  google::LogMessage::SendToLog()
    @     0x55a599c4e0e7  google::LogMessage::Flush()
    @     0x55a599c4fa6f  google::LogMessageFatal::~LogMessageFatal()
    @     0x55a5994900cf  dfly::DbSlice::FlushDbIndexes()
    @     0x55a599490262  dfly::DbSlice::FlushDb()
    @     0x55a59920fefa  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12ServerFamily8DrakarysEPNS3_11TransactionEtEUlS6_PNS3_11EngineShardEE_NS5_14RunnableResultEJS6_S8_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x55a5994d00e9  dfly::Transaction::RunCallback()
    @     0x55a5994d2ecb  dfly::Transaction::RunInShard()
    @     0x55a599411160  dfly::EngineShard::PollExecution()
    @     0x55a5994cbcd1  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11DispatchHopEvEUlvE1_E9_M_invokeERKSt9_Any_data
    @     0x55a599a3ba75  util::fb2::FiberQueue::Run()
    @     0x55a59951dbf0  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x55a599a5ad7f  make_fcontext
*** SIGABRT received at time=1737573804 on cpu 1 ***
PC: @     0x7f3f62d2b9fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7f3f62cbd898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1737573804 on cpu 1 ***
PC: @     0x7f3f62cbd898  (unknown)  abort

We need 1.26.0 because of Prometheus 3 compatibility.

odoucet (Author) commented Jan 23, 2025

If it helps, the bug is only happening on replica servers so far.

romange (Collaborator) commented Jan 23, 2025

Yeah, I think it's helpful. If you can, please paste the INFO ALL output from the replica.

odoucet (Author) commented Jan 23, 2025

Here is the output from less than one second before the crash. What's strange is "role:master". I guess the node does not yet know there is another master and crashes when it connects to it.

# Server
redis_version:7.2.0
dragonfly_version:df-v1.26.1
redis_mode:standalone
arch_bits:64
os:Linux 5.14.0-480.el9.x86_64 x86_64
thread_count:1
multiplexing_api:epoll
tcp_port:6379
uptime_in_seconds:17
uptime_in_days:0

# Clients
connected_clients:1
max_clients:64000
client_read_buffer_bytes:256
blocked_clients:0
pipeline_queue_length:0
send_delay_ms:0

# Memory
used_memory:2160514320
used_memory_human:2.01GiB
used_memory_peak:2160514320
used_memory_peak_human:2.01GiB
fibers_stack_vms:425776
fibers_count:13
used_memory_rss:2260889600
used_memory_rss_human:2.11GiB
used_memory_peak_rss:2431455232
maxmemory:2576980377
maxmemory_human:2.40GiB
used_memory_lua:0
object_used_memory:36549296
type_used_memory_string:36549296
table_used_memory:281904
num_buckets:480
num_entries:719
inline_keys:3
listpack_blobs:0
listpack_bytes:0
small_string_bytes:45872
pipeline_cache_bytes:0
dispatch_queue_bytes:0
dispatch_queue_subscriber_bytes:0
dispatch_queue_peak_bytes:0
client_read_buffer_peak_bytes:256
tls_bytes:5664
snapshot_serialization_bytes:0
commands_squashing_replies_bytes:0
cache_mode:cache
maxmemory_policy:eviction
replication_streaming_buffer_bytes:0
replication_full_sync_buffer_bytes:0

# Stats
total_connections_received:16
total_commands_processed:30
instantaneous_ops_per_sec:3
total_pipelined_commands:0
total_pipelined_squashed_commands:0
pipeline_throttle_total:0
pipelined_latency_usec:0
total_net_input_bytes:1232
connection_migrations:0
total_net_output_bytes:57687
rdb_save_usec:0
rdb_save_count:0
big_value_preemptions:0
compressed_blobs:0
instantaneous_input_kbps:-1
instantaneous_output_kbps:-1
rejected_connections:-1
expired_keys:0
evicted_keys:0
hard_evictions:0
garbage_checked:0
garbage_collected:0
bump_ups:21
stash_unloaded:0
oom_rejections:0
traverse_ttl_sec:628
delete_ttl_sec:0
keyspace_hits:0
keyspace_misses:0
keyspace_mutations:964
total_reads_processed:28
total_writes_processed:29
defrag_attempt_total:0
defrag_realloc_total:0
defrag_task_invocation_total:0
reply_count:29
reply_latency_usec:0
blocked_on_interpreter:0
lua_interpreter_cnt:0
lua_blocked_total:0

# Tiered
tiered_entries:0
tiered_entries_bytes:0
tiered_total_stashes:0
tiered_total_fetches:0
tiered_total_cancels:0
tiered_total_deletes:0
tiered_total_uploads:0
tiered_total_stash_overflows:0
tiered_heap_buf_allocations:0
tiered_registered_buf_allocations:0
tiered_allocated_bytes:0
tiered_capacity_bytes:0
tiered_pending_read_cnt:0
tiered_pending_stash_cnt:0
tiered_small_bins_cnt:0
tiered_small_bins_entries_cnt:0
tiered_small_bins_filling_bytes:0
tiered_cold_storage_bytes:0
tiered_offloading_steps:0
tiered_offloading_stashes:0
tiered_ram_hits:0
tiered_ram_cool_hits:0
tiered_ram_misses:0

# Persistence
current_snapshot_perc:0
current_save_keys_processed:0
current_save_keys_total:0
last_success_save:1737623316
last_saved_file:
last_success_save_duration_sec:0
loading:0
saving:0
current_save_duration_sec:0
rdb_changes_since_last_success_save:964
last_failed_save:0
last_error:
last_failed_save_duration_sec:0

# Transaction
tx_shard_polls:0
tx_shard_optimistic_total:0
tx_shard_ooo_total:0
tx_global_total:0
tx_normal_total:0
tx_inline_runs_total:0
tx_schedule_cancel_total:0
tx_batch_scheduled_items_total:0
tx_batch_schedule_calls_total:0
tx_with_freq:0
tx_queue_len:0
eval_io_coordination_total:0
eval_shardlocal_coordination_total:0
eval_squashed_flushes:0
multi_squash_execution_total:0
multi_squash_execution_hop_usec:0
multi_squash_execution_reply_usec:0

# Replication
role:master
connected_slaves:0
master_replid:985f47f8b8ca3cd5f6f8a75bd48f0e2b874227b2

# Commandstats
cmdstat_auth:calls=14,usec=831,usec_per_call=59.3571
cmdstat_info:calls=13,usec=2752,usec_per_call=211.692
cmdstat_ping:calls=2,usec=28,usec_per_call=14

# Modules
module:name=ReJSON,ver=20000,api=1,filters=0,usedby=[search],using=[],options=[handle-io-errors]
module:name=search,ver=20000,api=1,filters=0,usedby=[],using=[ReJSON],options=[handle-io-errors]

# Search
search_memory:0
search_num_indices:0
search_num_entries:0

# Errorstats

# Keyspace
db0:keys=719,expires=350,avg_ttl=-1

# Cpu
used_cpu_sys:1.584828
used_cpu_user:1.978862
used_cpu_sys_children:0.2876
used_cpu_user_children:0.3150
used_cpu_sys_main_thread:1.381782
used_cpu_user_main_thread:1.889157

# Cluster
cluster_enabled:0

odoucet (Author) commented Jan 23, 2025

Dragonfly command line used:

--alsologtostderr --primary_port_http_enabled=false --admin_port=9999 --admin_nopass --dbfilename=dump --cache_mode=true --memcached_port=11211 --dir=/dragonfly/snapshots --snapshot_cron=*/5 * * * *

romange (Collaborator) commented Jan 23, 2025

Yes, I agree, that's strange. It aligns with the stack trace we see, though. The crash occurs within the Flush code, which runs when the replica attempts to resynchronize with the master. Is the network connection between your nodes unreliable, cross-region, or over the public internet?

odoucet (Author) commented Jan 23, 2025

> Yes, I agree, that's strange. It aligns with the stack trace we see, though. The crash occurs within the Flush code, which runs when the replica attempts to resynchronize with the master. Is the network connection between your nodes unreliable, cross-region, or over the public internet?

Not at all; they are on a private, local, high-speed network (< 5ms latency).

romange (Collaborator) commented Jan 23, 2025

So there are a few issues here:

  1. Severe: we crash upon reload.
  2. Less severe but still troubling: the replica disconnects during replication.

While we are trying to solve (1), I suggest that you run your servers with the additional argument --vmodule=streamer=1,dflycmd=1,replica=1

Also, if you can, please attach the INFO log from your master.

adiholden (Collaborator) commented

@odoucet
Did you see the replica reconnecting only when replicating from master version v1.19 or also when replicating from master version v1.26?
These errors are related to reconnecting:
W20250122 14:36:37.594259 6 replica.cc:246] Error syncing with 10.233.131.239:9999 generic:125 Operation canceled

chakaz (Collaborator) commented Jan 23, 2025

Hi @odoucet, do you know if the replica received incoming traffic (like GET commands, etc.) while it started to replicate?
What I mean is: was it handling requests in parallel with trying to replicate a master?

odoucet (Author) commented Jan 23, 2025

> Hi @odoucet, do you know if the replica received incoming traffic (like GET commands, etc.) while it started to replicate? What I mean is: was it handling requests in parallel with trying to replicate a master?

You can see in the "info all" output that only auth/info commands were executed (my tests), plus 2 "ping" commands (from the dragonfly operator, I guess).

chakaz (Collaborator) commented Jan 23, 2025

I am able to reproduce this issue, with a big caveat.
If the RDB file has a recurring key (i.e., a key that appears twice), then loading it will put an entry in fetched_items_, which will never be cleared because the load runs outside of a command context (we clear fetched_items_ in DbSlice::OnCbFinish).

So the instructions to reproduce are:

  1. Create an RDB with the same key twice (not trivial, I manually modified the file)
  2. Load it via --dbfilename, such that it is loaded outside the context of a command
  3. Call FLUSHALL (and likely also REPLICAOF) immediately after

I remember that we had an issue a while back that could cause the same key to be saved multiple times into an RDB file. I don't know if it was in 1.19 or not. But in any case that is not what happens here, as the warning W20250123 14:19:17.709569 463838 rdb_load.cc:2620] RDB has duplicated key 'x123456' in DB 0 is not printed in @odoucet's case.

I still think this is a good direction (I've been looking into it since this morning). We need to figure out in which case loading an RDB could keep an entry in fetched_items_ even without a key appearing twice. Still looking, but if @adiholden / @romange have other ideas, let me know.
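
To make the failure mode concrete, here is a minimal, self-contained sketch of the lifecycle described above. This is not Dragonfly's actual code; the class, members, and call sites are simplified stand-ins based only on what is stated in this thread (cache-mode lookups record bumped keys in fetched_items_, OnCbFinish clears it per command, and the flush path checks that it is empty):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Toy stand-in for the fatal CHECK macro used by glog.
#define TOY_CHECK(cond)                                \
  if (!(cond)) {                                       \
    std::cerr << "Check failed: " #cond << std::endl;  \
    std::abort();                                      \
  }

class ToyDbSlice {
 public:
  void Insert(const std::string& key, int value) { prime_[key] = value; }

  // In cache mode, a successful lookup bumps the entry and records its key,
  // analogous to PrimeBumpPolicy{&fetched_items_} in db_slice.cc.
  bool FindInCacheMode(const std::string& key) {
    if (prime_.count(key) == 0) return false;
    fetched_items_.insert(key);
    return true;
  }

  // Command epilogue, analogous to DbSlice::OnCbFinish(): clears bookkeeping.
  void OnCbFinish() { fetched_items_.clear(); }

  // Flush, analogous to DbSlice::FlushDbIndexes(): requires no leftover bumps.
  void FlushDb() {
    TOY_CHECK(fetched_items_.empty());
    prime_.clear();
  }

 private:
  std::unordered_map<std::string, int> prime_;
  std::unordered_set<std::string> fetched_items_;
};

int main() {
  ToyDbSlice db;
  db.Insert("x123456", 1);

  // RDB load path: a duplicate key is looked up while loading, outside any
  // command context, so OnCbFinish() is never called for it.
  db.FindInCacheMode("x123456");

  // REPLICAOF / FLUSHALL right after the load: the leftover entry trips the
  // check, mirroring "Check failed: fetched_items_.empty()" from the logs.
  db.FlushDb();
}
```

Under this model, any code path that performs a cache-mode lookup without a matching OnCbFinish() (a duplicate key during RDB load being one candidate) leaves residue that the next flush, including the implicit flush done by REPLICAOF, turns into a crash.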

chakaz (Collaborator) commented Jan 23, 2025

To clarify, this happens only with --cache_mode in the line:

auto bump_it = db.prime.BumpUp(res.it, PrimeBumpPolicy{&fetched_items_});

So in the scenario I am thinking about, we call FindInternal() in the context of the load, and the key already exists.

romange (Collaborator) commented Jan 23, 2025

It's a good direction, @chakaz. I know that folks who have multiple snapshots on their disk can end up loading files from multiple file sets, i.e. Dragonfly can load a dfs file from another snapshot set. This may happen when a process ran with different thread configurations and files of different arities were created, if I remember correctly.

romange (Collaborator) commented Jan 23, 2025

Why do we BumpUp during the RDB load?

romange (Collaborator) commented Jan 23, 2025

(I think we should not bump up during the load)

chakaz (Collaborator) commented Jan 23, 2025

We bump up just because we use the standard API (FindOrUpdate(), etc.) during the load process. It's not explicit...
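
To illustrate the direction discussed above (not bumping up during load), here is a hypothetical sketch; the track_bumps flag and the policy shape are invented for illustration and are not the actual Dragonfly change:

```cpp
// Hypothetical sketch only: the track_bumps flag and this policy shape are
// invented for illustration; they are not the actual Dragonfly patch.
#include <iostream>
#include <string>
#include <unordered_set>

struct ToyBumpPolicy {
  bool track_bumps;                          // false while loading an RDB
  std::unordered_set<std::string>* fetched;  // per-command bookkeeping

  void OnBump(const std::string& key) const {
    if (track_bumps) fetched->insert(key);   // skip bookkeeping during load
  }
};

int main() {
  std::unordered_set<std::string> fetched_items;

  // Command path: bumps are tracked and cleared later by the command epilogue.
  ToyBumpPolicy command_policy{/*track_bumps=*/true, &fetched_items};
  command_policy.OnBump("key-from-GET");

  // Load path: the same lookup machinery runs, but with tracking disabled,
  // so nothing is left behind for a later FLUSHALL / REPLICAOF to trip over.
  ToyBumpPolicy load_policy{/*track_bumps=*/false, &fetched_items};
  load_policy.OnBump("duplicate-key-from-rdb");

  std::cout << "tracked bumps: " << fetched_items.size() << "\n";  // prints 1
}
```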

chakaz (Collaborator) commented Jan 23, 2025

@odoucet can you share the command you use to set up the replication? REPLICAOF <host> <port> or something else?
(Specifically I want to see that you're not using ADDREPLICAOF)

romange (Collaborator) commented Jan 23, 2025

@chakaz, they use the dragonfly operator, and it does not use ADDREPLICAOF.

odoucet (Author) commented Jan 23, 2025

> @odoucet Did you see the replica reconnecting only when replicating from master version v1.19 or also when replicating from master version v1.26? These errors are related to reconnecting: W20250122 14:36:37.594259 6 replica.cc:246] Error syncing with 10.233.131.239:9999 generic:125 Operation canceled

The bug actually triggers between two 1.26.1 instances (master/replica), though the dfs file was generated on 1.19.
Last crash:

On the replica:

2025-01-23T11:38:10+01:00 I20250123 10:38:10.886876     6 replica.cc:679] Transitioned into stable sync
2025-01-23T11:38:12+01:00 I20250123 10:38:12.898600     6 server_family.cc:2769] Replicating 10.233.130.20:9999
2025-01-23T11:38:12+01:00 I20250123 10:38:12.898733     6 replica.cc:703] Exit stable sync
2025-01-23T11:38:12+01:00 F20250123 10:38:12.899036     6 db_slice.cc:783] Check failed: fetched_items_.empty() 

On the master, here are the events:

2025-01-23T11:35:00+01:00 I20250123 10:35:00.777082     6 save_stages_controller.cc:337] Saving "/dragonfly/snapshots/dump-summary.dfs" finished after 1 s
2025-01-23T11:38:10+01:00 I20250123 10:38:10.558867     6 dflycmd.cc:651] Registered replica 10.233.131.97:6379
2025-01-23T11:38:10+01:00 I20250123 10:38:10.723584     6 dflycmd.cc:345] Started sync with replica 10.233.131.97:6379
2025-01-23T11:38:10+01:00 I20250123 10:38:10.886333     6 dflycmd.cc:385] Transitioned into stable sync with replica 10.233.131.97:6379
2025-01-23T11:38:12+01:00 I20250123 10:38:12.899302     6 dflycmd.cc:110] Disconnecting from replica 10.233.131.97:6379
2025-01-23T11:38:12+01:00 I20250123 10:38:12.899538     6 dflycmd.cc:641] Replication error: Operation canceled: Context cancelled
2025-01-23T11:40:00+01:00 I20250123 10:40:00.937175     6 save_stages_controller.cc:337] Saving "/dragonfly/snapshots/dump-summary.dfs" finished after 0 us

odoucet (Author) commented Jan 23, 2025

> To clarify, this happens only with --cache_mode in the line:
> dragonfly/src/server/db_slice.cc, line 484 in 69ef997

Ah, interesting! Part of the update from 1.19 to 1.26 was to set this flag explicitly. I think it was not set before.

odoucet (Author) commented Jan 23, 2025

> While we are trying to solve (1), I suggest that you run your servers with the additional argument --vmodule=streamer=1,dflycmd=1,replica=1

Can I add this argument on all instances (I'm using the dragonfly operator)? Can you tell me what these flags do? The Dragonfly documentation is a bit lacking (or I'm not reading the correct page :p).

romange (Collaborator) commented Jan 23, 2025

> While we are trying to solve (1), I suggest that you run your servers with the additional argument --vmodule=streamer=1,dflycmd=1,replica=1
>
> Can I add this argument on all instances (I'm using the dragonfly operator)? Can you tell me what these flags do? The Dragonfly documentation is a bit lacking (or I'm not reading the correct page :p).

They just control the verbosity level of the relevant modules in Dragonfly; you should see more logs. Yes, you can add them to all instances.
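
For context, a minimal illustration of how glog-style per-module verbosity works (the stack traces above show glog's google::LogMessage, and this assumes glog is built with gflags support); the function and file placement below are hypothetical:

```cpp
#include <cstddef>

#include <gflags/gflags.h>
#include <glog/logging.h>

// Hypothetical example: if this code lived in replica.cc, then running the
// binary with --vmodule=replica=1 enables the VLOG(1) line below for that
// file only, without raising the global verbosity (--v).
void ReportSyncProgress(std::size_t bytes_received) {
  VLOG(1) << "full sync progress: " << bytes_received << " bytes";
}

int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, /*remove_flags=*/true);  // picks up --vmodule / --v
  google::InitGoogleLogging(argv[0]);
  ReportSyncProgress(1024);
  return 0;
}
```

Running such a binary with --vmodule=replica=1 prints the VLOG(1) line; omitting the flag keeps it silent.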
