Use truncated exponential backoff for reconnection
When communication with the Riemann server is connection-oriented
(TCP/TLS) and the link breaks, riemann-wrapper drops the events that
failed to be sent, logs a warning, and immediately tries to reconnect
and proceed with the remaining data.

When the Riemann server is unreachable because of a network
connectivity issue, this new connection will likely fail immediately,
freshly gathered events will be dropped, a new warning will be logged,
and a new connection will be attempted immediately.

Because we do not wait before reconnecting, we log an unexpectedly large
number of messages about dropped events, and because of the delays
introduced by the reconnection attempts, we might send stale data once
the connection succeeds again.

Rework the disconnection detection logic to apply truncated exponential
backoff when the connection dies.  Sleep at least 0.5 s and at most 30 s
between attempts, and drop any pending events before trying to
reconnect.
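
As an illustration of the delays this implies, here is a minimal sketch
that computes the successive sleep durations for a run of consecutive
failures. The constants mirror the values introduced by this commit;
the `backoff_schedule` helper is hypothetical and exists only for this
example, it is not part of riemann-tools.

```ruby
# Values taken from the commit: start at 0.5 s, double each time, cap at 30 s.
BACKOFF_TMIN = 0.5
BACKOFF_TMAX = 30.0
BACKOFF_FACTOR = 2

# Hypothetical helper (assumption, for illustration only): returns the delay
# slept after each of the first `attempts` consecutive failures.
def backoff_schedule(attempts)
  delay = BACKOFF_TMIN
  Array.new(attempts) do
    current = delay
    # Grow the delay for the next failure, truncated at the maximum.
    delay = [delay * BACKOFF_FACTOR, BACKOFF_TMAX].min
    current
  end
end

p backoff_schedule(8)
# prints [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With these values the cap is reached on the seventh consecutive failure,
and a single successful send resets the delay to 0.5 s.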
smortex committed Sep 26, 2023
1 parent 42e3d11 commit ecdf3ab
Showing 1 changed file with 15 additions and 1 deletion.
16 changes: 15 additions & 1 deletion lib/riemann/tools/riemann_client_wrapper.rb
@@ -9,6 +9,10 @@ module Tools
     class RiemannClientWrapper
       attr_reader :options

+      BACKOFF_TMIN = 0.5 # Minimum delay between reconnection attempts
+      BACKOFF_TMAX = 30.0 # Maximum delay
+      BACKOFF_FACTOR = 2
+
       def initialize(options)
         @options = options

@@ -18,15 +22,25 @@ def initialize(options)

         @worker = Thread.new do
           Thread.current.abort_on_exception = true
+          backoff_delay = BACKOFF_TMIN
+
           loop do
             events = []

             events << @queue.pop
             events << @queue.pop while !@queue.empty? && events.size < @max_bulk_size

             client.bulk_send(events)
+            backoff_delay = BACKOFF_TMIN
           rescue StandardError => e
-            warn "Dropping #{events.size} event#{'s' if events.size > 1} due to #{e}"
+            sleep(backoff_delay)
+
+            dropped_count = events.size + @queue.size
+            @queue.clear
+            warn "Dropped #{dropped_count} event#{'s' if dropped_count > 1} due to #{e}"
+
+            backoff_delay *= BACKOFF_FACTOR
+            backoff_delay = BACKOFF_TMAX if backoff_delay > BACKOFF_TMAX
           end
         end
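
The retry behaviour of the rescue branch above can be tried out in
isolation with a rough sketch like the following. It is not the
wrapper's actual code: `FlakyClient` and `send_batches` are hypothetical
names, the queue is replaced by a plain list of batches, and the sleep
is injectable so the example runs instantly.

```ruby
# Hypothetical stand-in for a Riemann client: the first `failures` calls raise.
class FlakyClient
  attr_reader :delivered

  def initialize(failures)
    @failures = failures
    @delivered = []
  end

  def bulk_send(events)
    if @failures.positive?
      @failures -= 1
      raise IOError, 'connection refused'
    end
    @delivered.concat(events)
  end
end

# Tries each batch once. On failure: wait, drop the batch, grow the delay
# (doubling, capped at 30 s). On success: reset the delay to 0.5 s.
# `sleeper` is an assumption of this sketch so callers can skip real sleeping.
def send_batches(client, batches, sleeper: ->(seconds) { sleep(seconds) })
  delay = 0.5
  slept = []
  batches.each do |events|
    client.bulk_send(events)
    delay = 0.5
  rescue StandardError
    sleeper.call(delay)
    slept << delay
    delay = [delay * 2, 30.0].min
  end
  slept
end

client = FlakyClient.new(3)
slept = send_batches(client, [[1], [2], [3], [4], [5]], sleeper: ->(_seconds) {})
p slept            # delays applied after each failed attempt
p client.delivered # events sent once the link recovered
```

Here the first three batches are dropped after waits of 0.5, 1.0 and
2.0 seconds, and only the last two batches are delivered.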
