Make StatsRelay detect if StatsD Daemons are Alive #2

jjneely · 2015-04-17T14:09:47Z

The current code base does nothing to detect or react to StatsD daemons that are not alive. The UDP StatsD protocol is designed to be fire-and-forget and offers no way to detect if the other side has received the packet.

StatsD daemons have a TCP administrative interface that's probably very useful for checking if the process is alive. That may be of help with this issue.
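A minimal sketch of such a liveness probe in Go. It assumes the admin interface answers a `health` command with a line reading `health: up`, as the Etsy StatsD management interface is understood to do; that command and reply are assumptions to verify against the daemon in use. The names `statsdAlive`, `healthUp`, and `probeTimeout` are hypothetical and not part of StatsRelay.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// probeTimeout bounds the whole probe: connect, write, and read.
// The value is an arbitrary choice for this sketch.
const probeTimeout = 2 * time.Second

// healthUp reports whether a reply line from the admin port means "up".
// Assumption: a healthy daemon answers "health: up".
func healthUp(reply string) bool {
	return strings.TrimSpace(reply) == "health: up"
}

// statsdAlive connects to the admin address over TCP, sends "health",
// and reads a single reply line. Any error counts as "not alive".
func statsdAlive(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	defer conn.Close()

	// One deadline covers both the write and the read below.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return false
	}
	if _, err := fmt.Fprint(conn, "health\n"); err != nil {
		return false
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return false
	}
	return healthUp(reply)
}
```

A caller could run `statsdAlive("127.0.0.1:8126", probeTimeout)` for each backend. A plain connect check without the `health` exchange would also work if only process liveness matters.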

Things to think about:

How do we configure this? Command line representation? I'd rather not require a config file if possible -- although I'm not opposed to it.
What do we do with metrics destined for a down StatsD daemon? Buffer them? Probably redirect them to the next available daemon, since the timestamp is assigned when the packet is received and isn't carried in the packet. This may cause inconsistent data in upstream Graphite when/if multiple statsd daemons submit the same metric during hash-ring changes, but it is probably the least bad option.
What do we do when all statsd daemons are dead? Log loudly and drop packets?
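The redirect-then-drop behaviour discussed above could be sketched roughly as follows. This is not StatsRelay's actual hash ring: it uses a plain FNV hash modulo the backend count as a stand-in, and `backend`, `pick`, and `errAllDown` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"hash/fnv"
)

// errAllDown signals that no backend can take the metric, so the
// caller should log loudly and drop the packet.
var errAllDown = errors.New("all statsd daemons are down")

// backend is one StatsD daemon plus the result of its last health check.
type backend struct {
	addr  string
	alive bool
}

// pick hashes the metric name to a starting slot, then walks forward
// until it finds a live backend. The same metric always maps to the
// same daemon while that daemon stays up.
func pick(backends []backend, metric string) (string, error) {
	if len(backends) == 0 {
		return "", errAllDown
	}
	h := fnv.New32a()
	h.Write([]byte(metric)) // Write on a hash.Hash never returns an error
	start := int(h.Sum32() % uint32(len(backends)))

	for i := 0; i < len(backends); i++ {
		b := backends[(start+i)%len(backends)]
		if b.alive {
			return b.addr, nil
		}
	}
	return "", errAllDown
}
```

Because only the metrics owned by the dead daemon move to a neighbour, the other metrics keep their original destination. The inconsistency window is limited to the moments when a daemon's state flips.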

justdaver · 2015-05-12T14:00:07Z

Initially I was thinking about using something like mon to periodically check whether the statsd backends are up (a port check against the statsd admin port?). If mon detects that a statsd host is down, it would restart the statsrelay daemon(s) with that host left out; when the host comes back online, it would restart the statsrelay daemon(s) again with the host included. That said, I really like your idea of building this kind of functionality into statsrelay.

Some ideas/thoughts/2c from my side:

Check every 30 seconds (-t 30 or -time=30) against the statsd admin port (-a 8126 or -adminport=8126) to test if a statsd host is up/down
If a host is down for long periods, buffering metrics probably won't work so well, so rather redirect as you mention. If the downed host is removed from the hash table completely, I'm not sure we'd run into the inconsistent data issues you mention. Wouldn't statsrelay still only redirect your metrics to a single statsd daemon? Would have to test this.
On second thought, perhaps buffering metrics could work with a limit option, e.g. buffer the last 50k lines. Logging to an error log would be great too.
Would I be crazy to suggest a similar admin-type port with functionality like the statsd daemon's, or would that be too much? Adding/removing statsd hosts on the fly could be useful for automation and scripting purposes...
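The command-line options floated above could be parsed with Go's standard `flag` package, roughly as below. The flag names `-time` and `-adminport` and their defaults come from the suggestion in this comment; they are proposals, not flags that StatsRelay is known to have. `checkConfig` and `parseCheckFlags` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// checkConfig holds the proposed health-check settings.
type checkConfig struct {
	interval  time.Duration // how often to probe each statsd host
	adminPort int           // statsd admin TCP port to probe
}

// parseCheckFlags parses the proposed flags from args (without the
// program name) and applies the defaults suggested in the thread.
func parseCheckFlags(args []string) (checkConfig, error) {
	fs := flag.NewFlagSet("statsrelay", flag.ContinueOnError)
	secs := fs.Int("time", 30, "seconds between health checks")
	port := fs.Int("adminport", 8126, "statsd admin TCP port")
	if err := fs.Parse(args); err != nil {
		return checkConfig{}, err
	}
	if *secs <= 0 {
		return checkConfig{}, fmt.Errorf("-time must be positive, got %d", *secs)
	}
	return checkConfig{
		interval:  time.Duration(*secs) * time.Second,
		adminPort: *port,
	}, nil
}
```

Calling `parseCheckFlags(os.Args[1:])` from `main` would keep everything on the command line, with no config file required.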

Unfortunately I am not much of a programmer and my coding kung fu is very weak, but I will help out as much as possible on the testing side of things!

denen99 · 2015-06-23T00:29:06Z

I would suggest creating a fixed-size memory buffer that just gets overwritten. I would also couple that with some sort of timeout. So buffer X MB of metrics, for Y seconds. Y would be the TTL before you remove the node from the ring and just start sending metrics to another node (as noted above, the least bad situation). When the node comes back up, flush the buffer to the previously used node and add the now-up node back to the ring.
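A rough sketch of that buffer in Go, under a few assumptions: capacity is counted in lines rather than megabytes to keep the example short, the TTL clock starts when the first line is buffered, and the type is not safe for concurrent use. `lineBuffer` and its methods are hypothetical names, not existing StatsRelay code.

```go
package main

import "time"

// lineBuffer keeps the most recent metric lines for one down backend.
// When full, each new line overwrites the oldest one.
type lineBuffer struct {
	lines   []string      // fixed-size ring storage
	head    int           // index of the oldest buffered line
	count   int           // number of lines currently buffered
	firstAt time.Time     // when buffering for this outage began
	ttl     time.Duration // how long to buffer before giving up
}

func newLineBuffer(size int, ttl time.Duration) *lineBuffer {
	return &lineBuffer{lines: make([]string, size), ttl: ttl}
}

// add stores a line, overwriting the oldest entry once the buffer is full.
func (b *lineBuffer) add(line string) {
	if len(b.lines) == 0 {
		return // zero capacity: nothing can be kept
	}
	if b.count == 0 {
		b.firstAt = time.Now()
	}
	if b.count == len(b.lines) {
		b.lines[b.head] = line
		b.head = (b.head + 1) % len(b.lines)
		return
	}
	b.lines[(b.head+b.count)%len(b.lines)] = line
	b.count++
}

// expired reports whether the TTL has passed since buffering began,
// i.e. the node should be removed from the ring.
func (b *lineBuffer) expired() bool {
	return b.count > 0 && time.Since(b.firstAt) >= b.ttl
}

// drain returns the buffered lines, oldest first, and empties the buffer
// so they can be flushed to the node once it is back up.
func (b *lineBuffer) drain() []string {
	out := make([]string, 0, b.count)
	for i := 0; i < b.count; i++ {
		out = append(out, b.lines[(b.head+i)%len(b.lines)])
	}
	b.head, b.count = 0, 0
	return out
}
```

While `expired()` is false the relay would keep calling `add`. Once it turns true the node leaves the ring. When the health check succeeds again, `drain()` yields whatever survived for flushing.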

jjneely mentioned this issue Apr 17, 2015

Multiple StatsRelay daemons behind LVS #1

Open
