Multiple StatsRelay daemons behind LVS #1
Dave, that is exactly my use case. Provided that you invoke statsrelay with the same list of hash ring members, in the same order, on each machine, you will get identical behavior from each daemon. So no matter which daemon handles a stat, it will always go to the same StatsD. Jack
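To illustrate the point, here is a simplified sketch (not statsrelay's actual hashing code; backend names are made up): any two daemons that derive the routing purely from the metric name and the same ordered backend list must agree on where every metric goes.

```python
# Sketch: two relay daemons built from the same ordered member list
# route any given metric name to the same StatsD backend. This uses a
# simple modulo mapping as a stand-in for statsrelay's hash ring; the
# point is only that the mapping is a pure function of the metric name
# and the shared backend list.
import hashlib

def pick_backend(metric_name, backends):
    """Deterministically map a metric name to one backend."""
    digest = hashlib.md5(metric_name.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

backends = ["statsd1:8125", "statsd2:8125"]  # hypothetical hosts

# "Daemon A" and "daemon B" started with identical arguments:
daemon_a = lambda name: pick_backend(name, list(backends))
daemon_b = lambda name: pick_backend(name, list(backends))

for name in ["app.requests", "app.errors", "db.queries"]:
    # Whichever daemon the load balancer picks, the stat lands on
    # the same StatsD instance.
    assert daemon_a(name) == daemon_b(name)
```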
@jjneely you might want to put a version of what you wrote above in the readme.
Thanks Jack, this is exactly what I've been searching for. I'm going to run a few tests today to see how far I can push it and will report back with some results. This may be unrelated, but if I configure statsrelay with two backend hosts, say A and B, and A goes offline, what happens to the metrics that were meant to go to A? Would they be queued up, or would they be directed to the next backend host, B?
Benchmarks / feedback would be fantastic. I've had fairly reasonable performance with 200,000 metrics/sec at one statsrelay daemon using a load generator: https://github.com/octo/statsd-tg

The UDP protocol that StatsD uses is specifically designed not to have any feedback, so using that protocol alone, statsrelay can't tell whether the statsd daemon it's talking to is alive. (Nagios steps in here for me.) StatsD does have a TCP port for some administrative commands, and I imagine the best use of that port is to monitor the process remotely (or, in this case, from statsrelay). That should be its own issue, I think, as that code hasn't been written yet. In short, the current code base can't tell whether the StatsD daemons are alive and will just continue to send traffic in their direction.
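The lack of feedback is visible at the socket level. A minimal sketch (the port is an arbitrary assumption, chosen so nothing is likely listening on it):

```python
# Why statsrelay can't detect a dead StatsD: the StatsD wire protocol
# is plain UDP, which is fire-and-forget. sendto() succeeds locally
# whether or not anything is listening on the far end.
import socket

def send_metric(sock, host, port, metric):
    # No response is expected and none is read; a down backend looks
    # exactly like a healthy one from the sender's point of view.
    return sock.sendto(metric.encode(), (host, port))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# 18125 is a hypothetical port with (most likely) no listener;
# the send still "succeeds" and reports all bytes written.
sent = send_metric(sock, "127.0.0.1", 18125, "app.requests:1|c")
sock.close()
```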
Issue #2 created. |
Jack - my apologies for the late reply; unfortunately I was sidetracked with work and studies just as I was really getting into testing this. Thanks for the heads-up about statsd-tg, it works great. I've run into a few small issues with our tests but will send feedback as soon as I can trust our test results :)
Excellent! I look forward to your feedback.
OK, so far the tests we're running are looking great. I'm running multiple statsrelay daemons behind an LVS load balancer. The statsrelay daemons are each running with the same options, e.g.:
On the statsd servers (statsd1/10.0.0.10 and statsd2/10.0.0.20) I can confirm metrics are going to both, and no metric namespace is received by both of them - brilliant :)

One thing I have noticed, however, is that the distribution of metrics across the backend statsd servers is not what I expected, most likely because I am still trying to get my head around how consistent hashing works. Would running statsrelay with the options above distribute, say, 50% of metrics to statsd1 and the other 50% to statsd2? From my tests I'm seeing roughly 20% of metrics hit statsd1 and the remaining 80% hit statsd2. How would I run statsrelay to get a 50/50 split in metrics to the backend statsd hosts? Thanks!
Very carefully choosing the hostnames of the statsd locations is the only way to get an even balance across two statsd daemons. The hash ring works by setting up a ring buffer with 64k slots (or some other power of two). Each host/port is hashed to figure out where on the ring it lives. When a metric name comes in, that name is also hashed to find where it would fall in the ring buffer. The algorithm then finds the closest slot occupied by a host/port, and the metric is sent to that host. So most likely the two slots that represent your hosts are fairly close together on the ring, which gives you an uneven distribution. The more statsd hosts you have in the ring, the more even the distribution will be.

I run several statsrelay daemons under LVS, and the ring includes 18 statsd daemons. At last check I'm doing about 500,000 statsd metrics per second with room to spare. Although I haven't looked closely at how even the distribution is... I normally track the CPU usage of the nodejs code.
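A toy version of the scheme described above makes the uneven split easy to see (this is a sketch, not statsrelay's implementation; hostnames are made up):

```python
# Toy consistent-hash ring: hash each host into a 64k-slot ring, hash
# each metric name, and walk forward to the nearest host slot. Host
# placement on the ring is effectively arbitrary, so with only two
# hosts the two arcs between them - and hence the traffic split - can
# be very uneven.
import hashlib

RING_SIZE = 2 ** 16  # 64k slots

def slot(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def build_ring(hosts):
    # List of (slot, host) pairs, sorted by position on the ring.
    return sorted((slot(h), h) for h in hosts)

def route(ring, metric):
    s = slot(metric)
    for host_slot, host in ring:
        if host_slot >= s:
            return host
    return ring[0][1]  # wrap around past the last host slot

ring = build_ring(["statsd1:8125", "statsd2:8125"])
counts = {h: 0 for _, h in ring}
for i in range(10000):
    counts[route(ring, f"metric.{i}")] += 1
print(counts)  # the split mirrors the gap between the two host slots
```

Changing a hostname moves its slot, which is why carefully chosen names (or more hosts in the ring) even out the split.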
Thanks Jack, much appreciated. I managed to get an almost even balance by altering the hostnames a bit, as you mentioned. At the moment I'm running 6 statsrelay daemons across 2 servers, each doing about 45,000 metrics a second (270k/sec total). I'm a bit confused about the numbers reported by the statsProcessed metric statsrelay reports - I'm seeing figures of about 2.3-2.8 million for statsProcessed. Is that total metrics processed per minute?
StatsRelay sends a counter metric into StatsD to report metrics processed. If your StatsD flush interval is set to 60 seconds, then that's probably right: you are seeing metrics processed per minute. This is what I do in Graphite:
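The actual Graphite snippet did not survive this copy of the thread. As a hypothetical stand-in (the metric path is an assumption), a target that normalizes a per-flush-interval count into a per-second rate could look like:

```
# Hypothetical: normalize the per-minute statsProcessed count to per second
scaleToSeconds(stats_counts.statsrelay.statsProcessed, 1)
```

Graphite's `scaleToSeconds(series, 1)` divides each datapoint by the number of seconds in its reporting interval, turning a per-minute count into a per-second rate.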
Ah OK, that makes sense - thanks for clearing that up :) Here are my final test results & info. I fired off some test metrics using statsd-tg (https://github.com/octo/statsd-tg) with the following options:
I'm making use of two load balancers: a Citrix NetScaler, and LVS on RHEL 7. Metrics first hit the NetScaler, which evenly distributes them to 2 RHEL 7 servers (servers 1 & 2). The VS on the NetScaler is configured as roundrobin / sessionless / persistence:none. The LVS configuration of servers 1 & 2 is (server 1):
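The ipvsadm commands themselves were lost from this copy. A hypothetical LVS setup in the same spirit (the VIP address is an assumption; the ports match the three statsrelay daemons described in the thread) might look like:

```shell
# Hypothetical ipvsadm setup: a UDP virtual service (-u), round-robin
# scheduler (-s rr), forwarding to three local statsrelay daemons via
# NAT/masquerading (-m).
ipvsadm -A -u 10.0.0.5:8125 -s rr
ipvsadm -a -u 10.0.0.5:8125 -r 127.0.0.1:12001 -m
ipvsadm -a -u 10.0.0.5:8125 -r 127.0.0.1:12002 -m
ipvsadm -a -u 10.0.0.5:8125 -r 127.0.0.1:12003 -m
```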
Metrics are evenly passed to 3 statsrelay daemons using roundrobin on UDP ports 12001/2/3. The 3 statsrelay daemons are running with the following options (server 1):
I tested a few different instance values when specifying the statsd hosts and eventually settled on :100 and :400. This seemed to give me the closest to a 50/50 distribution of metrics - more like 40/60, but that's good enough for me. For the statsd servers (servers 3 & 4) I'm running statsd.net (https://github.com/lukevenediger/statsd.net).

The statsrelay servers (servers 1 & 2) are VMs running RHEL 7 with 4GB memory and 4 CPUs. The statsd servers (servers 3 & 4) are running Win Server 2012 R2 Std with 4GB memory and 2 CPUs. Combined metrics processed per second by both statsrelay servers (all statsrelay daemons): 270k/s (135k/s each).

I'm also using collectd to report some basic UDP stats for LVS on the statsrelay servers (the ipvs module is awesome, btw) and can see an average of about 3000 UDP InDatagrams/sec and 200 UDP InErrors/sec reported. I'm still investigating the cause of the errors and am not sure yet whether this is something I should really be worried about. All in all statsrelay is doing a great job, thanks again Jack :)
One thing I forgot to mention: while there is no built-in fault tolerance in statsrelay, I put together a basic bash script, run every 10 minutes, that checks whether the statsd hosts are up. If one of them is down and stops responding, the script kills all statsrelay processes and fires them up again leaving out the failed statsd host, so that statsrelay does not attempt to send metrics to it. So far this works, but it's definitely an area that could be improved in many ways.
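A sketch of this kind of liveness check, using StatsD's TCP admin interface (port 8126 by default): send `health` and expect a reply containing "up". The host/port and the restart action are assumptions, and the author's version is a bash script; this is just the idea in outline.

```python
# Liveness check against StatsD's TCP admin port. The admin interface
# accepts text commands; "health" replies with a line like "health: up".
import socket

def parse_health(reply: bytes) -> bool:
    # Treat any reply mentioning "up" as healthy.
    return b"up" in reply.lower()

def statsd_is_up(host, port=8126, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"health\n")
            return parse_health(s.recv(64))
    except OSError:
        # Connection refused / timed out: treat the backend as down.
        return False

# A supervisor loop would call statsd_is_up() for each backend and
# restart statsrelay without the dead hosts, as the author's cron
# script does.
```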
Out of curiosity, @jjneely, what's the advantage of running multiple statsrelay processes on a single server?
I should use some of these awesome diagrams for the actual documentation! Thanks! 300k/s matches my own benchmarks. I've run it hotter, but the UDP drops start to become unmanageable, and that's what I'd like to avoid. I have pondered a rewrite in C, which could handle more throughput (similar to statsite), but I haven't quite gotten around to my next round of StatsD scaling.
It was a year ago that I evaluated the idea of relaying statsd to a central place to cut CPU usage on EC2 instances, improve percentile/histogram calculation, and simplify maintenance. I think 200k-300k/sec is high enough, because in most setups statsrelay will live on a single instance / k8s Pod alongside a single app, and in 99% of cases that app will generate at most tens of thousands of metrics per second - there statsrelay is perfect. Multiple statsrelays, feeding only a couple of statsites/telegrafs at the end. The main feature we need now is buffering in memory and/or on disk, with backfill when the remote recovers; it is discussed in #2. Maybe soon, when I get some spare time, I will try to run a new series of tests.
Hey Jack,
Thanks for releasing statsrelay, this is a great tool! I'd like to run multiple statsrelay daemons behind a simple UDP load balancer (LVS in roundrobin mode) and was wondering: would the statsrelay daemons be able to share the same hash table? Would they each send the same metrics to the same backend server, or would they each maintain their own hash table?
Thanks!
Dave