rabbitmq_node_up reports itself up ***and*** the other cluster nodes #70

Open
expanderbolt opened this issue Jan 23, 2019 · 3 comments

Comments

@expanderbolt

expanderbolt commented Jan 23, 2019

I get twice the expected number of metrics because the exporter counts every node twice (each node reports its cluster peer too).
Why?
Example:

root@xxx: curl http://localhost:15672/api/metrics|grep node_up
#HELP rabbitmq_node_up Node runnning status
rabbitmq_node_up{name="[email protected]",type="disc"} 1
rabbitmq_node_up{name="[email protected]",type="disc"} 1

The nodes are clustered with https://www.rabbitmq.com/cluster-formation.html#peer-discovery-aws

@michaelklishin
Contributor

There isn't enough information to tell for sure, but my best guess is that when every node is queried and reports that all of its peers are up, the counters are added instead of being treated as a boolean gauge.
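
To make the difference concrete, here is a minimal sketch of summing versus treating the value as a boolean gauge. The node names and the sample tuples are made up for illustration and are not taken from the plugin's output; taking the maximum is only one possible way to collapse duplicates, not necessarily what the exporter or any dashboard does.

```python
from collections import defaultdict

# Hypothetical samples: (node that was scraped, `name` label, value).
# In a 2-node cluster, each node reports itself and its peer.
samples = [
    ("rabbit@node-a", "rabbit@node-a", 1),
    ("rabbit@node-a", "rabbit@node-b", 1),
    ("rabbit@node-b", "rabbit@node-a", 1),
    ("rabbit@node-b", "rabbit@node-b", 1),
]


def summed(samples):
    """Add the values per `name` label, which double counts each node."""
    totals = defaultdict(int)
    for _scraped_from, name, value in samples:
        totals[name] += value
    return dict(totals)


def as_boolean_gauge(samples):
    """Collapse duplicates per `name` label: up if any scraped node says so."""
    up = defaultdict(int)
    for _scraped_from, name, value in samples:
        up[name] = max(up[name], value)
    return dict(up)


print(summed(samples))
print(as_boolean_gauge(samples))
```

With these made-up samples, summing yields 2 per node, while the gauge view yields 1 per node.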

@BoemmLA

BoemmLA commented Mar 4, 2019

It seems that rabbitmq_node_up reports the nodes reachable from each node, as seen by that node ...
So in a 3-node cluster, each node reports itself and the 2 other nodes, which adds up to 9 rabbitmq_node_up metrics.
Well, that can be used to show something like
nodeXX can reach X nodes ...

It's a bit confusing if you run RabbitMQ as a cluster; this case does not seem well handled, I would say.
Try mapping the IP of the server you query against the name ... that should work.

What I especially miss is simply the cluster name of the whole RabbitMQ cluster, to be able to say:
Cluster is up with X nodes ...
It seems this info is not exported by the plugin ...
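
A rough sketch of the "nodeXX can reach X nodes" idea, assuming each sample can be tagged with the node it was scraped from. That observer field is an assumption for illustration (the plugin output shown earlier only carries `name` and `type`), and the node names are invented.

```python
from collections import Counter
from itertools import product

# Hypothetical 3-node cluster; every node reports itself and both peers as up.
nodes = ["rabbit@node-1", "rabbit@node-2", "rabbit@node-3"]

# (observer, peer, value): 3 observers x 3 peers = 9 samples in total.
samples = [(observer, peer, 1) for observer, peer in product(nodes, repeat=2)]


def reachable_per_observer(samples):
    """Count, per scraped node, how many nodes it reports as up."""
    counts = Counter()
    for observer, _peer, value in samples:
        counts[observer] += value
    return dict(counts)


print(len(samples))
print(reachable_per_observer(samples))
```

Grouping by observer keeps the 9 samples meaningful instead of collapsing them into one misleading total.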

@michaelklishin
Contributor

We recently discussed this and concluded that there are significant benefits to collecting data from a single node and then aggregating at "display time". This should be covered in the docs now that #73 was merged.

@BoemmLA I'm sorry, but I think you are greatly oversimplifying how distributed systems fail. If a cluster has 3 nodes but A cannot reach C and C cannot reach B while all other links are up, how many nodes does that cluster have? So this is a very convenient and very misleading metric. To some extent it is clarified by another one, the number of reported partitions in the cluster. Then we monitor not only the number of vertices in the graph but also the number of problematic edges.
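
The three-node example can be sketched as a small graph. This assumes the failed links are directional ("A cannot reach C" says nothing about C reaching A), which is one reading of the scenario; the data structures are illustrative and do not reflect how RabbitMQ detects partitions.

```python
# Hypothetical cluster from the example: nodes A, B and C.
nodes = ["A", "B", "C"]

# Directed links that are down: A cannot reach C, C cannot reach B.
down_links = {("A", "C"), ("C", "B")}


def view(observer):
    """Nodes that `observer` considers up: itself plus every peer it can reach."""
    reachable = {
        peer
        for peer in nodes
        if peer != observer and (observer, peer) not in down_links
    }
    return {observer} | reachable


views = {node: view(node) for node in nodes}
node_counts = {node: len(seen) for node, seen in views.items()}
problem_edges = len(down_links)

print(node_counts)
print(problem_edges)
```

Under these assumptions A and C each count 2 nodes while B counts 3, so no single "cluster is up with X nodes" figure exists; the count of problematic edges (2 here) is what exposes the disagreement.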
