Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] How to deal in handlers with check exit status out of bounds #127

Open
kali-brandwatch opened this issue Apr 11, 2019 · 4 comments
Open

Comments

@kali-brandwatch
Copy link

kali-brandwatch commented Apr 11, 2019

I recently launched a PR for handler-slack-multichannel.rb (see here) in which I suggested changing the approach on how to deal with exit codes from checks that are out of bounds (out of the [0-3] range).

In the discussion that started in the PR I suggested following what (at least in my experience) has been a standard from nagios conventions into pretty much anything I used that follows the nagios path (including icinga 1&2 and sensu), that when a check script returns a non [0-3] exit code it should not be masked as UNKNOWN status.

The logic behind this approach is the following:

  • typically the person who coded the script will add a specific exit status code 3 for scenarios in which the check can not assess the state of the service or resource monitored, meaning the code is purposely catching a result.
  • check scripts however die ungracefully due to some non expected reasons, and therefore it will return whatever exit code its execution might produce.
  • common causes for such non-standard (in my experience) often include:
    • check script depends on sudo privileges, but the relevant sudoers entry doesn't exist/has been removed
    • check script has wrong permissions (wrong ownership, wrong chmod)
    • check script is trying to read from a file that does not exist / has no permissions to
    • check script runs some command (e.g. curl) that exits in error code and is not caught
    • syntax errors introduced in the check script
    • in distributed config management systems (puppet et.al.) a distributed working team might introduce one of the above accidentally breaking a check's functionality

If we follow the convention of returning only standard [0-3] status codes, we might be masking any of the above under an unknown error message.

In my personal experience it has always been useful to be able to see the explicit error code and report it to the operator, since this helps identifying the issue faster and clearer.

Alternatively, I would have this as a configurable option that would allow handlers to set an override for any non-standard status code as whatever we prefer (some people would like to have it as UNKNOWN but I have also been in scenarios where it was preferred to report such failures as CRITICAL).

I would like to hear more opinions regarding this.

@majormoses
Copy link
Member

I have been thinking about this, I want you to know I have not been ignoring this I just wanted to gather some thoughts and allow time for some other people to weigh in. It's been a number of years since I have played with nagios so my memory is a bit hazy. I can't find any documentation on their definition of "Out of Bounds". If you could point me to that it would be helpful. I only found sparse references in the cpp code (which its been a while since I have written):

Generally when I think "Out of Bounds" I think in context of running out of a resource, going beyond the index of an array, etc. To me this translates to a bash 255 which indicates that a process tried to return a status code that was greater than 255.

After reading through this section of code it becomes a bit more clear as to its intent. I am not opposed to creating a specific library method to map status exit codes to more meaningful error messages. I think I could be talked into supporting "Out of Bounds" on the whole but I would want to see others weigh in on here as well.

@jspaleta
Copy link
Contributor

Naively...it could seem useful to map error code to whatever errno thinks the status means in some handlers.. as errno defines a defacto standard for errors messages.

errno -ls

We have to be careful here...what the return code could mean maybe sensu client/agent specific..and maybe different than what the sensu server/backend understands as standard errors codes...

But to do this right, we'd have to be able to instruct the handler as to which error code override mapping to use.

@mcbsdEU
Copy link

mcbsdEU commented Apr 30, 2019

Hi,

"If we follow the convention of returning only standard [0-3] status codes, we might be masking any of the above under an unknown error message."

My opinion:
The error codes 0-2 should be reserved for mapping to event severity. Any failure (as listed examples in description of this issue) should be handled sideways (some logging, sentry, ...).
I prefer to exit with 1,2 with explicit exception rewritten to human understandable output. Then it is easy to run the plugin with debug option or look into logging mechanism.

@majormoses
Copy link
Member

I have thought more about this and I think I am leaning back towards my initial stance that sensu should concern itself with error code 0-2 as @mcbsdEU and @jspaleta suggests as these indicate event severity (not exit codes themselves even if that is how we determine severity) and everything else should be a an unknown. OOB is essentially the same thing as unknown as it is an unknown status code which is defined as either negative exit codes or < 2. While we often attribute the exit code 3 to unknown it really means OOB from 0-2. While it is tempting to attempt to determine what the status code means the truth is that it requires a lot of understanding of:

  • the system its running on
  • how the check is executed
  • what that status code may mean to the specific application
  • is the error coming from the method of execution (shell) or the kernel

It's very easy to say "oh a 127 means command/file not found". But is that really the case on all platforms, shells, and methods of execution? While everything I maintain is linux and as a consequence sometimes do not consider the various behaviors on other platforms we need to be cautious of this trap. From a sensu server standpoint it can't know what it means as it is simply responding to information that the client has supplied. The client simply captures the input, output, exit code, and message and forwards this to the server. The message, knowledge of the platform/system, exit code, and method of execution is allows us to determine the meaning of the alert and how to respond. As skilled practitioners we often do not realize how much we intrinsically know about a system because we are so familiar with it. Unless we want to build a very comprehensive system that inspects all of these in (near) real time which would incur significant performance penalties and deep understanding of every system I think we should stick to having sensu concern itself with alert severity rather than trying to guess what it means.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants