Improve alert content and adjust their routes #11911

vkuznet · 2024-02-23T13:16:14Z

Impact of the new feature
Simplify debugging process during shift operations.

Is your feature request related to a problem? Please describe.
On MM I got a message about a transfer failure in CouchDB. There are two issues with it:

- the alert is routed to specific email addresses rather than to a common e-group or MM channel
  - the alert was tagged as wmcore, which routes it to the dmwm-admins receiver, which has hard-coded emails
  - https://gitlab.cern.ch/cmsmonitoring/cmsmon-configs/-/blob/master/alertmanager/alertmanager.yaml?ref_type=heads#L54
  - https://gitlab.cern.ch/cmsmonitoring/cmsmon-configs/-/blob/master/alertmanager/alertmanager.yaml?ref_type=heads#L260
- the alert was initiated from the MSTransferor.py codebase, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSTransferor/MSTransferor.py#L228, and should produce a matching warning message in the logs, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSTransferor/MSTransferor.py#L554, but according to our logs it is not there, i.e. grep "failed request due to error posting" /cephfs/product/dmwm-logs/*ms-transferor* on vocms0750 does not reveal anything, which makes it very hard to debug the issue. Upon further investigation I found that this code has double spaces in the alert description, which are stripped in the AM message; this explains why I initially could not find a match in the logs. The logs contain the following: Workflow: cmsunified_task_HIG-Run3Summer23BPixwmLHEGS-00402__v1_T_240222_192602_7720, failed request due to error posting to CouchDB (the log line has double spaces at "request due", while the alert itself does not).
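Until the alert and log texts match, a log search can be made tolerant of whitespace runs. The following is a minimal sketch, assuming the only difference between the two texts is the amount of whitespace between words. The helper name `whitespace_tolerant_pattern`, the sample workflow name, and the exact position of the double space are illustrative assumptions, not taken from WMCore.

```python
import re


def whitespace_tolerant_pattern(text):
    """Compile a regex matching `text` with any run of whitespace between words."""
    words = text.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words))


# Text as shown in the AlertManager message (single spaces).
alert_text = "failed request due to error posting to CouchDB"
# Text as it might appear in the log (double space position is assumed).
log_line = "Workflow: example_workflow, failed  request due to error posting to CouchDB"

pattern = whitespace_tolerant_pattern(alert_text)
print(bool(pattern.search(log_line)))
```

A plain substring search for `alert_text` misses this log line, while the compiled pattern finds it. On the command line, a similar effect can be had with an extended regular expression, e.g. `grep -E "failed +request +due"`.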

Describe the solution you'd like
I suggest a few possible improvements:

- change the hard-coded emails to a Mattermost channel and/or an e-group
- improve the documentation of WMCore alerts to include the location of alerts embedded in the WMCore codebase
- fix the logging message so that it appears identically in both alerts and logs
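One way to approach the last point is to build the description once, normalize its whitespace, and pass the same string to both the logger and the alert call. This is a minimal sketch under that assumption; `normalize_spaces`, `report_failure`, and the `send_alert` callable are hypothetical names and do not reflect the actual MSTransferor or AlertManagerAPI code.

```python
import logging
import re

logger = logging.getLogger("ms-transferor-example")


def normalize_spaces(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def report_failure(workflow, send_alert):
    """Log and alert with the same normalized description.

    `send_alert` is a hypothetical callable standing in for the real alert call.
    """
    description = normalize_spaces(
        f"Workflow: {workflow}, failed  request due to error posting to CouchDB"
    )
    logger.warning(description)
    send_alert(description)
    return description
```

Because both sinks receive the identical string, text copied from an alert can be grepped verbatim in the logs.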

Describe alternatives you've considered
Leave as is and struggle with debugging.

Additional context
WMCore documentation about alerts and AlertManager configuration:


mapellidario · 2025-01-22T16:42:17Z

I have a couple of suggestions that relate more to the issue title than to the issue description :)

- include the cmsweb instance where the pod is running in the alert content, so that there is no need to cross-check the pod name against the output of kubectl get pods on all the clusters. One option could be to use the content of BASE_URL from the microservice configuration, used for example in data.reqmgr2Url.
- add a config switch to turn off all notifications from a microservice, for all microservices and all alerts; see for example how it is used for msoutput, code and config
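Both suggestions can be sketched on the caller side without touching the alert sender itself. The example below is only an illustration under stated assumptions: the function names, the `send_alerts` switch, the `send_alert` callable, and the placeholder URL are hypothetical and not part of the WMCore configuration or API.

```python
from urllib.parse import urlparse


def instance_from_base_url(base_url):
    """Return the host name of a base URL, used here to identify the instance."""
    return urlparse(base_url).hostname or "unknown"


def build_description(description, base_url):
    """Prefix the alert description with the instance the service runs on."""
    return f"[{instance_from_base_url(base_url)}] {description}"


def send_if_enabled(description, send_alerts, send_alert):
    """Send the alert only when the (hypothetical) `send_alerts` switch is on."""
    if not send_alerts:
        return False
    send_alert(description)
    return True


sent = []
text = build_description("failed request due to error posting to CouchDB",
                         "https://cmsweb.example.org/reqmgr2")
send_if_enabled(text, send_alerts=False, send_alert=sent.append)  # suppressed
send_if_enabled(text, send_alerts=True, send_alert=sent.append)   # delivered
```

Passing the instance and the switch as plain values keeps the sender independent of any configuration object; the calling service decides what to read from its own configuration.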

vkuznet · 2025-01-22T18:48:47Z

@mapellidario, thanks for the suggestions. Even though I think they would be useful, they are not free and will require additional changes, in particular:

- to extract the Python configuration we need to adjust the AlertManagerAPI object to accept it in its constructor. This by itself adds a new dependency: AlertManagerAPI would always have to come with a WM configuration object, which I think would be a mistake, since the current code can be used without such a configuration. It would therefore make AlertManagerAPI dependent on WMCore.Configuration.
- the config switch you are talking about is already present in the WM code, and the proposed changes are fully encapsulated within the AlertManagerAPI object. Therefore there is no need to add it inside AlertManagerAPI, since the upstream code (MS and others) will handle the configuration properly.

That said, I'm not against adding this, but I would rather hear the opinion of @amaltaro and @todor-ivanov on this suggestion.

vkuznet added New Feature BUG MSTransferor Monitoring Documentation labels Feb 23, 2024

amaltaro added this to WMCore quarterly developments Jan 7, 2025

amaltaro moved this to ToDo in WMCore quarterly developments Jan 7, 2025

anpicci changed the title ~~Improve alerts content and adjust their routes~~ Improve alert content and adjust their routes Jan 19, 2025

vkuznet self-assigned this Jan 21, 2025

vkuznet linked a pull request Jan 21, 2025 that will close this issue

Adjust AlertManagerAPI to avoid using multiple spaces in various attibutes of alert #12237

Open
