Impact of the new feature
Simplify debugging process during shift operations.
Is your feature request related to a problem? Please describe.
On MM I got a message about a transfer failure in CouchDB. There are several issues with it:
the alert routes to specific email addresses and not to a common e-group or MM channel
the alert was tagged as wmcore, which routes it to the dmwm-admins receiver, which has hard-coded emails
the alert was initiated from the MSTransferor.py codebase, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSTransferor/MSTransferor.py#L228, and should be accompanied by a proper logging warning message, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSTransferor/MSTransferor.py#L554. According to our logs, though, it is not there: running grep "failed request due to error posting" /cephfs/product/dmwm-logs/*ms-transferor* on vocms0750 returns nothing, which makes the issue very hard to debug. Upon further investigation I found that the code has double spaces in the alert description, which are stripped off in the AM message; this explains why I initially could not find a match in the logs. The logs contain the following: Workflow: cmsunified_task_HIG-Run3Summer23BPixwmLHEGS-00402__v1_T_240222_192602_7720, failed request due to error posting to CouchDB (the log line has double spaces at "request due"), while the alert itself does not.
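Until the message is fixed, a whitespace-tolerant search can work around the mismatch between the alert text and the log text. The following is only a minimal sketch: the helper names are hypothetical and not part of WMCore, and it assumes the log lines have already been read into a list of strings.

```python
import re


def whitespace_tolerant_pattern(phrase):
    """Build a regex that matches `phrase` with any run of whitespace between words."""
    words = phrase.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words))


def find_matches(lines, phrase):
    """Return the lines containing `phrase`, ignoring differences in spacing."""
    pattern = whitespace_tolerant_pattern(phrase)
    return [line for line in lines if pattern.search(line)]
```

With this, a phrase copied from the alert (single spaces) still matches a log line that contains double spaces.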
Describe the solution you'd like
I suggest a few possible improvements:
change the hard-coded emails to a Mattermost channel and/or an e-group
improve the documentation of WMCore alerts to include the location of alerts embedded in the WMCore codebase
fix the logging message so that it appears identically in both alerts and logs
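One way to approach the last point is to build the message once, collapse repeated whitespace, and pass the same string to both the logger and the alert. This is a sketch under assumptions, not the MSTransferor code: `report_failure` and `normalize` are hypothetical helpers, and `send_alert` stands in for whatever callable actually submits the alert.

```python
import logging


def normalize(text):
    """Collapse any run of whitespace into a single space."""
    return " ".join(text.split())


def report_failure(logger, send_alert, workflow, error):
    """Log and alert with the exact same, whitespace-normalized message."""
    # The template may contain accidental double spaces; normalize() removes them.
    msg = normalize(
        f"Workflow: {workflow}, failed  request due to error posting to CouchDB: {error}"
    )
    logger.error(msg)
    send_alert(msg)  # hypothetical callable, e.g. a thin wrapper around the alert API
    return msg
```

Because both sinks receive the same string, text copied from the alert can be grepped in the logs verbatim.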
Describe alternatives you've considered
Leave as is and struggle with debugging.
Additional context
WMCore documentation about alerts and AlertManager configuration:
I have a couple of suggestions that relate more to the issue title than to the issue description :)
include the cmsweb instance where the pod is running in the alert content, so that there is no need to cross-check the pod name against the output of kubectl get pods on all the clusters. One option could be to use the content of BASE_URL from the microservice configuration, used for example in data.reqmgr2Url.
add a config switch to turn off all notifications from a microservice, for all microservices and all alerts; see for example how it is used for msoutput (code and config)
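To make the two suggestions concrete, here is a rough sketch of how the calling code could apply them before handing anything to the alert API. Everything in it is an assumption for illustration: the configuration is modelled as a plain dict, and the `sendNotification` key and the `build_alert` helper are hypothetical names; only `reqmgr2Url` is taken from the comment above.

```python
from urllib.parse import urlparse


def build_alert(ms_config, summary, description):
    """Return an alert payload, or None when notifications are switched off."""
    # Hypothetical switch name; defaults to sending notifications.
    if not ms_config.get("sendNotification", True):
        return None
    # Derive the cmsweb instance from the service URL, e.g. data.reqmgr2Url.
    instance = urlparse(ms_config.get("reqmgr2Url", "")).hostname or "unknown"
    return {
        "summary": summary,
        "description": f"[{instance}] {description}",
        "instance": instance,
    }
```

Keeping this logic in the caller means the alert sender itself only ever sees plain strings.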
@mapellidario, thanks for the suggestions. Even though I think they would be useful, they are not free and will require additional changes, in particular:
to extract the Python configuration we need to adjust the AlertManagerAPI object to accept it in its constructor. This by itself adds a new dependency: AlertManagerAPI would always have to come with a WM configuration object, which I think would be a mistake, since the current code can be used without such a configuration. It would make AlertManagerAPI dependent on WMCore.Configuration.
The config switch you are talking about is already present in the WM code, and the proposed changes are fully encapsulated within the AlertManagerAPI object. Therefore there is no need to add the switch inside it, since the upstream code (MS and others) will handle the configuration properly.
That said, I'm not against adding this, but I would rather hear the opinion of @amaltaro and @todor-ivanov on this suggestion.