ECC memory correction counts are useful for predicting DIMM failures. Having a dedicated metric would be very useful, as well as a related alert (e.g. rate of correctable errors over $TIME above $THRESHOLD).
It looks like this could be provided in two ways:
- export the values in the memory controller subdirectory `/sys/devices/system/edac/mc/mc0/` directly
- read them from the rasdaemon DB
Note: I don't know if using rasdaemon would provide benefits over reading the values directly, given that in our case the data analysis would happen on the Prometheus side.
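To make the first option concrete, here is a minimal sketch of reading the counters straight from sysfs. It assumes the kernel's EDAC driver exposes `ce_count` (correctable) and `ue_count` (uncorrectable) files under each `mcN` directory. The metric name `edac_errors_total` and both function names are made up for illustration and are not an existing exporter API.

```python
from pathlib import Path

# Default EDAC location from the issue text; mc0, mc1, ... live below it.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"correctable": int, "uncorrectable": int}}.

    Controllers whose counter files are missing or unreadable are skipped.
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for label, filename in (("correctable", "ce_count"),
                                ("uncorrectable", "ue_count")):
            try:
                entry[label] = int((mc_dir / filename).read_text().strip())
            except (OSError, ValueError):
                # File absent, unreadable, or not a plain integer.
                continue
        if entry:
            counts[mc_dir.name] = entry
    return counts


def format_metrics(counts):
    """Render the counts as Prometheus text exposition lines.

    The metric name "edac_errors_total" is a hypothetical placeholder.
    """
    lines = []
    for controller, entry in sorted(counts.items()):
        for kind, value in sorted(entry.items()):
            lines.append(
                f'edac_errors_total{{controller="{controller}",type="{kind}"}} {value}'
            )
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_metrics(read_edac_counts()))
```

Because these are monotonically increasing counters, the "rate over $TIME above $THRESHOLD" alert could then be expressed entirely on the Prometheus side, which is the case for not needing rasdaemon's own analysis.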