ECC metrics #274

aieri · 2024-07-03T23:46:41Z

ECC memory correction counts are useful for predicting DIMM failures. A dedicated metric would be very helpful, along with a related alert (e.g. rate of correctable errors over $TIME above $THRESHOLD).

It looks like this could be provided in two ways:

- export values in the memory controller subdirectory /sys/devices/system/edac/mc/mc0/ directly
- export values from the rasdaemon DB. Note:
  - rasdaemon is available in universe
  - https://github.com/sanecz/prometheus-rasdaemon-exporter is an example of a rasdaemon-specific exporter
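As a rough illustration of the first option, here is a minimal sketch of reading the counters straight from sysfs and rendering them in the Prometheus text format. It assumes the standard EDAC `ce_count` and `ue_count` files are present under each `mcN` directory; the metric names `edac_correctable_errors_total` and `edac_uncorrectable_errors_total` and both function names are made up for the example, not taken from any existing exporter.

```python
from pathlib import Path

# Root of the EDAC memory controller tree mentioned above.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int, "ue_count": int}} for each mcN directory."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter_file = mc_dir / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text().strip())
        if entry:
            counts[mc_dir.name] = entry
    return counts


def to_prometheus_text(counts):
    """Render the counts as Prometheus text exposition lines (hypothetical metric names)."""
    lines = []
    for controller, entry in counts.items():
        for name, value in entry.items():
            kind = "correctable" if name == "ce_count" else "uncorrectable"
            lines.append(
                f'edac_{kind}_errors_total{{controller="{controller}"}} {value}\n'
            )
    return "".join(lines)


if __name__ == "__main__":
    print(to_prometheus_text(read_edac_counts()), end="")
```

On a machine without EDAC support the `mc` directory is missing or empty, in which case the sketch simply prints nothing.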

I don't know whether using rasdaemon would provide any benefit over reading the values directly, given that in our case the data analysis would happen on the Prometheus side.
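To make the alert condition concrete (rate of correctable errors over $TIME above $THRESHOLD), here is a small sketch of the comparison in plain Python. In practice this would presumably be an alerting rule evaluated by Prometheus rather than code in the exporter. The function names, the sample layout and the reset handling are all assumptions made for illustration.

```python
def correctable_error_rate(samples, window_seconds):
    """Errors per second over the trailing window.

    `samples` is a list of (unix_timestamp, ce_count) pairs sorted by time.
    Returns None when the window holds fewer than two samples.
    """
    if not samples:
        return None
    end_time = samples[-1][0]
    in_window = [(t, c) for t, c in samples if t >= end_time - window_seconds]
    if len(in_window) < 2:
        return None
    (t0, c0), (t1, c1) = in_window[0], in_window[-1]
    if t1 == t0:
        return None
    increase = c1 - c0
    if increase < 0:
        # Crude reset handling (assumption): the counter restarted from zero,
        # e.g. after a reboot, so count only what accumulated since then.
        increase = c1
    return increase / (t1 - t0)


def should_alert(samples, window_seconds, threshold_per_second):
    """True when the correctable error rate over the window exceeds the threshold."""
    rate = correctable_error_rate(samples, window_seconds)
    return rate is not None and rate > threshold_per_second
```

For example, with samples `[(0, 10), (60, 10), (120, 16)]` and a 120 second window, the counter grows by 6 over 120 seconds, so an alert would fire for any threshold below 0.05 errors per second.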

aieri added the "enhancement" (New feature or request) label on Jul 3, 2024
