Fix nvme attribute check-list when auto interface is given and device is nvme #97

Merged
1 commit merged on Mar 15, 2024

Conversation

@ymartin-ovh
Contributor Author

Got this on an NVMe device with -i auto (CHECK 3 evaluates no attributes, so the result is OK):

/usr/lib/nagios/ovh/check_smart -i auto -g /dev/nvme0 --debug
Found /dev/nvme0
###########################################################
(debug) CHECK 1: getting overall SMART health status for  
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Model Number:                       SAMSUNG MZVL2512HCJQ-00B07
 Serial Number:                      S63CNF0R415493
 Firmware Version:                   GXA7302Q
 PCI Vendor/Subsystem ID:            0x144d
 IEEE OUI Identifier:                0x002538
 Total NVM Capacity:                 512,110,190,592 [512 GB]
 Unallocated NVM Capacity:           0
 Controller ID:                      6
 Number of Namespaces:               1
 Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
 Namespace 1 Utilization:            462,648,926,208 [462 GB]
 Namespace 1 Formatted LBA Size:     512
 Namespace 1 IEEE EUI-64:            002538 b411b778d4
 Local Time is:                      Wed Mar  6 17:27:20 2024 UTC
 
 === START OF SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
 


(debug) parsing line:
Model Number:                       SAMSUNG MZVL2512HCJQ-00B07


(debug) found model:  SAMSUNG MZVL2512HCJQ-00B07

(debug) parsing line:
Serial Number:                      S63CNF0R415493


(debug) found serial number S63CNF0R415493

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/nvme0

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF SMART DATA SECTION ===
 SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
 Critical Warning:                   0x00
 Temperature:                        38 Celsius
 Available Spare:                    83%
 Available Spare Threshold:          10%
 Percentage Used:                    19%
 Data Units Read:                    83,833,423 [42.9 TB]
 Data Units Written:                 69,316,785 [35.4 TB]
 Host Read Commands:                 1,241,781,735
 Host Write Commands:                1,632,519,014
 Controller Busy Time:               36,946
 Power Cycles:                       40
 Power On Hours:                     48,708
 Unsafe Shutdowns:                   26
 Media and Data Integrity Errors:    114
 Error Information Log Entries:      114
 Warning  Comp. Temperature Time:    0
 Critical Comp. Temperature Time:    0
 Temperature Sensor 1:               38 Celsius
 Temperature Sensor 2:               48 Celsius
 


(debug) Raw Check List ATA: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:

(debug) gathered perfdata:


###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################


(debug) final status/output: OK
(debug) drives  ok: [/dev/nvme0] - Device is clean
(debug) drives nok: 
(debug)   msg_list: [/dev/nvme0] - Device is clean

OK: [/dev/nvme0] - Device is clean|
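
In the run above, CHECK 3 prints both raw check lists but never compares a single attribute against them. One way to avoid that is to choose the check list from the detected device type rather than from the `-i` argument alone. The following Python sketch illustrates the idea only; it is not the plugin's code or the merged patch, and `looks_like_nvme`, `select_check_list`, and the detection heuristic are all assumptions made for illustration.

```python
import re

# Check lists copied from the "(debug) Raw Check List" lines in the log above.
ATA_CHECK_LIST = [
    "Current_Pending_Sector", "Reallocated_Sector_Ct", "Program_Fail_Cnt_Total",
    "Uncorrectable_Error_Cnt", "Offline_Uncorrectable", "Runtime_Bad_Block",
    "Reported_Uncorrect", "Reallocated_Event_Count", "Erase_Fail_Count_Total",
]
NVME_CHECK_LIST = ["Media_and_Data_Integrity_Errors"]


def looks_like_nvme(device: str, smartctl_output: str = "") -> bool:
    """Hypothetical heuristic: treat the device as NVMe if its path is
    /dev/nvme* or the smartctl output mentions an NVMe log page."""
    return bool(re.match(r"^/dev/nvme\d+", device)) or "NVMe Log" in smartctl_output


def select_check_list(interface: str, device: str, smartctl_output: str = "") -> list:
    """Return the raw check list to apply for this interface/device pair."""
    if interface == "nvme":
        return NVME_CHECK_LIST
    # With "auto", fall back to detection instead of assuming ATA.
    if interface == "auto" and looks_like_nvme(device, smartctl_output):
        return NVME_CHECK_LIST
    return ATA_CHECK_LIST
```

With this sketch, `select_check_list("auto", "/dev/nvme0")` yields the NVMe list, while `select_check_list("auto", "/dev/sda")` keeps the ATA list.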

@ymartin-ovh
Contributor Author

I expect the NVMe attribute checks to run when the device is NVMe and -i auto is given:

Found /dev/nvme0
###########################################################
(debug) CHECK 1: getting overall SMART health status for  
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Model Number:                       SAMSUNG MZVL2512HCJQ-00B07
 Serial Number:                      S63CNF0R415493
 Firmware Version:                   GXA7302Q
 PCI Vendor/Subsystem ID:            0x144d
 IEEE OUI Identifier:                0x002538
 Total NVM Capacity:                 512,110,190,592 [512 GB]
 Unallocated NVM Capacity:           0
 Controller ID:                      6
 Number of Namespaces:               1
 Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
 Namespace 1 Utilization:            462,648,926,208 [462 GB]
 Namespace 1 Formatted LBA Size:     512
 Namespace 1 IEEE EUI-64:            002538 b411b778d4
 Local Time is:                      Wed Mar  6 17:36:56 2024 UTC
 
 === START OF SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
 


(debug) parsing line:
Model Number:                       SAMSUNG MZVL2512HCJQ-00B07


(debug) found model:  SAMSUNG MZVL2512HCJQ-00B07

(debug) parsing line:
Serial Number:                      S63CNF0R415493


(debug) found serial number S63CNF0R415493

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/nvme0

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF SMART DATA SECTION ===
 SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
 Critical Warning:                   0x00
 Temperature:                        38 Celsius
 Available Spare:                    83%
 Available Spare Threshold:          10%
 Percentage Used:                    19%
 Data Units Read:                    83,833,423 [42.9 TB]
 Data Units Written:                 69,317,103 [35.4 TB]
 Host Read Commands:                 1,241,781,735
 Host Write Commands:                1,632,532,652
 Controller Busy Time:               36,946
 Power Cycles:                       40
 Power On Hours:                     48,708
 Unsafe Shutdowns:                   26
 Media and Data Integrity Errors:    114
 Error Information Log Entries:      114
 Warning  Comp. Temperature Time:    0
 Critical Comp. Temperature Time:    0
 Temperature Sensor 1:               38 Celsius
 Temperature Sensor 2:               47 Celsius
 


(debug) Raw Check List ATA: Current_Pending_Sector Reallocated_Sector_Ct Program_Fail_Cnt_Total Uncorrectable_Error_Cnt Offline_Uncorrectable Runtime_Bad_Block Reported_Uncorrect Reallocated_Event_Count Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:

(debug) Critical_Warning not in raw check list (raw value: 0x00)

(debug) Temperature not in raw check list (raw value: 38)

(debug) Available_Spare not in raw check list (raw value: 83)

(debug) Available_Spare_Threshold not in raw check list (raw value: 10)

(debug) Percentage_Used not in raw check list (raw value: 19)

(debug) Data_Units_Read not in raw check list (raw value: 83833423)

(debug) Data_Units_Written not in raw check list (raw value: 69317103)

(debug) Host_Read_Commands not in raw check list (raw value: 1241781735)

(debug) Host_Write_Commands not in raw check list (raw value: 1632532652)

(debug) Controller_Busy_Time not in raw check list (raw value: 36946)

(debug) Power_Cycles not in raw check list (raw value: 40)

(debug) Power_On_Hours not in raw check list (raw value: 48708)

(debug) Unsafe_Shutdowns not in raw check list (raw value: 26)

(debug) Media_and_Data_Integrity_Errors is non-zero (114)

(debug) Error_Information_Log_Entries not in raw check list (raw value: 114)

(debug) Warning__Comp_Temperature_Time not in raw check list (raw value: 0)

(debug) Critical_Comp_Temperature_Time not in raw check list (raw value: 0)

(debug) Temperature_Sensor_1 not in raw check list (raw value: 38)

(debug) Temperature_Sensor_2 not in raw check list (raw value: 47)

(debug) gathered perfdata:


###########################################################
(debug) LOCAL STATUS: WARNING, FINAL STATUS: WARNING
###########################################################


(debug) final status/output: WARNING
(debug) drives  ok: 
(debug) drives nok: [/dev/nvme0] - [/dev/nvme0] - Media_and_Data_Integrity_Errors is non-zero (114)[/dev/nvme0] - 
(debug)   msg_list: [/dev/nvme0] - [/dev/nvme0] - Media_and_Data_Integrity_Errors is non-zero (114)[/dev/nvme0] - 

WARNING: [/dev/nvme0] - [/dev/nvme0] - Media_and_Data_Integrity_Errors is non-zero (114)[/dev/nvme0] - |
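
The debug lines above suggest how each NVMe attribute is handled: the label is turned into an underscore-separated name, the raw value is reduced to its first token with thousands separators removed, and only names in the NVMe raw check list can raise a warning. A minimal Python sketch of that logic follows. It is inferred from the log output, not taken from the plugin's source; the function names and parsing details are assumptions.

```python
NVME_CHECK_LIST = {"Media_and_Data_Integrity_Errors"}

# A few lines copied from the smartctl -A output in the log above.
SAMPLE = """\
 === START OF SMART DATA SECTION ===
 SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
 Critical Warning:                   0x00
 Available Spare:                    83%
 Data Units Written:                 69,317,103 [35.4 TB]
 Media and Data Integrity Errors:    114
 Warning  Comp. Temperature Time:    0
 Temperature Sensor 2:               47 Celsius
"""


def parse_nvme_attributes(output: str) -> dict:
    """Map attribute names (e.g. Data_Units_Written) to raw value strings."""
    attrs = {}
    in_section = False
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("SMART/Health Information"):
            in_section = True  # attribute lines follow this header
            continue
        if not in_section or ":" not in line:
            continue
        label, value = line.split(":", 1)
        parts = value.split()
        if not parts:
            continue
        # "Warning  Comp. Temperature Time" -> "Warning__Comp_Temperature_Time"
        name = label.replace(".", "").replace(" ", "_")
        # "69,317,103 [35.4 TB]" -> "69317103"; "83%" -> "83"
        attrs[name] = parts[0].replace(",", "").rstrip("%")
    return attrs


def nonzero_checked(attrs: dict, check_list=NVME_CHECK_LIST) -> list:
    """Return a message for every checked attribute with a non-zero raw value."""
    return [
        f"{name} is non-zero ({raw})"
        for name, raw in attrs.items()
        if name in check_list and raw.isdigit() and int(raw) != 0
    ]


attrs = parse_nvme_attributes(SAMPLE)
messages = nonzero_checked(attrs)
```

Attributes outside the check list, such as `Temperature_Sensor_2`, are parsed but never produce a message, which matches the "not in raw check list" lines in the log.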

@Napsty Napsty self-assigned this Mar 15, 2024
@Napsty Napsty added the bug label Mar 15, 2024
@Napsty
Owner

Napsty commented Mar 15, 2024

Awesome find, thanks!
Successfully tested on a server with NVME (and ATA) drives.

@Napsty Napsty merged commit 90e102c into Napsty:master Mar 15, 2024
2 checks passed