The PMemModule is inaccessible #153

Open
cycyyy opened this issue Sep 20, 2020 · 18 comments

@cycyyy

cycyyy commented Sep 20, 2020

This is a follow-up to issue #149.
OS version: Ubuntu 20.04 LTS (GNU/Linux 5.4.0-45-generic x86_64)
ipmctl version: Intel(R) Optane(TM) Persistent Memory Command Line Interface Version 02.00.00.3820

The cpu info:

lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          80
On-line CPU(s) list:             0-79
Thread(s) per core:              2
Core(s) per socket:              20
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping:                        7
CPU MHz:                         800.489
CPU max MHz:                     3900.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4200.00
Virtualization:                  VT-x
L1d cache:                       1.3 MiB
L1i cache:                       1.3 MiB
L2 cache:                        40 MiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0-19,40-59
NUMA node1 CPU(s):               20-39,60-79
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled

The BIOS info

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
	Vendor: American Megatrends Inc.
	Version: 3.3
	Release Date: 02/21/2020
	Address: 0xF0000
	Runtime Size: 64 kB
	ROM Size: 32 MB
	Characteristics:
		PCI is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		BIOS ROM is socketed
		EDD is supported
		5.25"/1.2 MB floppy services are supported (int 13h)
		3.5"/720 kB floppy services are supported (int 13h)
		3.5"/2.88 MB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 5.14

The server has 256 GiB of DRAM and 512 GiB of PMem.
However, ipmctl shows:

ipmctl show -memoryresources
 MemoryType   | DDR                 | PMemModule  | Total
================================================================
 Volatile     | 504.000 GiB         | 0.000 GiB   | 504.000 GiB
 AppDirect    | -                   | 0.000 GiB   | 0.000 GiB
 Cache        | 256.000 GiB         | -           | 256.000 GiB
 Inaccessible | 17179868680.000 GiB | 505.689 GiB | 1.689 GiB
 Physical     | 256.000 GiB         | 505.689 GiB | 761.689 GiB

Any suggestions for configuring the BIOS or OS to fix this?

@sscargal
Contributor

@cycyyy

What system are you using? (dmidecode -t baseboard)

It looks like you successfully provisioned Memory Mode, but we're misreporting the information, specifically the 'Inaccessible' row. Is that a correct understanding of this issue? If you use any of the following commands, we can confirm that you have 512GB of volatile memory:

  • lsmem
  • top
  • cat /proc/meminfo
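
A quick way to cross-check those numbers is to total the address ranges that lsmem prints and to convert MemTotal from /proc/meminfo into GiB. This is only an illustrative sketch: the helper names are hypothetical, and it assumes the usual lsmem and /proc/meminfo text layouts (the sample lines are the values reported for this system).

```python
import re

GIB = 2**30

def lsmem_online_gib(lsmem_text):
    """Sum the sizes of all 'online' address ranges printed by lsmem, in GiB."""
    total = 0
    for line in lsmem_text.splitlines():
        m = re.match(r"\s*0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\s+\S+\s+online\b", line)
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            total += end - start + 1  # lsmem ranges are inclusive
    return total / GIB

def meminfo_total_gib(meminfo_text):
    """Convert the MemTotal line of /proc/meminfo (reported in kB) to GiB."""
    m = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if m is None:
        raise ValueError("MemTotal not found")
    return int(m.group(1)) * 1024 / GIB

sample_lsmem = """\
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x0000007e7fffffff  502G online       yes 2-252
"""
print(lsmem_online_gib(sample_lsmem))                              # 504.0
print(round(meminfo_total_gib("MemTotal:       519760656 kB\n"), 1))
```

MemTotal is normally somewhat lower than the lsmem total because the kernel reserves part of memory for itself.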

The inaccurate DDR Inaccessible value is reminiscent of #135. A workaround was added to 02.00.00.3797 and validated on Dell systems. It looks like we may have a similar issue for AMI BIOS systems.
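
One plausible reading of the DDR Inaccessible figure, offered as an assumption rather than something confirmed from the ipmctl source, is an unsigned 64-bit underflow. If Inaccessible is derived as Physical minus Volatile minus Cache, the DDR column gives 256 - 504 - 256 = -504 GiB, and wrapping that byte count modulo 2^64 reproduces the reported 17179868680 GiB. The same wraparound would also account for the 1.689 GiB in the Total column. A minimal sketch of that arithmetic:

```python
GIB = 2**30
U64 = 2**64  # assumed width of the byte counter (an assumption, not taken from ipmctl)

def wrapped_gib(byte_count):
    """Interpret a possibly negative byte count as an unsigned 64-bit value, in GiB."""
    return (byte_count % U64) / GIB

# DDR column of the reported table, in GiB
physical, volatile, cache = 256, 504, 256

# Assumed formula: Inaccessible = Physical - Volatile - Cache
ddr_inaccessible_bytes = (physical - volatile - cache) * GIB
print(wrapped_gib(ddr_inaccessible_bytes))  # 17179868680.0

# Total = wrapped DDR Inaccessible + PMemModule Inaccessible, wrapped again
pmem_inaccessible_bytes = round(505.689 * GIB)
total = wrapped_gib((ddr_inaccessible_bytes % U64) + pmem_inaccessible_bytes)
print(round(total, 3))  # 1.689
```

If this reading is right, the underlying mismatch is that the reported Volatile DDR capacity (504 GiB) exceeds the Physical DDR capacity (256 GiB), which fits a Memory Mode configuration being reported under the wrong column.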

Suggested next actions

  • Update the BIOS, BMC, and PMem firmware to the latest versions available from your system provider to see if this resolves the issue.
  • Collect ipmctl debug logs when executing ipmctl show -memoryresources. Follow the instructions in my blog - How To Enable Debug Logging in ipmctl. We'll want DBG_LOG_LEVEL=4 results. You do not need to provide the ipmctl recording.
  • [optional] Try using ipmctl v1.x. There are no packages available, but you can build it from the source. See Building and Installing IPMCTL from Source on Linux for step-by-step instructions. You don't need to install it (make install) as you can run ipmctl from the local build directory. Remember to update your LD_LIBRARY_PATH to source the correct libraries. Note, the output in version 1.x is different to v2.x.

@cycyyy
Author

cycyyy commented Oct 14, 2020

@sscargal
Sorry for the late reply.
Here is some system info:

# dmidecode -t baseboard
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
	Manufacturer: Supermicro
	Product Name: X11DPG-QT
	Version: 1.10A
	Serial Number: UM196S006546
	Asset Tag: To be filled by O.E.M.
	Features:
		Board is a hosting board
		Board is replaceable
	Location In Chassis: To be filled by O.E.M.
	Chassis Handle: 0x0003
	Type: Motherboard
	Contained Object Handles: 0

Handle 0x0017, DMI type 41, 11 bytes
Onboard Device
	Reference Designation: ASPEED Video AST2500
	Type: Video
	Status: Enabled
	Type Instance: 1
	Bus Address: 0000:05:00.0

Handle 0x0018, DMI type 41, 11 bytes
Onboard Device
	Reference Designation: Intel Ethernet X550 #1
	Type: Ethernet
	Status: Enabled
	Type Instance: 1
	Bus Address: 0000:01:00.0

Handle 0x0019, DMI type 41, 11 bytes
Onboard Device
	Reference Designation: Intel Ethernet X550 #2
	Type: Ethernet
	Status: Enabled
	Type Instance: 2
	Bus Address: 0000:01:00.1
# lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x0000007e7fffffff  502G online       yes 2-252

Memory block size:         2G
Total online memory:     504G
Total offline memory:      0B
# cat /proc/meminfo
MemTotal:       519760656 kB
MemFree:        415036256 kB
MemAvailable:   428534128 kB
Buffers:          185344 kB
Cached:         15925956 kB
SwapCached:         3312 kB
Active:         102444320 kB
Inactive:         634612 kB
Active(anon):   86972176 kB
Inactive(anon):     6332 kB
Active(file):   15472144 kB
Inactive(file):   628280 kB
Unevictable:       18448 kB
Mlocked:           18448 kB
SwapTotal:       8388604 kB
SwapFree:        8347532 kB
Dirty:               588 kB
Writeback:             0 kB
AnonPages:      86970068 kB
Mapped:           195280 kB
Shmem:              2604 kB
KReclaimable:     320388 kB
Slab:             793500 kB
SReclaimable:     320388 kB
SUnreclaim:       473112 kB
KernelStack:       16256 kB
PageTables:       187100 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    268268932 kB
Committed_AS:   157871772 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      261396 kB
VmallocChunk:          0 kB
Percpu:            90880 kB
HardwareCorrupted:     0 kB
AnonHugePages:    493568 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      945420 kB
DirectMap2M:    95168512 kB
DirectMap1G:    434110464 kB

We have updated the BIOS, but the issue is not fixed.
I tried to print debug logs but it failed:

# sudo ipmctl set -preferences DBG_LOG_LEVEL=4
Set DBG_LOG_LEVEL=4: Success

# sudo  ipmctl show -preferences
CLI_DEFAULT_DIMM_ID=HANDLE
CLI_DEFAULT_SIZE=GiB
APPDIRECT_SETTINGS=RECOMMENDED
DBG_LOG_LEVEL=4

# sudo ipmctl show -dimm
 DimmID | Capacity    | LockState | HealthState | FWVersion
===============================================================
 0x0020 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367
 0x0120 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367
 0x1020 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367
 0x1120 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367

# head -12 /var/log/ipmctl/debug.log
head: cannot open '/var/log/ipmctl/debug.log' for reading: No such file or directory
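
As a sanity check, the four module capacities listed by show -dimm add up to roughly the 505.689 GiB that show -memoryresources reports as Physical PMemModule capacity (the small difference is presumably display rounding). A hypothetical helper that totals the Capacity column, assuming the table layout shown above:

```python
def total_capacity_gib(show_dimm_text):
    """Sum the Capacity column of `ipmctl show -dimm` table output (GiB rows only)."""
    total = 0.0
    for line in show_dimm_text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows look like: 0x0020 | 126.422 GiB | Disabled | Healthy | 01.02.00.5367
        if len(cells) >= 2 and cells[0].startswith("0x") and cells[1].endswith("GiB"):
            total += float(cells[1].split()[0])
    return total

sample = """\
 DimmID | Capacity    | LockState | HealthState | FWVersion
===============================================================
 0x0020 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367
 0x0120 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367
 0x1020 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367
 0x1120 | 126.422 GiB | Disabled  | Healthy     | 01.02.00.5367
"""
print(round(total_capacity_gib(sample), 3))
```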

I'm trying the 1.x versions and I will give feedback later.

@sscargal
Contributor

I tried to print debug logs but it failed:

That was my fault, sorry. I forgot to provide the verbose flag option. The command should be ipmctl show -v -memoryresources.

You can capture the debug output that's printed to STDOUT using ipmctl show -v -memoryresources | tee ipmctl_show_-v_-memoryresources.out
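
The same capture can be scripted. The sketch below is an illustration and not part of ipmctl: capture_output is a hypothetical helper that runs a command, echoes its STDOUT, and saves a copy to a file, much like tee. It assumes ipmctl is on the PATH and that the script runs with sufficient privileges.

```python
import subprocess
import sys

def capture_output(argv, out_path):
    """Run argv, echo its stdout, and save a copy to out_path (similar to `| tee`)."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, text=True)
    with open(out_path, "w") as f:
        f.write(result.stdout)
    sys.stdout.write(result.stdout)
    return result.returncode

# Example call (not executed here), using the command from this thread:
# capture_output(["ipmctl", "show", "-v", "-memoryresources"],
#                "ipmctl_show_-v_-memoryresources.out")
```

Unlike the shell pipeline, this buffers all output until the command exits, which is fine for a short one-off command.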

@liupeng0518

Same issue.

@sscargal
Contributor

@liupeng0518 @cycyyy I think the system may be reading a partial or old config. Can you perform a secure erase and/or remove the PCD (Platform Configuration Data), re-provision the PMem, and let me know if this helps?

To remove the PCD area, use:

$ sudo ipmctl delete -f -dimm -pcd
$ sudo ipmctl show -memoryresources

The show -memoryresources command should return a message confirming no PCD can be read. If confirmed, you can proceed to create your new PMem config and reboot.

Example:

# ipmctl delete -f -dimm -pcd

Clear Config partition(s) on PMem module 0x0001: Success
Clear Config partition(s) on PMem module 0x0011: Success
Clear Config partition(s) on PMem module 0x0021: Success
Clear Config partition(s) on PMem module 0x0101: Success
Clear Config partition(s) on PMem module 0x0111: Success
Clear Config partition(s) on PMem module 0x0121: Success
Clear Config partition(s) on PMem module 0x1001: Success
Clear Config partition(s) on PMem module 0x1011: Success
Clear Config partition(s) on PMem module 0x1021: Success
Clear Config partition(s) on PMem module 0x1101: Success
Clear Config partition(s) on PMem module 0x1111: Success
Clear Config partition(s) on PMem module 0x1121: Success

Data dependencies may result in other commands being affected. A system reboot is required before all changes will take effect.

# ipmctl show -memoryresources
One or more PMem modules have invalid PCD data. A platform reboot is recommended to restore valid PCD data, then try again.
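
If you are scripting this check, the confirmation can be detected by looking for that message in the command output. A minimal sketch, assuming the message wordings quoted in this thread (other ipmctl versions may phrase it differently); pcd_cleared is a hypothetical helper name:

```python
import re

def pcd_cleared(show_output):
    """Return True if the output reports missing or invalid PCD data."""
    return re.search(r"(invalid|do not have) PCD data", show_output) is not None

msg = ("One or more PMem modules have invalid PCD data. A platform reboot is "
       "recommended to restore valid PCD data, then try again.")
print(pcd_cleared(msg))  # True
```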

To ensure all config [and data] is erased, there are multiple ways to securely erase the PMem modules, as shown in How to Securely Erase Data on Intel® Optane™ Persistent Memory.

@cycyyy
Author

cycyyy commented Oct 18, 2020

@sscargal thank you so much!
I have removed PCD successfully. Is there any doc for creating new PMem configs?

@liupeng0518

@sscargal I deleted the PCD, but I still have the same problem.

@sscargal
Contributor

@cycyyy Once the PCD is removed, you need to recreate the Memory Mode or AppDirect goal, e.g.:

// AppDirect (Interleaved)
$ sudo ipmctl create -goal PersistentMemoryType=AppDirect

//MemoryMode
$ sudo ipmctl create -goal MemoryMode=100
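
Putting the steps from this thread together, a re-provisioning run is: clear the PCD, create the goal, then reboot. The sketch below only builds and prints the commands (a dry run) so they can be reviewed before anything destructive is executed. provisioning_commands is a hypothetical helper, and the ipmctl arguments are copied from the commands quoted in this thread.

```python
GOALS = {
    "appdirect": ["PersistentMemoryType=AppDirect"],  # AppDirect (interleaved)
    "memorymode": ["MemoryMode=100"],                 # 100% Memory Mode
}

def provisioning_commands(mode):
    """Return the command sequence for re-provisioning, without running anything."""
    if mode not in GOALS:
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        ["sudo", "ipmctl", "delete", "-f", "-dimm", "-pcd"],  # erases the existing config
        ["sudo", "ipmctl", "create", "-goal", *GOALS[mode]],
        ["sudo", "reboot"],                                    # a reboot is needed for the goal to take effect
    ]

for cmd in provisioning_commands("appdirect"):
    print(" ".join(cmd))
```

Keeping this as a dry run is deliberate: deleting the PCD removes the existing configuration, so each command should be run by hand after checking it.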

@sscargal
Contributor

@liupeng0518 What system(s) are you using? If you can provide the ipmctl debug log, that will be helpful.

@liupeng0518

liupeng0518 commented Oct 20, 2020

@sscargal CentOS 7.7, kernel version 5.9.0-1
ipmctl_show_-v_-memoryresources.log

@cycyyy
Author

cycyyy commented Oct 25, 2020

@sscargal
After deleting the PCD and creating an AppDirect goal with

sudo ipmctl create -goal PersistentMemoryType=AppDirect
sudo reboot

I got the error:

One or more PMem modules have invalid PCD data. A platform reboot is recommended to restore valid PCD data, then try again.

Rebooting didn't resolve the error.

@sscargal
Contributor

@cycyyy and @liupeng0518 - Can you update the ipmctl package to 02.00.00.3825 and see if the issue is still reproducible, please.

Can you also upload an SOS report from your affected systems, please. Thanks. If you have sos installed, it may have the pmem plugin. If not, you can download the latest version and run it locally without installing it:

// Clone the SOS Report repo
$ git clone https://github.com/sosreport/sos

// Run an SOS report
$ cd sos
$ sudo ./bin/sos report 
(You don't have to enter any information for the questions)

@cycyyy
Author

cycyyy commented Oct 27, 2020

@sscargal I have updated ipmctl to v02.00.00.3825 but the issue is still there.

The SOS report is attached.
sosreport-brs2-1-2020-10-27-vvbeyhh.tar.xz.zip

Also, I can give you access to my server if you want.
Thank you so much!

@sscargal
Contributor

sscargal commented Nov 5, 2020

@cycyyy @liupeng0518 Apologies for the delay. I didn't see anything in the SOS Report to indicate an issue in the OS or PMem.

Can you check the BIOS please. I suspect your platform may be configured for Memory Mode, so attempts to switch to AppDirect using ipmctl won't work.

The config can be seen and changed in the BIOS:

BIOS -> Advanced -> Intel Optane DC Persistent Memory Configuration -> Create Goal Configuration, then use Reserved [%], AppDirect, and Memory mode [%] to configure the system to the desired config. Save/Apply the changes and reboot.

Let me know if this works.

/Steve

@cycyyy
Author

cycyyy commented Nov 20, 2020

@sscargal
Sorry for the late reply. I can't access the BIOS directly, so I had to wait for our PEs to respond.
Configuring through the BIOS doesn't work either; here are some pictures of the error.

driver_error
no_pcd

We have already updated the BIOS, so could this be a hardware issue?

@sscargal
Contributor

@cycyyy The message "One or more DIMMs do not have PCD data. A Platform reboot is recommended to restore valid PCD data, then try again." has two possible causes:

  1. The message is expected when the PMem has been factory reset, either with ipmctl delete -pcd or with a secure erase. Since the PCD and LSA have been erased, there's nothing to read, so we see that message. Creating a new goal creates a new, valid PCD/LSA, which resolves that message.

  2. As you know, there was a known issue (The goal is not applied after reboot #149) using ipmctl versions 02.00.00.3800 - 02.00.00.3820 where it would create an invalid PCD/LSA. You have the fix for this. Provisioning PMem through the BIOS didn't have the issue and should work assuming the HW is good. I have no reason to believe your HW is bad at this point.
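
To check whether an installed build falls in the affected range, the version fields can be compared numerically. A small sketch, assuming the four-field 02.00.00.NNNN format seen in this thread; is_affected is a hypothetical helper, and the range bounds are the ones quoted above.

```python
import re

AFFECTED_MIN = (2, 0, 0, 3800)
AFFECTED_MAX = (2, 0, 0, 3820)

def parse_version(text):
    """Extract the first dotted four-field version (e.g. 02.00.00.3820) from text."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError("no version found")
    return tuple(int(part) for part in m.groups())

def is_affected(text):
    """True if the version lies in the range reported as creating an invalid PCD/LSA."""
    return AFFECTED_MIN <= parse_version(text) <= AFFECTED_MAX

print(is_affected("Intel(R) Optane(TM) Persistent Memory Command Line Interface Version 02.00.00.3820"))  # True
print(is_affected("02.00.00.3825"))  # False
```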

At this point, my recommendation is to contact your HW vendor to have them help you set up the platform. Working onsite or through a virtual working session would be a good next step.

@spawnflagger

Just a datapoint - I had this same issue with a new Super Micro server. (PM200 NVDIMMs on Xeon 53xx Ice Lake)
Regardless of using BIOS utility or ipmctl (compiled from source on Ubuntu 20.04.4 LTS), the goal wouldn't apply and no regions were defined after reboot.

The fix was to clear the CMOS by pulling all AC power cords, removing the motherboard battery, and setting the "clear CMOS" jumper for a few seconds.
Afterwards, I went into the BIOS, loaded defaults, and with the Optane utility (under the Advanced menu) was able to create a goal, reboot, and then create namespaces.

(Hope that saves someone a few hours and hair from being pulled out.)

@StevenPontsler
Contributor

@spawnflagger Thanks for sharing the experience and advice.
