Firmware update stuck #183

Open
dbshch opened this issue Dec 29, 2021 · 3 comments

@dbshch

dbshch commented Dec 29, 2021

I have run into some severe problems with DCPMM, so I'm trying to update the firmware of the PMs. The update has been stuck for 4 hours and still hasn't finished. This looks like issue #130, but I believe I'm already using the latest ipmctl.

Server: Huawei 2288h v5
BIOS: the latest v799
Firmware: from 01.02.00.5355 to 5417
OS: Ubuntu 21.04
ipmctl: from the Ubuntu repo; I think it is 02.00.00.3852. The "ipmctl version" command is also stuck now.

The firmware was downloaded from Huawei's support site. It was released on the same date as the v799 BIOS, so I think the BIOS should work with this firmware.

Every ipmctl command now hangs as well (ndctl commands still work). Even "ipmctl version -v" just prints logs like this repeatedly, every second:

NVM_DBG_LOGGER NVDIMM-VERB:Exiting Dimm.c::FwCmdIdDimm(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering NvmDimmConfig.c::SetFisTransportAttributes()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting NvmDimmConfig.c::SetFisTransportAttributes(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering Dimm.c::PopulateDimmBsrAndBootStatusBitmask()
NVM_DBG_LOGGER NVDIMM-VERB:Entering Dimm.c::FwCmdGetBsr()
NVM_DBG_LOGGER NVDIMM-VERB:Entering Utility.c::OpenNvmDimmProtocol()
NVM_DBG_LOGGER NVDIMM-VERB:Entering Utility.c::GetDriverHandle()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting Utility.c::GetDriverHandle(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering Utility.c::CheckConfigProtocolVersion()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting Utility.c::CheckConfigProtocolVersion(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Exiting Utility.c::OpenNvmDimmProtocol(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering NvmDimmConfig.c::GetFisTransportAttributes()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting NvmDimmConfig.c::GetFisTransportAttributes(): 0x0
NVM_DBG_LOGGER NVDIMM-DBG:Dimm.c::PassThru:7337: Calling 0xfd:0x3 over ddrt sp on DCPMM 0x101

Can I interrupt or kill the update process and try updating with ndctl instead?

Separately, as mentioned above, I have had some severe problems with DCPMM, and I'm not sure whether they are related to this issue. The server reports "System memory MRC fatal error detected". Over the last 2 days, data written to the PM (mounted as fsdax) has been lost after the server restarts. I have also seen some very strange performance behavior, but I'm not sure whether it is caused by the server and PM problems or is expected.

@dbshch
Author

dbshch commented Dec 29, 2021

The process finished after 5 hours for 4x128G PM. Is this normal behavior?

@nolanhergert
Contributor

nolanhergert commented Dec 29, 2021

You're right, it looks exactly like that other issue.

Yeah, see if ndctl has the same behavior. It's using a slightly different pathway, so it might perform a lot better.

I think your particular BIOS implementation is running our payload transactions in SMM for some reason, which is subject to throttling. 5 hours is about right, if you assume one second per 64 bytes for a ~300KB firmware image. If you force it to use the large payload mailbox, you'll still have that 1 second penalty per transaction, but each transaction carries a lot more data, so the update should complete much faster.

ipmctl load -ddrt -lpmb -v -source <fw.bin> -dimm

Let me know what you find out. We didn't default to this behavior because our reference BIOS didn't throttle ddrt small payload transactions, so they completed a few seconds faster than using large payload.
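
For reference, the five-hour figure can be checked with some quick arithmetic. This is only a rough sketch: the function name and parameters are made up for illustration, and it assumes the four modules are updated one after another and that "300KB" means 300 * 1024 bytes, neither of which is confirmed in this thread.

```python
import math

def estimated_update_seconds(image_bytes, payload_bytes,
                             seconds_per_transaction=1.0, dimm_count=1):
    """Estimate the update time when every mailbox transaction is throttled.

    Assumption: modules are updated sequentially, so the total time
    scales linearly with dimm_count.
    """
    transactions = math.ceil(image_bytes / payload_bytes)
    return transactions * seconds_per_transaction * dimm_count

IMAGE_BYTES = 300 * 1024  # ~300KB firmware image (assumed to be KiB)

per_dimm = estimated_update_seconds(IMAGE_BYTES, payload_bytes=64)
all_dimms = estimated_update_seconds(IMAGE_BYTES, payload_bytes=64, dimm_count=4)

print(f"{per_dimm / 3600:.2f} h per DIMM, {all_dimms / 3600:.2f} h for 4 DIMMs")
# prints "1.33 h per DIMM, 5.33 h for 4 DIMMs"
```

Under those assumptions, 64-byte transactions at one second each give about 1.3 hours per module, or roughly 5.3 hours for four, which is close to what was reported. Passing a larger payload_bytes value shows how the transaction count, and with it the total time, drops when more data fits into each throttled transaction.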

Maybe start a discussion with Huawei and ask them to check their implementation relative to Intel's reference BIOS in this regard?

As to your other questions, @sscargal might have some better insight.

@sscargal
Contributor

@dbshch I agree with Nolan that opening a support ticket with Huawei is the correct next step. We can't provide ODM/OEM support through this GitHub channel.

I would start by understanding and resolving the MRC issues to rule out hardware problems. Then you can look at the performance issue(s). There are dedicated Intel Optane Persistent Memory support channels that can provide more general support than this ipmctl-specific GitHub community, though OEM/ODM-specific issues need to be addressed directly with the vendor.
