Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes on aarch64 due to off-by-one errors #264

Open
suve opened this issue Aug 18, 2024 · 7 comments
Open

Crashes on aarch64 due to off-by-one errors #264

suve opened this issue Aug 18, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@suve
Copy link

suve commented Aug 18, 2024

Using the latest cpufetch v1.06, the program builds fine, but then crashes when ran on aarch64.
Debugging with Valgrind led me to find two off-by-one errors:


Commit 1b746bc introduced this code to src/common.pci.c:

    struct pci_device* dev = pci->devices[i];
    int path_size = strlen(PCI_PATH) + strlen(dev->path) + 2;

    // Read vendor_id
    char *vendor_id_path = emalloc(sizeof(char) * (path_size + strlen("vendor")));
    sprintf(vendor_id_path, "%s/%s/%s", PCI_PATH, dev->path, "vendor");

/* [...] some code omitted for brevity [...] */

    // Read device_id
    char *device_id_path = emalloc(sizeof(char) * (path_size + strlen("device")));
    sprintf(device_id_path, "%s/%s/%s", PCI_PATH, dev->path, "device");

When calculating path_size, the code correctly adds 2 bytes for the extra slashes introduced by the sprintf() call. However, both emalloc() calls fail to add another extra byte for the NUL terminator.


Commit 90624b9 introduced the following change in src/arm/midr.c:

 char* get_str_features(struct cpuInfo* cpu) {
   struct features* feat = cpu->feat;
-  uint32_t max_len = strlen("NEON,SHA1,SHA2,AES,CRC32,") + 1;
+  uint32_t max_len = strlen("NEON,SHA1,SHA2,AES,CRC32,SVE,SVE2") + 1;
   uint32_t len = 0;
   char* string = ecalloc(max_len, sizeof(char));

If you look carefully, you'll notice the old string was terminated with a comma, whereas the new one is not.

@Dr-Noob
Copy link
Owner

Dr-Noob commented Aug 19, 2024

Thank you for reporting this.

I haven't noticed it in any of my ARM machines because at least the first one will only occur if the SoC cannot be detected with the "main" detection methods. Strangely, it was not noted even for users that would run that function (e.g., #262). It's a pity v1.06 comes with this, but at least it's fixed now on master. Can you confirm if this works for you?

Also, would you mind opening another issue with the machine where it was crashing without this last patch? I assume it will give you unkown SoC. Thanks!

@Dr-Noob Dr-Noob added the bug Something isn't working label Aug 19, 2024
@suve
Copy link
Author

suve commented Aug 19, 2024

first one will only occur if the SoC cannot be detected with the "main" detection methods

That seems to be precisely the case here.

SoC:                 Unknown
Technology:          Unknown
Microarchitecture:   Neoverse N1
Max Frequency:       3.000 GHz
Cores:               80 cores
Features:            NEON,SHA1,SHA2,AES,CRC32
Peak Performance:    3.84 TFLOP/s

It's a pity v1.06 comes with this, but at least it's fixed now on master. Can you confirm if this works for you?

I patched the program myself, but looking at commit aa94389, our fixes are identical. I can confirm that the program works fine with those changes.

Also, would you mind opening another issue with the machine where it was crashing without this last patch?

The second issue was actually something I've stumbled across by accident while reading the code; it didn't actually trigger a crash on this machine. Sorry, should have made that clear.

@Dr-Noob
Copy link
Owner

Dr-Noob commented Aug 19, 2024

Yeah well, I meant opening a new issue to track the unknown SoC (the machine you have showed). But we can try to solve that here if you want.

If that's the case, what machine are we talking about in this case? Model? Also, what is the output of cpufetch --verbose? cat /proc/device-tree/compatible? Anything that could be indicating the model under lspci -nn?

@suve
Copy link
Author

suve commented Sep 4, 2024

So the story is, I maintain the cpufetch package in Fedora Linux. I observed these crashes when trying to build the package using the official distro builders. Since I have rather limited access to these machines, I asked for help on the devel mailing list. Here's the answer from Kevin Fenzi:

So, there's 4 kinds of aarch64 builders:

1. Our oldest ones, buildhw-a64-01/02,07/08/09/10/11/12/13/21/22:

These are lenovo emags.
lscpu says:

Vendor ID:                APM
  BIOS Vendor ID:         Ampere(TM)
  Model name:             -
    BIOS Model name:      eMAG   CPU @ 3.0GHz

cpufetch says:

[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   Xgene
###       ##   ###      ###    ##    ###  Max Frequency:       3.000 GHz
 ###    ####   ###      ###    ##    ###  Cores:               32 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    768.00 GFLOP/s
No /proc/device-tree/compatible

2. We have two thunderx2's, buildhw-a64-19/20:

Vendor ID:                Cavium
  BIOS Vendor ID:         Cavium Inc.
  Model name:             ThunderX2-99xx
    BIOS Model name:      Cavium ThunderX2(R) CPU CN9975 v2.1 @ 2.20GHz CN9975-2200BG4077-PR21
                          -Y-G CPU @ 2.2GHz

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   ThunderX2 99XX
###       ##   ###      ###    ##    ###  Max Frequency:       ~2.140 GHz
 ###    ####   ###      ###    ##    ###  Cores:               224 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    3.83 TFLOP/s
There's a ton of lines for each cpu like: 
...
[WARNING]: Could not open '/sys/devices/system/cpu/cpu223/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to fetch max frequency for core 223. This is probably because the core is offline
[WARNING]: Could not open '/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to find max frequency from udev, measuring CPU frequency
[WARNING]: Running frequency measurement with 5250000 iterations on core 0...
cpufetch is measuring the max frequency...[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

3. We have 4 newer mt snow boxes: buildhw-a64-03/04/05/06

Vendor ID:                ARM
  BIOS Vendor ID:         Ampere(R)
  Model name:             Neoverse-N1
    BIOS Model name:      Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
    BIOS CPU family:      257
    Model:                1

[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   Neoverse N1
###       ##   ###      ###    ##    ###  Max Frequency:       3.000 GHz
 ###    ####   ###      ###    ##    ###  Cores:               80 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    3.84 TFLOP/s

4. There's a bunch of buildvm-a64's on newer mt snow boxes:

Vendor ID:                ARM
  BIOS Vendor ID:         QEMU
  Model name:             Neoverse-N1
    BIOS Model name:      virt-rhel9.2.0  CPU @ 2.0GHz
    BIOS CPU family:      1

...
[WARNING]: Could not open '/sys/devices/system/cpu/cpu11/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to fetch max frequency for core 11. This is probably because the core is offline
[WARNING]: Could not open '/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq'
[WARNING]: Unable to find max frequency from udev, measuring CPU frequency
[WARNING]: Running frequency measurement with 7250000 iterations on core 0...
cpufetch is measuring the max frequency...[WARNING]: SoC detection failed using /proc/cpuinfo: No string found
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-efuse0/nvmem: No such file or directory
[WARNING]: read_file: /sys/bus/nvmem/devices/rockchip-otp0/nvmem: No such file or directory
[WARNING]: guess_soc_from_uarch: No uarch matched the list
[WARNING]: guess_soc_from_pci: No PCI device matched the list

                                          SoC:                 Unknown
   #####  ##   # #####  ## ####  ######   Technology:          Unknown
 ###    ####   ###      ####  ###   ###   Microarchitecture:   Neoverse N1
###       ##   ###      ###    ##    ###  Max Frequency:       ~2.990 GHz
 ###    ####   ###      ###    ##    ###  Cores:               12 cores
  ######  ##   ###      ###    ##    ###  Features:            NEON,SHA1,SHA2,AES,CRC32
                                          Peak Performance:    574.08 GFLOP/s

Looking at my build history, the crashing builds occurred on type 1, 3 and 4.

Sorry for the late answer. Feel free to ask questions in the mailing list thread to avoid using me as an intermediary.

@Dr-Noob
Copy link
Owner

Dr-Noob commented Sep 5, 2024

I think participating in the mailing list is the wrong thing to do for me. The issues should all be collected in one place (here). Spreading them across different places does not seem the most convinient approach for a developer. The issues page is always open for those who want to report any issue. In particular, I'm interested in knowing more about the sentence:

"I suspect upstream needs to generalise the way they deal with some of the pieces, happy to assist with comments upstream."

I'm afraid that if we want to keep the customized logo for each vendor and the manufacturing process this is going to be hard to accomplish.

From the machines you have shown, the first and the second one could be potentially fixed, for which I'd need the output of hexdump -C /proc/device-tree/compatible (I'm actively working on adding support for ARM servers before they are reported in the issues page but it's not my priority right now. I'd be happy to accept PR to improve the support for this). The third machine is already supported (#262, you would need to build from source) and the last one seems to be a VM, which I don't support.

@ausil
Copy link

ausil commented Sep 10, 2024

I know none of those machines use device tree, they all use ACPI to define the hardware.

@Dr-Noob
Copy link
Owner

Dr-Noob commented Sep 10, 2024

In such case, what do you think would be a non-root approach to uniquely identify the SoC?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants