
Questions on Silifuzz measurement results on CloudLab #3

Open

Maknee opened this issue Dec 9, 2022 · 4 comments

Maknee commented Dec 9, 2022

Hi SiliFuzz team,

We used SiliFuzz to run a large-scale measurement (10K machine-hours in total) on 200 CloudLab machines to better understand SDC characteristics. The detailed setup is described below.

Surprisingly, we didn’t observe any SDC besides a false positive (which I reported in October).

The Google and Meta papers report, respectively, “on the order of a few mercurial cores per several thousand machines” and an “SDC occurrence rate of one in thousand silicon devices”.

I’m writing to ask whether you have any insight into this observation. Are the SDCs you observed specific to certain CPU families (like Intel CPUs)? Or would you suggest an even larger-scale testbed than CloudLab?

Our measurement setup is as follows:

  • Unique snapshots: 17,333
  • Total hours spent executing: 10,031
  • Machines: 200
  • CPUs: 4-core Intel Xeon E5530 (50 machines) and 8-core Intel Xeon E5-2630 v3 (150 machines)
  • Total execution time per machine: ~50 hrs

We made a few changes (https://github.com/xlab-uiuc/SDCBench):

  • edited SiliFuzz to collect and organize the generated snapshots, before the equivalent feature was added to the upstream repository (Add a script to collect corpus from fuzzing results for real CPU tests #1)
  • edited SiliFuzz to run and report results at scale: the framework generates snapshots to run on all machines, and each machine reports its results back to a central database (see the sketch below)
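
Roughly, the report-back flow looks like the following sketch. The binary name, paths, and database schema here are simplified placeholders, not the actual SDCBench code:

```python
# Hypothetical sketch of the fan-out / report-back flow described above:
# each machine runs the SiliFuzz runner over a shared corpus and appends a
# summary row to a central results database. All names, paths, commands,
# and the schema are illustrative, not the actual SDCBench implementation.
import socket
import sqlite3
import subprocess

DB_PATH = "results.db"    # central results store (illustrative)
CORPUS = "corpus.xz"      # shared snapshot corpus (illustrative)

def run_corpus(corpus: str) -> dict:
    """Run the corpus locally and summarize the runner's exit status."""
    # Placeholder command line; see the SiliFuzz docs for the real binaries/flags.
    proc = subprocess.run(["./silifuzz_runner", corpus],
                          capture_output=True, text=True)
    return {
        "host": socket.gethostname(),
        "exit_code": proc.returncode,   # non-zero may indicate a mismatch worth triaging
        "log_tail": proc.stdout[-4096:],
    }

def report(result: dict) -> None:
    """Append one (machine, run) summary row to the shared database."""
    con = sqlite3.connect(DB_PATH)
    con.execute("CREATE TABLE IF NOT EXISTS runs(host TEXT, exit_code INT, log_tail TEXT)")
    con.execute("INSERT INTO runs VALUES (?, ?, ?)",
                (result["host"], result["exit_code"], result["log_tail"]))
    con.commit()
    con.close()

if __name__ == "__main__":
    report(run_corpus(CORPUS))
```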

ksteuck (Collaborator) commented Dec 13, 2022

Hi @Maknee,

Thanks for taking the time to try the tool and provide feedback!
I cannot go into a lot of detail, but let me provide some quotes that may give you insight into the scale of the fleet and the scanning infrastructure behind the findings reported in the papers.

  • "Our current corpus contains snapshots generated by fuzzing Unicorn and XED with libFuzzer and a number of random snapshots produced by ifuzz—about 500,000 in total" (the SiliFuzz paper)
  • "We fuzz proxies, such as software CPU simulators and disassemblers, using traditional software fuzzing techniques, and then use the accumulated corpus to cross-check millions of CPU cores" (the SiliFuzz paper)
  • "Corruption rates vary by many orders of magnitude (given a particular workload or test) across defective cores, and for any given core can be highly dependent on workload and on f, V, T" (Cores that don't count)
  • "We have a modest corpus of code serving as test cases, selected based on intuition we developed from experience with production incidents, core-dump evidence, and failure-mode guesses. This corpus includes real-code snippets, interesting libraries (e.g., compression, hash, math, cryptography, copying, locking, fork, system calls), and specially-written tests, some of which came from CPU vendors." (Cores that don't count)
  • "When a fleet is made up of millions of machines spread across multiple regions and fault domains, it is important that testing is efficient and tactical" (the Meta paper)

ksteuck (Collaborator) commented Dec 14, 2022

One actionable thing you can do is try to find better proxies to further explore the "fuzzing by proxy" concept. Potential candidates include XED, Unicorn v2, Bochs and any other x86 emulators you may have access to.
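
To make the concept concrete, here is a minimal sketch of proxy-based filtering using Unicorn's Python bindings as the proxy: random byte strings are emulated, and only those the proxy executes without faulting are kept as candidate test content for real hardware. This is an illustration of the idea only, not the SiliFuzz pipeline, and the constants are arbitrary; a stronger proxy (Bochs, an XED-based checker, etc.) would slot into the same place.

```python
# Minimal illustration of "fuzzing by proxy": use a software emulator
# (Unicorn via its Python bindings) as the proxy, feed it random byte
# strings, and keep only those it executes without faulting as candidate
# test content. Conceptual sketch only; constants and thresholds are arbitrary.
import os
from unicorn import Uc, UcError, UC_ARCH_X86, UC_MODE_64

CODE_ADDR = 0x1000   # where the candidate bytes are mapped in the emulator
PAGE_SIZE = 0x1000

def proxy_accepts(code: bytes) -> bool:
    """True if the proxy emulates `code` without faulting (bounded instruction count)."""
    try:
        mu = Uc(UC_ARCH_X86, UC_MODE_64)
        mu.mem_map(CODE_ADDR, PAGE_SIZE)
        mu.mem_write(CODE_ADDR, code)
        # Bound the instruction count so inputs that loop don't hang the filter.
        mu.emu_start(CODE_ADDR, CODE_ADDR + len(code), count=100)
        return True
    except UcError:
        return False

def make_candidates(n: int, max_tries: int = 100_000, snippet_len: int = 16):
    """Draw random byte strings and keep the ones the proxy accepts."""
    kept = []
    for _ in range(max_tries):
        if len(kept) >= n:
            break
        code = os.urandom(snippet_len)
        if proxy_accepts(code):
            kept.append(code)
    return kept

if __name__ == "__main__":
    for snippet in make_candidates(5):
        print(snippet.hex())
```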

tianyin commented Dec 15, 2022

@ksteuck Thank you for the answer!

We did read all the papers you quoted (Cores that don't count, the Meta ones, and the SiliFuzz paper), and that's exactly where our questions came from :)

Basically, we ran SiliFuzz on the largest fleet we could find in academia, i.e., CloudLab, but we were not able to observe any SDCs.

We wonder whether this means we simply don't have a large enough fleet to observe SDCs in an academic setting, or we measured the wrong CPUs, or we are doing something wrong that fails to capture SDCs. We'd love to hear your thoughts!

Regarding your last point, we ran the open-source SiliFuzz (which I believe uses Unicorn). Are you hinting that using a different proxy (e.g., XED) could better expose SDCs?

ksteuck (Collaborator) commented Dec 16, 2022

We wonder whether this means we simply don't have a large enough fleet to observe SDCs in an academic setting, or we measured the wrong CPUs, or we are doing something wrong that fails to capture SDCs.

The answer may be "all of the above". I'm not familiar with the details of your setup, but I can address the questions of scale and content quality based on what has been published on the topic:

  • Google and others have reported running on millions of cores. 200 machines is simply not the scale at which SDCs are readily observable; you may have been running on 200 perfectly healthy machines (see the back-of-envelope sketch after this list).
  • The quality of the test content is important. SiliFuzz is just one tool for automatic test content generation. The 1000 DPM number quoted in the OP is the result of testing millions of cores with a diverse set of content under different f/V/T over prolonged periods of time (not just in terms of raw machine-hours but also at different moments of the life cycle). Quoting the SiliFuzz paper: "while for CPU defects we have to test every individual core repeatedly over its lifetime due to wear and tear."
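
As a back-of-envelope illustration of the scale point, treating the roughly 1-in-1000 device rate quoted in the OP as a uniform, independent per-machine rate (a strong simplification, since real rates vary by workload, f/V/T, and CPU family):

```python
# Back-of-envelope sketch: how likely is a 200-machine fleet to contain even
# one defective device, assuming the ~1-in-1000 rate quoted in the OP applies
# uniformly and independently? (A strong simplification; real rates vary by
# workload, f/V/T, CPU family, and device age.)
defect_rate = 1 / 1000   # assumed per-device rate
machines = 200           # CloudLab fleet size in this measurement

expected_defective = machines * defect_rate
p_at_least_one = 1 - (1 - defect_rate) ** machines

print(f"expected defective machines: {expected_defective:.2f}")  # ~0.20
print(f"P(fleet has >= 1 defective): {p_at_least_one:.2%}")      # ~18%
```

And even when a defective machine is present, the test content still has to trigger that core's particular failure mode, per the "Cores that don't count" quote above.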

Regarding your last point, we ran the open-source SiliFuzz (which I believe uses Unicorn). Are you hinting that using a different proxy (e.g., XED) could better expose SDCs?

SiliFuzz itself is proxy-agnostic. We provide a sample Unicorn-based proxy, but what I was suggesting is exploring other proxies to improve coverage. Better proxies should provide better coverage (or at least that's the heuristic behind SiliFuzz). Unicorn is a rather "weak" proxy.

Finally, we've recently published a corpus of about 300k snapshots based on Unicorn.
