
Where can I find the latest results of the benchmark? #145

Open
1 of 2 tasks
sirandreww opened this issue Oct 21, 2024 · 7 comments
Labels: question (Further information is requested), 🧑‍🔬 survey (An up-to-date comparative study on performance)

Comments

@sirandreww

sirandreww commented Oct 21, 2024

Edit (Steffan): Tasks

  • Conduct a (regular) survey / competition and publish results.
  • List of publications.
@SSoelvsten added the question label (Further information is requested) on Oct 21, 2024
@SSoelvsten
Owner

SSoelvsten commented Oct 21, 2024

What kind of "results" are you thinking of? As far as I know, this set of benchmarks has not yet been used for a survey of BDD packages. If it were done, more BDD packages should be added (see also #12). I had not thought of maintaining a document with such information in this repository.

The best I can do is create a list of papers that have used this set of benchmarks. But those papers each focus on a specific BDD package; that is, they often give a relative rather than an absolute view.

@sirandreww
Author

Adding a list of papers that contain benchmark results would be excellent.
I also think it would be great to have a graph in the README comparing the different implementations by speed and memory usage. That way, the performance differences between these libraries are known, which would drive their improvement further.

@SSoelvsten
Owner

SSoelvsten commented Oct 22, 2024

With respect to: List of Publications

I have pushed a list of publications in dc57f16.

@nhusung, I assume Configuring BDD Compilation Techniques for Feature Models didn't use this benchmarking suite, but rather that the newly added CNF benchmark is a generalisation for future work? Hence, it shouldn't be added to the list.

@SSoelvsten added the 🧑‍🔬 survey label (An up-to-date comparative study on performance) on Oct 22, 2024
@nhusung
Contributor

nhusung commented Oct 22, 2024

Yes, for that paper, we directly used oxidd-cli. The recently added CNF benchmark is based on our findings in Configuring BDD Compilation Techniques for Feature Models, and corresponds to the “balanced” construction.

@SSoelvsten
Owner

SSoelvsten commented Oct 22, 2024

With respect to: Results

There are a lot of different variables one can tweak with respect to (1) the machine of choice, (2) the common settings used, and (3) the settings of each individual BDD package.

  • Example for (1) and (2): All my usages have been on machines with vast amounts of memory. This has primarily been due to my need for other BDD packages to survive some of the larger instances within Adiar's scope. Yet, is something like that reasonable for an "average" use-case?
  • Example for (1) and (2): If less memory is used, then other BDD packages' performance may suffer. [figure: running time of Adiar and BuDDy depending on the amount of memory, for the RelProd benchmark on three different models]
  • Example for (3): If you compare Sylvan's performance in [Sølvsten22] with [Husung24], there is an odd slowdown for smaller instances (12-Queens). This is due to not using the same initial table size.
  • Example for (3): Some benchmarks are only solvable by Sylvan and BuDDy if they instantiate the node table at its full size immediately; for others, the opposite is true. See the TACAS 2022 artifact for the details.
  • Example for (3): While chatting with Bryant, he mentioned he was able to get CUDD to run faster by changing some of its magic constants (which are still fine-tuned for old hardware).

There already exist a few comparative studies of BDD packages. Most of them are (unsurprisingly) from the 90s, so a more up-to-date one would be beneficial to the community.

Again, note that there are quite a lot of choices to be made when trying to make a "fair" comparison.

Things to work out

I'd have to think a bit about how I/we could do something like this (and how to keep it somewhat up to date). It very quickly turns into me (1) conducting a survey and/or (2) creating a BDD competition. I'm not entirely sure I am keen on taking on either right now.

These are just a few of the things one would have to make decisions about.

  • Time:
    • Actual wall-clock time should be measured, to also capture issues with the TLB and the cache (see the measurement sketch after this list).
    • Should a timeout be used? If so, what should it be? A low timeout would be heavily biased against Adiar (small BDD computations are out of its scope).
    • Currently, I have always ignored the initialisation time, because BuDDy spends a long time initialising its node table; including it would muddy the picture (without changing the results significantly). So, should initialisation time be measured independently?
  • Memory:
    • How much memory is "not enough" and how much is "too much"?
    • Should memory be completely fixed or everything repeated with different orders of magnitude of RAM?
    • For Adiar, how much disk space (compared to RAM) should be allowed?
    • Should other BDD packages be allowed to use swap space?
    • How should memory usage even be measured? For example, BuDDy is currently set up to just take all memory immediately. So, should the number of allocated nodes be used instead?
  • Multi-threading:
    • Sylvan and OxiDD are designed for multi-threaded computation. Yet, comparing them with multiple workers against single-threaded implementations is not really fair; but not exploring their possible speed-up is unfair too. Hence, should multi-threaded implementations be compared separately?
    • Should thread-safe implementations, i.e. Sylvan, OxiDD, and LibBDD (and in the future Adiar), also be separately benchmarked?
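
As a rough sketch of how the time and memory measurements could be taken from an external harness (a Python sketch; the benchmark binary and its flags below are placeholders, and splitting off initialisation time would still require the benchmark itself to report it):

```python
import resource
import subprocess
import time

def run_once(cmd):
    """Run one benchmark process; return (wall_seconds, peak_rss_mib)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    # Note: this wall time spans the whole process, i.e. it includes any
    # initialisation done by the BDD package itself.
    wall = time.perf_counter() - start

    # On Linux, ru_maxrss of waited-for children is reported in KiB. This is
    # only meaningful if the harness runs a single child per invocation.
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, peak_kib / 1024.0

if __name__ == "__main__":
    # Placeholder binary name and flags, not the repository's actual CLI.
    wall, peak = run_once(["./queens_bdd", "-N", "12", "-M", "4096"])
    print(f"wall time: {wall:.2f} s, peak RSS: {peak:.1f} MiB")
```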

For this to be of any use, one would then also need to include more BDD packages (see also #12).

@SSoelvsten pinned this issue on Oct 22, 2024
@sirandreww
Author

I think the best approach would be to provide all the relevant data and let the user decide. And since BDDs might be used in similar ways to SAT solvers, we can take inspiration from the SAT competition.

  1. A timeout of 1 hour should be enforced.
  2. The time to solve should also be recorded.
  3. Initialization time should be recorded separately only if this is easy to do.
  4. A memory limit of, let's say, 16 GB should be enforced? Or, alternatively, several different memory limits should be tried (a sketch of enforcing such limits is given below).
  5. Memory consumption should be measured in GB to remain implementation-independent.
  6. Maybe BDD libraries that use the disk should have their own track?
  7. Multi-threaded libraries should participate in two tracks: single-core and multi-core.
  8. Thread safety is an implementation detail in my opinion, so it shouldn't matter when it comes to performance comparisons.

In conclusion, 3 tracks:

  • Single Core (not allowed to use the disk)
  • Multi Core (not allowed to use the disk)
  • Disk based (with multiple cores)

The memory limit for each track is not clear.
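
As a rough sketch of how a harness could enforce such a timeout and memory limit per track (in Python; the track table, the 16 GB / 1 hour numbers, and the benchmark command are placeholders rather than agreed-upon rules):

```python
import resource
import subprocess

# Illustrative per-track limits; none of these numbers are agreed upon.
TRACKS = {
    "single-core": {"mem_bytes": 16 * 2**30, "timeout_s": 3600},
    "multi-core":  {"mem_bytes": 16 * 2**30, "timeout_s": 3600},
    "disk-based":  {"mem_bytes": 16 * 2**30, "timeout_s": 3600},
}

def memory_limiter(mem_bytes):
    # RLIMIT_AS caps the child's virtual address space. Packages that reserve
    # their node table up front (e.g. BuDDy at its full size) hit this cap
    # earlier than their resident usage would suggest.
    def apply():
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return apply

def run_limited(cmd, track):
    cfg = TRACKS[track]
    try:
        proc = subprocess.run(cmd,
                              preexec_fn=memory_limiter(cfg["mem_bytes"]),
                              timeout=cfg["timeout_s"])
    except subprocess.TimeoutExpired:
        return "timeout"
    return "solved" if proc.returncode == 0 else "error"

if __name__ == "__main__":
    # Placeholder command for an actual benchmark invocation.
    print(run_limited(["./relprod_bdd", "--model", "some_model"], "single-core"))
```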

@SSoelvsten
Owner

Notes on "BDD Competition" Rules

To quote Cimatti et al.: "BDD-based and SAT-based model checking are often able to solve different classes of problems, and can therefore be seen as complementary techniques." But yes, taking inspiration from the SAT competition and similar is a good idea anyway.

The disk-based track would merely be Adiar beating CAL by several orders of magnitude; for that, you can just read my papers. So, this track can just be skipped, and Adiar and CAL can be excluded from the whole ordeal.

The motivating examples for the BeeDeeDee package and the Flix compiler rely on the BDD package being thread-safe. So yes, even though multi-threading is important for speed-ups (especially due to the CPU developments of the past decade), thread-safety opens up new usages of BDDs (unlike multi-threading). Hence, a thread-safe track would actually be useful. Many of the current benchmarks could probably be made multi-threaded themselves.

Notes on Hardware

One could probably make this work as multiple larger GitHub Actions workflows, assuming the runner machines are somewhat consistent. The overall idea would be something akin to:

  • One job for each sample
    • Taking somewhere between 3 and 5 samples per data point should be sufficient.
    • Take the median to exclude exceptionally bad or good states of the machine.
    • Store the result as an artifact
  • Afterwards, analyse all samples.
    • Load all artifact results
    • Use a needs dependency between the jobs to wait until all samples have been taken (see the aggregation sketch below).
    • Push the results as files (plots, tables, and more) to a results branch.

Then, a fixed document can combine them into a single readable page (set up to be the GitHub page). This whole thing can then be run whenever relevant.
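
The analysis step itself could be as small as the following sketch, assuming each sample artifact is a JSON file with package, benchmark, instance, and wall_seconds fields (this schema is not fixed, just an example):

```python
import json
import statistics
from collections import defaultdict
from pathlib import Path

def summarise(artifact_dir):
    """Group samples by (package, benchmark, instance) and take the median."""
    samples = defaultdict(list)
    for path in Path(artifact_dir).glob("**/*.json"):
        record = json.loads(path.read_text())
        key = (record["package"], record["benchmark"], record["instance"])
        samples[key].append(record["wall_seconds"])

    # The median over 3-5 samples discards exceptionally good or bad runs.
    return {key: statistics.median(times) for key, times in samples.items()}

if __name__ == "__main__":
    for (package, benchmark, instance), med in sorted(summarise("artifacts").items()):
        print(f"{package:10} {benchmark:12} {instance:12} {med:8.2f} s")
```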
