
Releases: StanfordLegion/legion

Version 24.06.0 (June 28, 2024) – Nonidempotent Traces

28 Jun 16:22
  • Build
    • Minimum required C++ standard is now 17
    • Embedded GASNet build in CMake now automatically enables GPU memory kinds
  • Legion
    • Support for nonidempotent traces (where the postconditions do not imply the preconditions of the trace)
    • Deletions are now committed in program order, making it easier for users to reason about when their effects take place
    • All tasks (and other operations) are now committed in order (a prerequisite for anticipated, but not yet implemented, precise exception support)
    • Improvements to Legion's internal algorithm for virtual instances, fixing various correctness bugs in the implementation
    • Improvements to the DefaultMapper handling of task layout constraints
  • Regent
    • Improvements to make the compiler more deterministic
    • Improvements to auto-detect CUDA
    • Support for complex numbers in std/format
    • Static control replication (SCR) and RDIR have been completely removed. All SCR and RDIR related flags (-fflow-*) have been removed, except for -fflow 0 which is permitted (but no longer does anything, and now issues a warning)
  • Tools
    • Restore profiler's ability to render dependent partitioning channels
    • Render mapper information on mapper calls in the profiler
    • Render user-provided profiling information in the profiler
  • Realm
    • UVM support for the HIP module
    • Error code support for command line parser
    • Support for querying MIG devices from NVML
    • Add indirection channel query
    • Additional unit tests and bug fixes
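
The definition of a nonidempotent trace given above (postconditions that do not imply the preconditions) can be illustrated with a toy model. This is only a sketch: the dictionaries, the field name "x", and the memory names are hypothetical, and nothing here reflects Legion's actual API or internal analysis.

```python
def is_idempotent(preconditions: dict, postconditions: dict) -> bool:
    """True if the state left by one replay satisfies the trace's own preconditions."""
    # Anything the trace does not overwrite keeps the value the precondition required.
    state_after = {**preconditions, **postconditions}
    return all(state_after[key] == value for key, value in preconditions.items())


# Hypothetical trace A: needs field "x" valid in "gpu" and leaves it there.
print(is_idempotent({"x": "gpu"}, {"x": "gpu"}))     # True
# Hypothetical trace B: needs "x" valid in "sysmem" but leaves the valid copy in "gpu".
print(is_idempotent({"x": "sysmem"}, {"x": "gpu"}))  # False
```

In this model, trace B is nonidempotent: replaying it back to back requires extra work between replays to re-establish its preconditions.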

Version 24.03.0 (March 27, 2024) – Control Replication

27 Mar 16:14

Legion is an implicitly parallel, distributed runtime system for heterogeneous supercomputers.

The most notable feature in this release is control replication, which we have been working on for many years and which makes Legion dramatically more scalable in typical usage scenarios. In fact, the vast majority of users have already been using control replication, so this is the first stable release of Legion that is usable in practice for most of our users.

If you are not familiar with control replication, there is a wiki page that describes it, and of course the original paper.

As of this release, the old control_replication branch is no longer being updated and will be deleted at some point in the future. All updates from now on will go into the master branch, and we intend to avoid long-standing feature branches in the future.

This release also finally removes some old Legion features that have been deprecated for nearly 10 years at this point. If you were somehow using those features, you will need to update to their replacements.

In addition, with this release, we are now packaging Legion Prof via crates.io. That means you can now install Legion Prof with:

cargo install --all-features --locked legion_prof@0.2403.0

(Note the version format is 0.YYMM.0. This is required because Rust uses semver while Legion uses calver.)
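
The mapping from Legion's calendar version to the crate version can be sketched as follows. The function name is hypothetical, and the sketch assumes the trailing component of the crate version is always 0, as the 0.YYMM.0 pattern suggests.

```python
import re


def calver_to_crate_version(legion_version: str) -> str:
    """Map a Legion calendar version 'YY.MM.P' to the crate form '0.YYMM.0'."""
    match = re.fullmatch(r"(\d{2})\.(\d{2})\.(\d+)", legion_version)
    if match is None:
        raise ValueError(f"not a YY.MM.P version: {legion_version!r}")
    year, month, _patch = match.groups()
    # Assumption: the trailing component is always 0, following the 0.YYMM.0 pattern.
    return f"0.{year}{month}.0"


print(calver_to_crate_version("24.03.0"))  # 0.2403.0
```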

Full release notes:

  • Build
    • ROCm 6.0 is now supported, and support for ROCm 4.x has been removed
  • Legion
    • Support for control replication has been merged
    • Support for discarding region contents on task completion
    • Long-deprecated APIs, such as the old HighLevel namespace, have been removed
  • Mappers
    • Default mapper support for control replication
    • Default and null mapper now use C++ override keyword
  • Regent
    • Support for pure projection functors that capture arguments
    • Static control replication (SCR) has been deprecated and will be removed in a future release
  • Tools
    • The profiler now correctly recognizes the logger format version and throws an error if it does not match
    • The profiler now reports when a profile was generated while debug mode (or another expensive setting) was enabled
    • Many profiler fixes for correctly rendering runtime and mapper calls
    • Profiler now renders GPU device and host execution separately
    • Optimizations to improve profiler memory usage and running time
    • Rust profiler now requires at least Rust 1.74
  • Realm
    • Support for registration of dynamically allocated buffers
    • Support for handling poisoned events for reservation
    • Refactor CUDA allocation and IPC paths
    • Support for querying CUDA device information (GPU UUID and ID), process information (process ID, hostname, host ID) and timer calibration error from the profiler
    • Remove address alignment from serializer and deserializer
    • Support for creating network shared peers using IPC mailbox
    • Support OMP thread binding and allow for multiple OMP parallel sections when enabling system OMP runtime
    • Add Realm unit tests
    • Fixes for Realm tests, sparsity map, MemoryQuery, dynamic framebuffer memory and memcpy channel

Version 23.12.0 (December 14, 2023)

14 Dec 17:41
  • Regent
    • Support for HIP multi-GPU per runtime
  • Realm
    • Improve scalability of startup by replacing point-to-point communication with allgatherv for machine model announcements
    • Support shared memory communication for system memory
    • Provide sanity check for GPU tasks to detect any leak of CUDA streams
    • Support for GPU transposes in CUDA-DMA
    • Bug fixes for CUDA-DMA

Version 23.09.0 (September 28, 2023)

28 Sep 23:38
  • Regent
    • Elide future maps in index launches
    • Improvements to Pygion interop
  • Realm
    • Add a machine configuration API that allows applications to configure the machine model without using the command line
    • Expose Realm managed CUDA/HIP stream to applications to launch GPU tasks without device-wise synchronization when hijack is disabled
    • Change timers to use rdtsc
    • Improve performance for getting highest priority task available in any task queue
    • Implement framebuffer memory with cuMemMap
    • Initial work for moving STL dependencies to header only

Version 23.06.0 (June 28, 2023)

27 Jun 17:56
  • Build
    • Fixes for CMake build on macOS
    • Fixes for HIP build when arch is specified
  • Realm
    • Support for better backtraces via libdw and libunwind
    • Improve scalability and performance in task spawning by caching the triggering operation of an event if one is provided
    • Fix a minor issue with affinity queries to properly clear the user-provided vector before populating it
    • Add more accurate GPU memory bandwidth affinity calculations if NVML is available
    • Refactor CPU core topology enumeration to serve systems without NUMA capabilities (like Jetson ARM systems)
    • Improve scalability and performance of task spawning by moving event reuse freelists to be per-processor, reducing lock contention
    • Add a microbenchmark for measuring task throughput more accurately
    • Add a series of Realm API tutorials
    • Replace CU_EVENT_DEFAULT with CU_EVENT_DISABLE_TIMING for better performance of CUDA events
    • Support Kokkos interop for the HIP module
    • Fixes for Realm tests on macOS
  • Tools
    • Legion Prof now supports search in the new profiler UI
    • Legion Prof now supports an HTTP client/server interface. Launch the server with --serve (on port 8080 by default) and attach a client to it with --attach http://127.0.0.1:8080
    • Legion Prof now supports a new archival mode via the --archive flag. Generate an offline profile and view it either via --attach or by uploading it to a server and navigating to https://legion.stanford.edu/prof-viewer/?url=...
    • Legion Prof modes (client/server/viewer) are now parallel by default, and perform heavy computations off the UI thread for better responsiveness
    • Add support for rendering indirect copies (i.e., gather/scatter)
    • Fix rendering of profiles over HTTP with old profiler UI
    • Fix profiling of copies with different numbers of hops between instances

Version 23.03.0 (March 27, 2023)

27 Mar 19:01
  • Build
    • Minimum supported CMake version is now 3.16. (Some optional features may continue to require even newer versions.)
    • Minimum supported GCC version is now 8.
    • Minimum supported CUDA version is now 10.
  • Legion
    • Added support for padded layout constraints to provide scratch space in instances for tasks to use (see examples/padded_instances).
    • Added support for tiled layout constraints to provide an ability to layout instances by breaking down dimensions (see examples/tiling).
  • Realm
    • An experimental UCX network backend has been added.
    • Updated the Kokkos interop to support Kokkos 4.0.
  • Python
    • Support loading Legion as a library from a stock Python interpreter.
  • Regent
    • Fixes to avoid leaking futures.
    • Improvements to Regent's predicate optimization.
  • Tools
    • Legion Prof now supports a native viewer UI. Enable it with the viewer feature (e.g., cargo run --features=viewer) and use the flag --view.
    • Legion Prof now has better support for rendering a subset of available nodes. Pass all log files (from all nodes) into Legion Prof and add the --subnodes flag to specify which ones to render. This ensures all copies in/out of those nodes will be shown correctly.

Version 22.12.0 (December 30, 2022)

30 Dec 17:23
  • Regent
    • Support for nested predication of if and while statements
  • Realm
    • Support priorities for Copy operations
    • Support building with multiple network backends enabled, and use -ll:networks (gasnetex/gasnet1/mpi/none) to pick which one to use during runtime
    • Separate CUDA runtime from Realm by removing all references to CUDA runtime and relying only on driver API, which fixes an issue when mixing static and dynamic cudart across an application and improves Realm’s compatibility across driver versions
  • Tools
    • Legion Prof now supports visualizing channels for indirect copies, and instances used by different operations, including Task, Copy, and Fill

Version 22.09.0 (September 30, 2022)

04 Oct 05:28
  • Python
    • Support for running packages via legion_python -m
    • Support for Jupyter Notebook in single-node execution
  • Regent
    • Deprecated support for LLVM versions less than 11 in setup_env.py. These versions will be removed in the next release. LLVM 13 is recommended, except on ARM where LLVM 11 is currently required
    • Added support for provenance for all launcher operations
    • Debug info is no longer generated by default in order to optimize compile times. To re-enable it, run with -fdebuginfo 1
  • Legion
    • Most Legion APIs now support passing a provenance string. This provenance information is passed through to tools like Legion Spy and Legion Prof so users can map what they are seeing back to their source code. In the future, provenance strings will also be used by all Legion error messages as well.
  • Realm
    • Support for fills of arbitrary instances (via multi-hop paths where needed)
    • Fixed crashes when using external instances and network-registered memory at the same time
    • Removed all direct references to CUDA runtime library in CUDA module
    • Caching of minimum-cost data transfer path for repeated copies
    • Dependent partitioning support for image and preimage using structured (~affine) transforms in addition to existing unstructured (field-based) images/preimages
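
The set-level meaning of image and preimage under a structured (affine) transform can be sketched as below. This is a one-dimensional illustration with hypothetical helper names; Realm's actual interface operates on index spaces and is not shown here.

```python
def image(points, scale, offset):
    """Image of a set of 1-D points under the affine map x -> scale * x + offset."""
    return {scale * x + offset for x in points}


def preimage(domain, targets, scale, offset):
    """Points of `domain` whose affine image lands in `targets`."""
    return {x for x in domain if scale * x + offset in targets}


source = set(range(4))                                   # {0, 1, 2, 3}
print(sorted(image(source, 2, 1)))                       # [1, 3, 5, 7]
print(sorted(preimage(range(10), {1, 3, 5, 7}, 2, 1)))   # [0, 1, 2, 3]
```

With an unstructured (field-based) image, the map would instead be looked up per element in a field of pointers; the affine form needs no stored field at all.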

Version 22.06.0 (June 29, 2022)

30 Jun 23:15
  • Regent
    • Support for cross-products in index launches, as well as multi-level projection functors.
    • Support for HIP on AMD GPUs has been added. All tasks marked with __demand(__cuda) are automatically eligible. Note that the name of the annotation may change in the future to something more general, but for now no change is being made. Some CUDA flags have migrated to more general names. See below.
    • The flag -fcuda 1 is deprecated. Use -fgpu cuda instead.
    • The flag -fcuda-offline is deprecated. Use -fgpu-offline instead.
    • The flag -fcuda-arch is deprecated. Use -fgpu-arch instead.
    • Enable HIP support with -fgpu hip and use the -fgpu-offline and -fgpu-arch flags as necessary/appropriate.
    • Support for new flag -ffast-math 1 which enables fast-math optimizations on CPU and GPU. By default, CPU code has this disabled, and GPU code uses only the contract flag in LLVM to generate FMA instructions. For compute-intensive applications, additional performance can sometimes be unlocked by enabling the full suite of optimizations with -ffast-math 1, at the cost of numerical accuracy.
    • Performance improvements for CUDA allow recent LLVM versions (e.g., 13) to match or exceed the performance of LLVM 3.8. Previously, performance regressions made LLVM 3.8 the most performant version for use with CUDA. The recommended LLVM version moving forward is 13, and setup_env.py has been updated to set this on all platforms.
    • The versions of GASNet and Terra are now pinned by default in setup_env.py. You can choose versions explicitly with GASNET_VERSION (as before, though the previous default was unpinned) and --terra-branch, respectively.
  • Realm
    • Allow use of system OpenMP runtime (instead of Realm-provided one) with -DLegion_OpenMP_SYSTEM_RUNTIME=ON. This allows inter-operation with libraries that have already been linked to the system runtime, but limits each process to a single OMP processor.
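
The Regent flag renames listed above can be applied to an existing argument list with a small helper like the one sketched here. The function name is hypothetical, and the sketch assumes each flag and its value arrive as separate arguments. Only the three renames stated in the notes are handled; any other use of -fcuda is left untouched.

```python
def migrate_gpu_flags(args):
    """Rewrite deprecated -fcuda* Regent flags to their -fgpu* replacements."""
    renamed = {"-fcuda-offline": "-fgpu-offline", "-fcuda-arch": "-fgpu-arch"}
    out = []
    i = 0
    while i < len(args):
        arg = args[i]
        # "-fcuda 1" becomes "-fgpu cuda"; other -fcuda values are not rewritten.
        if arg == "-fcuda" and i + 1 < len(args) and args[i + 1] == "1":
            out += ["-fgpu", "cuda"]
            i += 2
            continue
        out.append(renamed.get(arg, arg))
        i += 1
    return out


# "my_arch" is a placeholder value, not a real architecture name.
print(migrate_gpu_flags(["-fcuda", "1", "-fcuda-arch", "my_arch"]))
# ['-fgpu', 'cuda', '-fgpu-arch', 'my_arch']
```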

Version 22.03.0 (March 27, 2022)

28 Mar 21:33
  • Build
    • Minimum supported CMake version is now 3.7. (Some optional features continue to require even newer versions.)
  • Realm
    • Numerous bug fixes in the gasnetex network layer
    • CUDA and HIP support now allows direct specification of which GPUs to use via the -ll:gpu_ids command-line option
    • Added support for copy paths using CUDA IPC between GPUs on the same physical node
    • For applications using CUDA without the runtime API hijack AND only submitting work to the default CUDA stream, -cuda:legacysync 1 reduces the overhead of detecting the completion of device-side work launched by a task
    • Realm reduction copies may now indicate exclusive access to the destination instance, improving performance by allowing simple load/store instead of atomic operations
    • Custom reduction operations (including Legion's built-in ones) can provide HIP implementations, permitting in-place reductions in HIP device memory
  • Regent
    • Support for custom serialization of types in task parameters and results
    • New experimental timing library under std/timing