Releases: leobago/fti
Irun
- Released Spack package (https://github.com/leobago/fti-spack)
- Asynchronous HDF5 single file creation
- New installation folder structure (includes config file template, documentation, examples, etc)
- CI revision (bug fixes, new structure, compiler updates, added MPICH)
- Improved CPP interoperability (included into CI)
- Improved installation script (install.sh)
- Revised coding style constraints
- Minor patches (dcp, macros, testing)
Rabat
This release breaks compatibility with a few FTI functions for handling data types. Please check the documentation for details:
https://fault-tolerance-interface.readthedocs.io/en/latest/compatibility_notes.html
- Added support for defining non-opaque composite data types in Fortran when using HDF5;
- Added native and non-opaque bindings for Fortran Complex data type when using HDF5;
- Simplified the composite data type handling C API;
- Enhanced CMake configuration exports to include linked libraries and compiler definitions;
- Enhancement of the pre-commit hook including bug fixes.
- Implementation of the Fast-Forward feature to allow checkpoints to be taken at sub-minute intervals.
- Implementation of a checkpoint processor for reading and decoding FTI's checkpoints. Newly implemented features include HDF5 write, N-dimensional variable support, last-checkpoint processing, and usage examples.
- New API function FTI_SetAttribute that allows adding descriptive attributes to the protected datasets
Florianopolis
- GetConfiguration feature: follow-up of issue #250
- Register a virtually unlimited number of protections with FTI_Protect
- Asynchronous postprocessing of shared HDF5 checkpoint file
- Enhancement of the ICR feature by separating variable recovery from the I/O tasks for all I/O modes
- IME I/O interface (IME native API)
- Refactoring of FTI-I/O interfaces (HDF5, SIONlib, MPI-I/O, FTI-FF)
- Addition of the FTI Integrated Test Framework (ITF), a tool to develop FTI Integration tests
- Complete refactor of local integration tests behavior into the new ITF format
- Removal of deprecated files in the local tests directory
- Integration of ITF into CMake test runner tool (CTest) and Jenkins
- Expanded CI tests with local tests for all I/O libraries
- Simplified runtime metadata handling in FTI (for FTI developers)
- Implementation of key-value storage for protected variables (for FTI developers)
- Documentation for how to use, develop tests and expand ITF added to the developer guide section
- Configuration of ReadTheDocs for an auto-build of the documentation from Doxygen's resources using Breathe and Sphinx
Heraklion
Release
This release includes the new version of differential checkpointing, a complete implementation of incremental checkpointing, full support for GPU checkpointing and full support for HDF5 checkpointing, including the option for checkpointing into a single file (N-1) and restarting with a different number of processes.
Changelog
- New major feature allowing users to checkpoint data allocated in the GPU device memory.
- New implementation of differential checkpointing that addresses performance issues for highly fragmented differential updates.
- New major feature allowing users to use incremental checkpointing for CPU and GPU data by adding one by one the variables to the checkpoint file.
- New major feature for DCPPosix allowing recovery from the last non-corrupted checkpoint file.
- New examples in the examples/GPU directory that checkpoint GPU data.
- New major feature allowing restarts with a different number of processes using a shared HDF5 checkpoint file.
- New unitary tests for the new features.
- New configurable/flexible local test structure.
- Fixed bug in RecoverVar.
- Fixed bug in DCP recovery.
- Complete and full code documentation generated with Doxygen.
Cologne
- Fixed bug that prevented finding MPI with PGI compilers.
- New option to avoid killing head processes and let the user handle it.
- New major feature allowing users to use differential checkpointing including a new FTI file format.
- New major feature allowing users to keep ALL L4 checkpoints.
- New major feature allowing users to leverage FTI for general asynchronous I/O outside of checkpointing.
- New unitary tests for the new features and a few more with improved performance for the previous ones.
- Complete and full code documentation generated with Doxygen.
- Vastly improved User guide showcasing the new features of this release.
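
The idea behind differential checkpointing can be illustrated with a minimal sketch: hash the protected data block by block and write only the blocks whose hash changed since the previous checkpoint. This is not FTI's implementation or file format; it is written in Python for brevity, and the names `BLOCK_SIZE`, `block_hashes`, `diff_checkpoint` and `apply_diff` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def diff_checkpoint(data, previous_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes).

    changed_blocks maps a block index to its bytes and contains only the
    blocks that are new or differ from the previous checkpoint.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = data[start:start + block_size]
    return changed, new_hashes


def apply_diff(base, changed, total_len, block_size=BLOCK_SIZE):
    """Rebuild the current state from a base state plus changed blocks."""
    buf = bytearray(base[:total_len].ljust(total_len, b"\x00"))
    for idx, block in changed.items():
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

With an empty hash list the first call writes every block (a full checkpoint); later calls write only what changed, which is where the savings come from when updates touch a small part of the data.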
Coruna
- Fixed corner case bug when a specific failure type happens just after a restart.
- Fixed some unitary tests that were launched with the wrong arguments.
- Fixed bug on FTI File Format generating a segmentation fault.
- Implementation of checkpointing in HDF5 format.
- Support for HDF5 groups added to allow more flexibility.
- Additional unitary tests added to check HDF5 support.
- Extended wiki, documentation and user guide with HDF5 and FFF information.
Barcelona
- Switch to Jerasure 2.0
- Support for checkpoints that dynamically change size
- FTI_Realloc and FTI_GetStoredSize added to support evolving ckpt sizes
- Added support for partial local recovery with FTI_RecoverVar
- Feature to set sync time for applications with evolving iteration length
- Fix for examples bug with Intel compilers
- Fix for bug on Cray machines
- Added warning when using different compilers
- Add more unitary tests
- Wiki pages added on Github
- Fortran API added in the wiki and user guide
- Improved documentation
- Switch to BSD License
Poznan
- I/O interfaces for MPI-IO and SIONlib
- Checking checkpoint integrity with MD5 checksums
- Fixed bug for L2 and L3 checkpointing
- Fixed synchronization bugs
- Added automated testing with different compilers
- Expanded examples
- Fixed memory leaks and removed unused variables
- Local unitary tests in C and Fortran
- Improved code formatting
- Updated developer documentation
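
As an illustration of checksum-based integrity checking, the following minimal sketch stores an MD5 digest when a checkpoint file is written and compares it again before the file is trusted. It is not FTI's code; it is written in Python for brevity, and the helpers `md5_of_file`, `write_checkpoint`, `verify_checkpoint` and `demo` are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checkpoint(path, payload):
    """Write payload and return the checksum to keep alongside it."""
    with open(path, "wb") as f:
        f.write(payload)
    return md5_of_file(path)


def verify_checkpoint(path, expected_md5):
    """Return True only if the file still matches the stored checksum."""
    return md5_of_file(path) == expected_md5


def demo():
    """Write a checkpoint, verify it, corrupt one byte, verify again."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.bin")
        checksum = write_checkpoint(path, b"\x00" * 4096)
        intact = verify_checkpoint(path, checksum)
        with open(path, "r+b") as f:
            f.seek(10)
            f.write(b"\xff")  # simulate silent corruption
        still_valid = verify_checkpoint(path, checksum)
        return intact, still_valid
```

A recovery routine built on such a check can reject a damaged file and fall back to another checkpoint instead of restarting from corrupted data.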
Paris
- Fixed undefined behaviour while using snprintf function.
- Fixed bug where no checkpoints were taken if the mean iteration time exceeded 30 seconds.
- Fixed bug when validating the configuration for Levels 2 and 3.
- Fixed bug when asked to keep the last checkpoint if no checkpoint was taken during execution.
- Fixed bug where the example application crashed during recovery.
- Fixed bug where using rint required the math library at link time.
- Added Cray Compiler names to FindMPI script
- Fixed bug for files larger than 2 GB (int -> long).
- Writing full datasets harder.
- Better documentation and NEW user guide!
Chicago
Here is the list of changes for this Release
- Fixed incorrect warnings for no-head ranks if using dedicated processes.
- Fixed problem with renaming/erasing local files if using dedicated processes.
- Fixed problem with creating files/directories which already exist and deleting files/directories which don't exist.
- Moved some global variables into static.
- Removed some unused variables or code.
- Fixed buffer overflow in jerasure library.
- Fixed some bugs with strings that were not null-terminated.
- Fixed many unchecked or ignored return values from standard library functions.
- Fixed some uninitialized variable problems.
- Corrected some misleading warning/error messages.
- Fixed many resource leaks.
- Fixed many potential memory leaks.
- Fixed many TOCTOU problems.
- Added header file with declarations of all library functions.
- Added option to build examples.
- Added cmake files to build dependencies and examples.
- Cleaned library interface.
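
The file and directory fixes above (creating what already exists, deleting what does not, TOCTOU) share one general pattern: perform the operation directly and handle the failure, rather than checking first and acting afterwards. The sketch below shows that pattern in Python for brevity; it is not taken from FTI, and the helper names are hypothetical.

```python
import os
import tempfile


def ensure_dir(path):
    """Create a directory tree; succeed silently if it already exists."""
    # No os.path.exists() check beforehand: another process could create
    # the directory between the check and the call.
    os.makedirs(path, exist_ok=True)


def remove_if_present(path):
    """Delete a file; return False instead of failing if it is missing."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


def create_exclusive(path, payload):
    """Create a file only if it does not exist yet, in one atomic step."""
    try:
        # O_EXCL makes the existence test and the creation a single call.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return True
```

In C the same approach maps to calling `mkdir` and accepting `EEXIST`, calling `unlink` and accepting `ENOENT`, and opening with `O_CREAT | O_EXCL`.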