Releases: NVlabs/NVBit
NVBit-1.7.2
- [API change] nvbit_set_at_launch(CUcontext ctx, CUfunction func, uint64_t param_val, CUstream custream = nullptr, uint64_t launch_handle = 0) now accepts parameter value instead of a pointer to the parameter. The newly added custream and launch_handle are provided and used during nvbit_at_graph_node_launch() to help set the parameter for CUDA graph kernel node.
- Improved cubin compatibility
- Fixed SASS instruction parsing
- Improved CUDA graph support
- [experimental] Changed mem_trace to support CUDA graph.
- Fixed related function detection for the function pointer case.
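The [API change] entry above can be pictured with a small call-site sketch. This is not NVBit code: the sketch namespace, its opaque stand-in types, the stub body, and the helper set_launch_id are all assumptions made so the snippet compiles without CUDA or NVBit headers. Only the parameter list of nvbit_set_at_launch is copied from the release note.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-in opaque handle types (assumption: the real ones come from cuda.h).
using CUcontext = void*;
using CUfunction = void*;
using CUstream = void*;

// Last value handed to the stub, kept only so the sketch can be checked.
static uint64_t last_param_val = 0;

// Stub with the parameter list quoted in the release note.
// The real function is provided by NVBit; this body is purely illustrative.
void nvbit_set_at_launch(CUcontext ctx, CUfunction func, uint64_t param_val,
                         CUstream custream = nullptr,
                         uint64_t launch_handle = 0) {
    (void)ctx;
    (void)func;
    (void)custream;
    (void)launch_handle;
    last_param_val = param_val;
}

// Hypothetical helper showing the call-site shape after the change:
// the value itself is passed, not a pointer to it.
void set_launch_id(CUcontext ctx, CUfunction func, uint64_t grid_launch_id) {
    nvbit_set_at_launch(ctx, func, grid_launch_id);
}

}  // namespace sketch
```

In a real tool, custream and launch_handle would be forwarded only from nvbit_at_graph_node_launch(), as the note describes; ordinary launches can rely on the defaults.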
NVBit-1.7.1
- Improved CUDA program compatibility
- Fixed related function discovery on SM80 (close #129).
- Updated license headers.
NVBit-1.7
NVBit 1.7 contains many changes (to both the NVBit core and the NVBit tools) to support CUDA 12. Please read the change log carefully and follow the migration guide to port your pre-CUDA 12 NVBit tools to this new release; otherwise your NVBit tools are very likely not to work in a CUDA 12 environment.
Changes and migration guide:
- Added Orin SM_87, Ada Lovelace SM_89, and Hopper SM_90 support.
- Because of a potential deadlock during application initialization, NVBit disables module lazy loading by default: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#possible-issues-when-adopting-lazy-loading. To enable module lazy loading anyway, set NO_EAGER_LOAD=1.
- NVBit tools can no longer use syscalls in instrumentation functions, so printf() and assert() are no longer allowed in the injected functions. Any use of printf() or assert() will prevent your tool from loading and cause an application error. As a result, the mem_printf example has been removed. Instead, tool writers need to format and transfer their messages on their own. A skeleton example is provided as mem_printf2, which is built on top of mem_trace and requires tool writers to add a string formatter.
- Revised nvbit_at_ctx_init()/nvbit_at_ctx_term() callback rules:
a. CUDA API calls are no longer allowed in the nvbit_at_ctx_init() callback function; use them in the new nvbit_tool_init() callback function instead. CUDA API calls take the same lock that the CUDA driver (CUDA 12+) already holds at context creation time, when nvbit_at_ctx_init() is invoked, whereas nvbit_tool_init() is invoked before the first CUDA kernel launch without that lock being held. Failure to make this change will result in your tool deadlocking. NVBit will warn you about this change; set ACK_CTX_INIT_LIMITATION=1 to acknowledge and disable the warning.
b. Launching a kernel and allocating device or managed memory are no longer allowed in the nvbit_at_ctx_term() callback function, due to a similar locking issue. Failure to make this change will result in your tool deadlocking.
- Rewrote the mem_trace example to adapt to CUDA 12 changes by following the new nvbit_at_ctx_init()/nvbit_at_ctx_term() callback rules above. Please read the changes to mem_trace carefully and adapt your tool accordingly if it uses ChannelDev and ChannelHost from utils/channel.hpp.
- Added support for cudaLaunchKernelEx (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb9c891eb6bb8f4089758e64c9c976db9) API for all tools. Please update your tools to catch all kernel launches during instrumentation.
- NVBit tools are now compiled with arch=all by default to be able to run on all GPU architectures. To reduce tool compilation time and binary size, run make ARCH=sm_XX when you are planning to only run your tool on the sm_XX GPU architecture.
- ppc64le support is dropped.
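The printf()/assert() item above asks tool writers to supply their own string formatter on the host side. A minimal sketch of what that could look like follows. It is not taken from mem_printf2: the record layout MemAccessRecord, its field names, and the function format_record are hypothetical, and the sketch assumes the record has already been transferred from the device to the host.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical fixed-size record that a device-side instrumentation function
// could fill in and send to the host instead of calling printf().
struct MemAccessRecord {
    int32_t cta_id_x;
    int32_t warp_id;
    int32_t opcode_id;
    uint64_t addr;
};

// Host-side string formatter: turns one received record into readable text.
std::string format_record(const MemAccessRecord& r) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "CTA %d warp %d opcode %d addr 0x%016" PRIx64,
                  static_cast<int>(r.cta_id_x),
                  static_cast<int>(r.warp_id),
                  static_cast<int>(r.opcode_id),
                  r.addr);
    return std::string(buf);
}
```

The host thread that drains the channel would call format_record() on each record it receives and write the result to stdout or a log file, keeping all formatting work out of the injected device code.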
NVBit-1.5.5
Fixed
- Fixed instrumentation of relative control flow instructions in Maxwell/Pascal.
Note: the ppc64le version is compiled but not tested on real machines.
NVBit-1.5.4
Fixed
- Fixed instruction size mismatch, i.e., possible wrong value from Instr::getSize().
- Fixed nvbit_{read,write}_{ureg,pred_reg,upred_reg} functions.
Changed
- Updated CUDA header files to CUDA 11.5
- Better error messages for temporary file creation.
Note: the ppc64le version is compiled but not tested on real machines.
NVBit-1.5.3
Fixed
- Added missing surface and texture MemorySpaceStr.
- Fixed LDGSTS address generation issues.
Added
- Added SM_86 support.
Changed
- Changed mem_trace to work with multi-context workloads.
NVBit-1.5.2
Fixed
- Fixed a bug in Turing+ architectures causing program state corruption due to using printf in instrumentation functions.
- Fixed a bug in public NVBit decoding functions during Texture and Surface instruction decoding.
Added
- Added an example of instrumenting programs that use CUDA graphs.
NVBit-1.5.1
Fixed
- Fixed instruction decoding bugs in Turing and Ampere.
- Fixed a cubin parsing bug.
NVBit-1.5
Changed
- Changed *_pred functions/variables to *_guard_pred to avoid confusion, since some SASS instructions also use predicate registers as operands, which differ from the guard predicate.
- Moved instruction types to InstrType namespace from Instr class.
- Renamed class/enum type names: memOpType -> MemorySpace, memOpTypeStr -> MemorySpaceStr, operandType -> OperandType, operandTypeStr -> OperandTypeStr, regModiferType -> RegModiferType, regModiferTypeStr -> RegModifierTypeStr.
- Added a new str variable, storing the parsed operand, to operand_t.
Added
- Added support for native compilation of tools targeting SM arch >= 70 (up to the currently supported arch). Previously, compilation required targeting PTX < SM70 even when running on Volta+.
Fixed
- Removed unused mref_t variable.
- Fixed some bugs on Turing and Ampere.
- Fixed bug in instrumentation function stack calculation which resulted in segmentation fault on pbrt (#28) due to possible nested device function calls.
Removed
- Removed obsolete custom implementations of shuffle and ballot from utils.h, which were implemented to support old nvcc.
NVBit-1.4
Added
- Added complete Turing support, specifically SM_73 and SM_75.
- Added Ampere support, specifically SM_80.
  - Added new GLOBAL_TO_SHARED memory space for LDGSTS instruction from Ampere.
- Added nvbit_read_ureg, nvbit_write_ureg to read/write uniform registers.
- Added nvbit_read_pred_reg, nvbit_write_pred_reg to read/write predicate register.
- Added nvbit_read_upred_reg, nvbit_write_upred_reg to read/write uniform predicate register.
- Added NVBIT_VERSION to nvbit.h, so tool writers can identify the NVBit version in their instrumentation tools.
- Added variadic instrument function support (with record_reg_vals as an example), so one can write instrument functions like dev_func(int num_args...).
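The variadic signature mentioned in the last item, dev_func(int num_args...), uses ordinary C-style variadic arguments. The sketch below shows how such a function walks its arguments. It is a host-side stand-in in plain C++, not the record_reg_vals tool: the name collect_reg_vals and the choice of uint32_t for each value are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdint>
#include <vector>

// Host-side stand-in that mirrors the shape dev_func(int num_args...).
// In an NVBit tool this would be a device function; here it is plain C++
// so the argument-walking logic can be compiled and run anywhere.
std::vector<uint32_t> collect_reg_vals(int num_args...) {
    std::vector<uint32_t> vals;
    va_list vl;
    va_start(vl, num_args);
    for (int i = 0; i < num_args; ++i) {
        // Each variadic argument is read as a 32-bit unsigned value.
        vals.push_back(va_arg(vl, uint32_t));
    }
    va_end(vl);
    return vals;
}
```

A call such as collect_reg_vals(3, 10u, 20u, 30u) passes the count first and then exactly that many values. The callee has no other way to know how many arguments follow, so the count and the actual argument list must agree.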
Fixed
- Fixed a bug that prevented callee functions from being instrumented if their caller function had no instructions to be instrumented.
Changed
- IARG_PRED_VAL_T and IARG_PRED_REG_T give uniform predicate register value if the instrumented instruction uses uniform predicate register.
- Changed move_replace tool to support uniform register.
- Changed mem_trace and mem_print tools to support instructions with more than one memory reference address (e.g., LDGSTS).
- Changed nvbit_enable_instrumented function to allow users to only enable/disable instrumentation on the specified function without affecting its related functions (the original and default behavior is to enable/disable instrumentation on the specified function and all its related functions).