Merlin: HugeCTR V3.8 (Merlin 22.07)
What's New in Version 3.8
- Sample Notebook to Demonstrate 3G Embedding:
  This release includes a sample notebook that introduces the Python API of the embedding collection and the key concepts for using 3G embedding.
  You can view HugeCTR Embedding Collection in the documentation or access the `embedding_collection.ipynb` file from the `notebooks` directory of the repository.
- DLPack Python API for Hierarchical Parameter Server Lookup:
  This release introduces support for embedding lookup from the Hierarchical Parameter Server (HPS) using the DLPack Python API.
  The new method is `lookup_fromdlpack()`.
  For sample usage, see the "Lookup the Embedding Vector from DLPack" heading in the "Hierarchical Parameter Server Demo" notebook, and the first sketch after the Known Issues list below.
- Read Parquet Datasets from HDFS with the Python API:
  This release enhances the `DataReaderParams` class with a `data_source_params` argument.
  You can use the argument to specify the data source configuration, such as the host name of the Hadoop NameNode and the NameNode port number, to read from HDFS.
  A configuration sketch follows the Known Issues list below.
- Logging Performance Improvements:
  This release reduces the performance impact of logging.
- Enhancements to Layer Classes:
  - The `FullyConnected` layer now supports 3D inputs (see the sketch after the Known Issues list below).
  - The `MatrixMultiply` layer now supports 4D inputs.
- Documentation Enhancements:
  - An automatically generated table of contents is added to the top of most pages in the web documentation. The goal is to provide a better experience for navigating long pages such as the HugeCTR Layer Classes and Methods page.
  - URLs to the Criteo 1TB click logs dataset are updated. For an example, see the HugeCTR Wide and Deep Model with Criteo notebook.
- Issues Fixed:
  - The data generator for the Parquet file type is fixed and produces consistent file names between the `_metadata.json` file and the actual dataset files. Previously, running the data generator to create synthetic data resulted in a core dump. This issue was first reported in GitHub issue 321.
  - Fixed a memory crash that occurred during AUC warmup when running a large model on multiple GPUs.
  - Fixed the issue of keyset generation in the ETC notebook. Refer to GitHub issue 332 for more details.
  - Fixed the inference build error that occurred when building in debug mode.
  - Fixed the issue where multi-node training printed duplicate messages.
- Known Issues:
  - Hybrid embedding with `IB_NVLINK` as the `communication_type` of the `HybridEmbeddingParam` class does not work currently. We are working on fixing it. The other communication types are not affected.
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
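The following is a minimal sketch of the new HPS DLPack lookup. It assumes an already-initialized HPS handle named `hps` (created as shown in the "Hierarchical Parameter Server Demo" notebook); the argument order for `lookup_fromdlpack()` (keys capsule, preallocated output capsule, model name, table ID) and the model name, table ID, and embedding width are assumptions to be verified against that notebook.

```python
import torch
from torch.utils.dlpack import to_dlpack

# Assumption: `hps` is an initialized Hierarchical Parameter Server handle,
# created as shown in the "Hierarchical Parameter Server Demo" notebook,
# with a deployed model named "my_model" (hypothetical) whose embedding
# table 0 holds 16-dimensional vectors (hypothetical).

keys = torch.tensor([1, 3, 5, 7], dtype=torch.int64)       # keys to look up
out = torch.zeros(keys.shape[0], 16, dtype=torch.float32)  # preallocated output

# Wrap both tensors as DLPack capsules; HPS reads the keys and writes the
# embedding vectors into the memory backing the output capsule.
hps.lookup_fromdlpack(to_dlpack(keys), to_dlpack(out), "my_model", 0)

print(out)  # `out` shares memory with the capsule, so it now holds the vectors
```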
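Next, a configuration sketch for the new `data_source_params` argument of `DataReaderParams`. The field and enum names on `DataSourceParams` are assumptions modeled on the HDFS training example, and the host, port, paths, and slot sizes are hypothetical placeholders.

```python
import hugectr

# Assumption: DataSourceParams carries the HDFS NameNode host and port;
# verify the exact field and enum names against the API documentation.
data_source_params = hugectr.DataSourceParams(
    source=hugectr.DataSourceType_t.HDFS,
    server="name-node.example.com",  # Hadoop NameNode host (hypothetical)
    port=9000,                       # NameNode port (hypothetical)
)

reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["/user/hugectr/train/_file_list.txt"],  # hypothetical HDFS paths
    eval_source="/user/hugectr/val/_file_list.txt",
    check_type=hugectr.Check_t.Non,
    slot_size_array=[10000] * 26,  # hypothetical per-slot vocabulary sizes
    num_workers=4,                 # keep <= the number of data files (see Known Issues)
    data_source_params=data_source_params,
)
```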
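Finally, a sketch of a `FullyConnected` layer applied to a 3D input, which this release enables. It assumes an existing `hugectr.Model` named `model` and a hypothetical 3D tensor `"emb_seq"` (batch, sequence length, width) already in the graph; in the Python API, the `FullyConnected` layer is expressed as `hugectr.Layer_t.InnerProduct`.

```python
import hugectr

# Assumption: `model` is an existing hugectr.Model and "emb_seq" is a 3D
# tensor (batch, seq_len, width) produced by an earlier layer (hypothetical
# name). FullyConnected corresponds to Layer_t.InnerProduct in the Python API.
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["emb_seq"],  # 3D input, supported as of this release
        top_names=["fc1"],
        num_output=256,            # hypothetical output width
    )
)
```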