Merlin: HugeCTR V3.8 (Merlin 22.07)

@minseokl released this 14 Jul 00:33 · 145585e

What's New in Version 3.8

  • Sample Notebook to Demonstrate 3G Embedding:
    This release includes a sample notebook that introduces the Python API of the
    embedding collection and the key concepts for using 3G embedding.
    You can view HugeCTR Embedding Collection from the documentation or access the
    embedding_collection.ipynb file from the notebooks directory of the repository.

  • DLPack Python API for Hierarchical Parameter Server Lookup:
    This release introduces support for embedding lookup from the Hierarchical
    Parameter Server (HPS) using the DLPack Python API. The new method is
    lookup_fromdlpack(). For sample usage, see the
    Lookup the Embedding Vector from DLPack
    heading in the "Hierarchical Parameter Server Demo" notebook.
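
    The sketch below illustrates one way this method might be called, using PyTorch
    tensors exchanged through DLPack. The HPS constructor, the argument order of
    lookup_fromdlpack(), and the model and table identifiers are assumptions for
    illustration; see the notebook for the authoritative usage.

        import torch
        from torch.utils.dlpack import to_dlpack
        from hugectr.inference import HPS  # import path assumed for the HPS Python API

        # Hypothetical HPS instance backed by a parameter server configuration file.
        hps = HPS("hps_config.json")

        # Keys to look up and a pre-allocated buffer that receives the embedding vectors.
        keys = torch.tensor([1, 3, 5, 7], dtype=torch.int64)
        out = torch.zeros((4, 16), dtype=torch.float32)  # 16 = assumed embedding vector size

        # Wrap both tensors as DLPack capsules and look up the vectors in place.
        hps.lookup_fromdlpack(to_dlpack(keys), to_dlpack(out),
                              "demo_model", 0)  # model name and table id are placeholders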

  • Read Parquet Datasets from HDFS with the Python API:
    This release enhances the DataReaderParams class with a data_source_params
    argument. You can use the argument to specify the data source configuration,
    such as the host name of the Hadoop NameNode and the NameNode port number,
    to read from HDFS.
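
    A minimal sketch of this configuration is shown below. The DataSourceParams field
    names (use_hdfs, namenode, port) and the surrounding DataReaderParams values are
    assumptions for illustration; check the Python API documentation for the exact
    signatures.

        import hugectr

        # Assumed fields: the Hadoop NameNode host name and port to read from.
        data_source_params = hugectr.DataSourceParams(
            use_hdfs=True,
            namenode="namenode.example.com",
            port=9000,
        )

        reader = hugectr.DataReaderParams(
            data_reader_type=hugectr.DataReaderType_t.Parquet,
            source=["/user/data/train/_file_list.txt"],   # paths resolved on HDFS
            eval_source="/user/data/val/_file_list.txt",
            check_type=hugectr.Check_t.Non,
            slot_size_array=[10000, 10000],               # illustrative vocabulary sizes
            data_source_params=data_source_params,
        )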

  • Logging Performance Improvements:
    This release includes an enhancement that reduces the performance impact of logging.

  • Enhancements to Layer Classes:

    • The FullyConnected layer now supports 3D inputs.
    • The MatrixMultiply layer now supports 4D inputs.
  • Documentation Enhancements:

  • Issues Fixed:

    • The data generator for the Parquet file type is fixed and now produces consistent file names between the _metadata.json file and the actual dataset files.
      Previously, running the data generator to create synthetic data resulted in a core dump.
      This issue was first reported in GitHub issue 321.
    • Fixed a memory crash that occurred during AUC warm-up when running a large model on multiple GPUs.
    • Fixed a keyset generation issue in the ETC notebook.
      Refer to GitHub issue 332 for more details.
    • Fixed an inference build error that occurred when building in debug mode.
    • Fixed an issue in which multi-node training printed duplicate messages.
  • Known Issues:

    • Hybrid embedding with IB_NVLINK as the communication_type of the
      HybridEmbeddingParam class does not currently work. We are working on fixing it.
      The other communication types are not affected.

    • HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources.
      If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

        --shm-size=1g --ulimit memlock=-1

      See also the NCCL known issue and the GitHub issue.
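
      For example, a container could be started with these resources increased as
      follows; the Merlin HugeCTR image tag is only illustrative:

        docker run --gpus=all --rm -it \
            --shm-size=1g --ulimit memlock=-1 \
            nvcr.io/nvidia/merlin/merlin-hugectr:22.07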

    • The KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
      To avoid data loss in conjunction with streaming model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.

    • The number of data files in the file list should be greater than or equal to the number of data reader workers.
      Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

    • Joint loss training with a regularizer is not supported.