Merge branch 'fix-doc_link_to_sok-jershi' into 'main'
Update docs

See merge request dl/hugectr/hugectr!1006
EmmaQiaoCh committed Oct 24, 2022
2 parents f262e9d + bbf6d0f commit 11d1acc
Showing 5 changed files with 14 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/source/api/python_interface.md
@@ -643,6 +643,7 @@ It trains the model for a fixed number of epochs (epoch mode) or iterations (non
* `snapshot`: Integer, the interval of iterations at which the snapshot model weights and optimizer states will be saved to files. This argument is invalid when embedding training cache is being used, which means no model parameters will be saved. The default value is 10000.

* `snapshot_prefix`: String, the prefix of the file names for the saved model weights and optimizer states. This argument is invalid when embedding training cache is being used, which means no model parameters will be saved. The default value is `''`. Remote file systems(HDFS and S3) are also supported. For example, for HDFS, the prefix can be `hdfs://localhost:9000/dir/to/model`. For S3, the prefix should be either virtual-hosted-style or path-style and contains the region information. For examples, take a look at the AWS official [documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html).
**Please note that dumping models to a remote file system is not yet supported when MPI is enabled.**

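As a minimal usage sketch, assuming `model` is an already-built and compiled `hugectr.Model`, the `snapshot` and `snapshot_prefix` arguments might be combined like this (all values are illustrative):

```python
# Minimal sketch: train in non-epoch (iteration) mode and periodically dump
# model weights and optimizer states. `model` is assumed to be an
# already-constructed and compiled hugectr.Model; values are illustrative.
model.fit(
    max_iter=20000,
    display=1000,
    eval_interval=2000,
    snapshot=10000,                                        # dump every 10000 iterations
    snapshot_prefix="hdfs://localhost:9000/dir/to/model",  # a local prefix such as "./model" works too
)
# Remote prefixes are not supported when MPI is enabled (see the note above).
```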
***

@@ -1090,7 +1091,7 @@ The stored sparse model can be used for both the later training and inference ca
Note that the key, slot id, and embedding vector are stored in the sparse model in the same sequence, so both the nth slot id in `slot_id` file and the nth embedding vector in the `emb_vector` file are mapped to the nth key in the `key` file.

**Arguments**
* `prefix`: String, the prefix of the saved files for model weights and optimizer states. There is NO default value and it should be specified by users. Remote file systems(HDFS and S3) are also supported. For example, for HDFS, the prefix can be `hdfs://localhost:9000/dir/to/model`. For S3, the prefix should be either virtual-hosted-style or path-style and contains the region information. For examples, take a look at the AWS official [documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html).
* `prefix`: String, the prefix of the saved files for model weights and optimizer states. There is NO default value and it should be specified by users. Remote file systems(HDFS and S3) are also supported. For example, for HDFS, the prefix can be `hdfs://localhost:9000/dir/to/model`. For S3, the prefix should be either virtual-hosted-style or path-style and contains the region information. For examples, take a look at the AWS official [documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html). **Please note that dumping models to a remote file system is not yet supported when MPI is enabled.** A usage sketch follows this argument list.

* `iter`: Integer, the current number of iterations, which will be the suffix of the saved files for model weights and optimizer states. The default value is 0.

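As a sketch of how these two arguments might be used, assuming the surrounding section documents a model-dump method such as `Model.save_params_to_files` (the method name is an assumption here; `prefix` and `iter` are the arguments described above):

```python
# Sketch only: manually dump model weights and optimizer states with an
# explicit prefix and iteration suffix. `save_params_to_files` is an assumed
# method name; `prefix` has no default and must be supplied, and `iter`
# becomes the suffix of the saved files.
model.save_params_to_files("hdfs://localhost:9000/dir/to/model", 4000)
# As noted above, dumping to a remote file system is not yet supported when MPI is enabled.
```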
9 changes: 9 additions & 0 deletions docs/source/sparse_operation_kit.md
@@ -0,0 +1,9 @@
# Sparse Operation Kit

[Sparse Operation Kit (SOK)](https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/sparse_operation_kit) is a Python package that wraps GPU-accelerated operations dedicated to sparse training / inference cases. It is designed to be compatible with common deep learning (DL) frameworks such as TensorFlow.
In sparse training / inference scenarios, such as CTR estimation, there are vast numbers of parameters that cannot fit into the memory of a single GPU. Many common DL frameworks offer only limited support for model parallelism (MP), which makes it difficult to use all available GPUs in a cluster to accelerate the whole training process.
SOK provides broad MP functionality to fully utilize all available GPUs, regardless of whether they are located in a single machine or in multiple machines. At the same time, SOK takes advantage of the existing data-parallel (DP) capabilities of DL frameworks to accelerate training while minimizing code changes. With SOK embedding layers, you can build a DNN model with mixed MP and DP: MP is used to shard large embedding parameter tables across the available GPUs to balance the workload, while DP is used for layers that consume only a small amount of GPU resources.

Please check the [SOK Documentation](https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/index.html) for details.
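To make the mixed MP/DP picture concrete, here is a minimal sketch of combining an SOK embedding layer with ordinary data-parallel Keras layers under `tf.distribute.MirroredStrategy`. The class and argument names (`sok.Init`, `sok.All2AllDenseEmbedding`, and their parameters) follow the SOK 1.x API as an assumption and should be verified against the linked documentation; all sizes are illustrative.

```python
import tensorflow as tf
import sparse_operation_kit as sok  # assumed import name

# Data-parallel scope for the dense part of the model.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Assumed SOK 1.x initialization: lets embedding tables be sharded
    # (model-parallel) across all visible GPUs.
    sok.Init(global_batch_size=8192)

    class DemoModel(tf.keras.Model):
        def __init__(self):
            super().__init__()
            # Model-parallel embedding layer: the parameter table is sharded
            # across GPUs. Class and argument names are assumptions here.
            self.embedding = sok.All2AllDenseEmbedding(
                max_vocabulary_size_per_gpu=1024,
                embedding_vec_size=16,
                slot_num=10,
                nnz_per_slot=1)
            # Ordinary data-parallel dense layers.
            self.hidden = tf.keras.layers.Dense(256, activation="relu")
            self.out = tf.keras.layers.Dense(1, activation="sigmoid")

        def call(self, keys, training=False):
            emb = self.embedding(keys, training=training)   # [batch, slot, nnz, vec]
            flat = tf.reshape(emb, [tf.shape(emb)[0], -1])  # flatten per sample
            return self.out(self.hidden(flat))

    model = DemoModel()
```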

<img src="user_guide_src/workflow_of_embeddinglayer.png" width="1080px" align="center"/>
2 changes: 2 additions & 0 deletions docs/source/toc.yaml
@@ -28,6 +28,8 @@ subtrees:
- file: hierarchical_parameter_server/notebooks/hps_tensorflow_triton_deployment_demo.ipynb
- file: hierarchical_parameter_server/api/index.rst
title: API Documentation
- file: sparse_operation_kit.md
title: Sparse Operation Kit
- file: performance.md
title: Performance
- file: notebooks/index.md
(The remaining changed file cannot be displayed.)
1 change: 1 addition & 0 deletions release_notes.md
@@ -70,6 +70,7 @@ By using the interface, the input DLPack capsule of embedding key can be a GPU t
Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
+ Joint loss training with a regularizer is not supported.
+ Dumping Adam optimizer states to AWS S3 is not supported.
+ Dumping to remote file systems is not supported when MPI is enabled.

## What's New in Version 4.0

