Merge branch 'fix-doc_link_to_sok-jershi' into 'main'
Update docs

See merge request dl/hugectr/hugectr!1006
EmmaQiaoCh committed Oct 24, 2022
2 parents f262e9d + bbf6d0f commit 11d1acc
Showing 5 changed files with 14 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/source/api/python_interface.md
@@ -643,6 +643,7 @@ It trains the model for a fixed number of epochs (epoch mode) or iterations (non
* `snapshot`: Integer, the interval of iterations at which the snapshot model weights and optimizer states will be saved to files. This argument is invalid when embedding training cache is being used, which means no model parameters will be saved. The default value is 10000.

* `snapshot_prefix`: String, the prefix of the file names for the saved model weights and optimizer states. This argument is invalid when embedding training cache is being used, which means no model parameters will be saved. The default value is `''`. Remote file systems(HDFS and S3) are also supported. For example, for HDFS, the prefix can be `hdfs://localhost:9000/dir/to/model`. For S3, the prefix should be either virtual-hosted-style or path-style and contains the region information. For examples, take a look at the AWS official [documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html).
**Please note that dumping models to a remote file system is not yet supported when MPI is enabled.**

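As a minimal usage sketch, assuming `model` is an already-built and compiled `hugectr.Model`, the `snapshot` and `snapshot_prefix` arguments might be combined like this (all values are illustrative):

```python
# Minimal sketch: train in non-epoch (iteration) mode and periodically dump
# model weights and optimizer states. `model` is assumed to be an
# already-constructed and compiled hugectr.Model; values are illustrative.
model.fit(
    max_iter=20000,
    display=1000,
    eval_interval=2000,
    snapshot=10000,                                        # dump every 10000 iterations
    snapshot_prefix="hdfs://localhost:9000/dir/to/model",  # a local prefix such as "./model" works too
)
# Remote prefixes are not supported when MPI is enabled (see the note above).
```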
***

@@ -1090,7 +1091,7 @@ The stored sparse model can be used for both the later training and inference ca
Note that the key, slot id, and embedding vector are stored in the sparse model in the same sequence, so both the nth slot id in `slot_id` file and the nth embedding vector in the `emb_vector` file are mapped to the nth key in the `key` file.

**Arguments**
* `prefix`: String, the prefix of the saved files for model weights and optimizer states. There is NO default value and it should be specified by users. Remote file systems(HDFS and S3) are also supported. For example, for HDFS, the prefix can be `hdfs://localhost:9000/dir/to/model`. For S3, the prefix should be either virtual-hosted-style or path-style and contains the region information. For examples, take a look at the AWS official [documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html).
* `prefix`: String, the prefix of the saved files for model weights and optimizer states. There is NO default value and it should be specified by users. Remote file systems(HDFS and S3) are also supported. For example, for HDFS, the prefix can be `hdfs://localhost:9000/dir/to/model`. For S3, the prefix should be either virtual-hosted-style or path-style and contains the region information. For examples, take a look at the AWS official [documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html). **Please note that dumping models to a remote file system is not yet supported when MPI is enabled.** A usage sketch follows this argument list.

* `iter`: Integer, the current number of iterations, which will be the suffix of the saved files for model weights and optimizer states. The default value is 0.

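As a sketch of how these two arguments might be used, assuming the surrounding section documents a model-dump method such as `Model.save_params_to_files` (the method name is an assumption here; `prefix` and `iter` are the arguments described above):

```python
# Sketch only: manually dump model weights and optimizer states with an
# explicit prefix and iteration suffix. `save_params_to_files` is an assumed
# method name; `prefix` has no default and must be supplied, and `iter`
# becomes the suffix of the saved files.
model.save_params_to_files("hdfs://localhost:9000/dir/to/model", 4000)
# As noted above, dumping to a remote file system is not yet supported when MPI is enabled.
```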
9 changes: 9 additions & 0 deletions docs/source/sparse_operation_kit.md
@@ -0,0 +1,9 @@
# Sparse Operation Kit

[Sparse Operation Kit (SOK)](https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/sparse_operation_kit) is a Python package that wraps GPU-accelerated operations dedicated to sparse training / inference cases. It is designed to be compatible with common deep learning (DL) frameworks such as TensorFlow.
In sparse training / inference scenarios, such as CTR estimation, there are vast numbers of parameters that cannot fit into the memory of a single GPU. Many common DL frameworks offer only limited support for model parallelism (MP), which makes it difficult to use all available GPUs in a cluster to accelerate the whole training process.
SOK provides broad MP functionality to fully utilize all available GPUs, regardless of whether they are located in a single machine or in multiple machines. At the same time, SOK takes advantage of the existing data-parallel (DP) capabilities of DL frameworks to accelerate training while minimizing code changes. With SOK embedding layers, you can build a DNN model with mixed MP and DP: MP is used to shard large embedding parameter tables across the available GPUs to balance the workload, while DP is used for layers that consume only a small amount of GPU resources.

Please check the [SOK Documentation](https://nvidia-merlin.github.io/HugeCTR/sparse_operation_kit/master/index.html) for details.
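To make the mixed MP/DP picture concrete, here is a minimal sketch of combining an SOK embedding layer with ordinary data-parallel Keras layers under `tf.distribute.MirroredStrategy`. The class and argument names (`sok.Init`, `sok.All2AllDenseEmbedding`, and their parameters) follow the SOK 1.x API as an assumption and should be verified against the linked documentation; all sizes are illustrative.

```python
import tensorflow as tf
import sparse_operation_kit as sok  # assumed import name

# Data-parallel scope for the dense part of the model.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Assumed SOK 1.x initialization: lets embedding tables be sharded
    # (model-parallel) across all visible GPUs.
    sok.Init(global_batch_size=8192)

    class DemoModel(tf.keras.Model):
        def __init__(self):
            super().__init__()
            # Model-parallel embedding layer: the parameter table is sharded
            # across GPUs. Class and argument names are assumptions here.
            self.embedding = sok.All2AllDenseEmbedding(
                max_vocabulary_size_per_gpu=1024,
                embedding_vec_size=16,
                slot_num=10,
                nnz_per_slot=1)
            # Ordinary data-parallel dense layers.
            self.hidden = tf.keras.layers.Dense(256, activation="relu")
            self.out = tf.keras.layers.Dense(1, activation="sigmoid")

        def call(self, keys, training=False):
            emb = self.embedding(keys, training=training)   # [batch, slot, nnz, vec]
            flat = tf.reshape(emb, [tf.shape(emb)[0], -1])  # flatten per sample
            return self.out(self.hidden(flat))

    model = DemoModel()
```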

<img src="user_guide_src/workflow_of_embeddinglayer.png" width="1080px" align="center"/>
2 changes: 2 additions & 0 deletions docs/source/toc.yaml
@@ -28,6 +28,8 @@ subtrees:
- file: hierarchical_parameter_server/notebooks/hps_tensorflow_triton_deployment_demo.ipynb
- file: hierarchical_parameter_server/api/index.rst
title: API Documentation
- file: sparse_operation_kit.md
title: Sparse Operation Kit
- file: performance.md
title: Performance
- file: notebooks/index.md
(The remaining changed file cannot be displayed.)
1 change: 1 addition & 0 deletions release_notes.md
@@ -70,6 +70,7 @@ By using the interface, the input DLPack capsule of embedding key can be a GPU t
Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
+ Joint loss training with a regularizer is not supported.
+ Dumping Adam optimizer states to AWS S3 is not supported.
+ Dumping to remote file systems is not supported when MPI is enabled.

## What's New in Version 4.0

