[fukuda] Code-reading notes #2

Open
keisukefukuda opened this issue Aug 3, 2019 · 11 comments
keisukefukuda commented Aug 3, 2019

Code-reading notes

A summary, half chronological and half organized by topic.

Areas of interest

  • Allreduce algorithm
  • Ring creation and management
  • Amazon EFA libfabric bindings (external repository)
  • Transport
  • Net
  • Shared memory on CUDA sometimes dies with NCCL_ERROR_UNKNOWN, so investigate that

Table of contents

References


keisukefukuda commented Aug 3, 2019

Transport interface

Related typedefs/structs/classes

Notes

Inside selectTransport, ncclTransport::canConnect checks whether the two ncclPeerInfo structs (myInfo, peerInfo) can connect; if they can, setup establishes the connection (rough illustration below).

There are three kinds of Transport:

  • p2p = direct GPU-to-GPU communication
  • net = TCP?
  • shm = System V shared memory

Each transport's functions (member-function-like) have names such as netCanConnect.
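
As a rough illustration of the canConnect → setup flow described above. All type names, fields, and helpers here are invented for the sketch; they are not NCCL's actual definitions.

// Hypothetical illustration only; the real selection logic lives in src/transport.cc
// and uses NCCL's own types.
#include <stddef.h>

typedef unsigned long tvalue_t;

struct peerInfo { unsigned long hostHash; int cudaDev; };

struct transport {
  const char* name;
  // Writes a non-zero value into *ret if the two peers can connect.
  int (*canConnect)(tvalue_t* ret, struct peerInfo* me, struct peerInfo* peer);
};

// Try the transports in priority order and return the first one whose
// canConnect() says yes; the caller would then run its setup/connect.
static struct transport* pickTransport(struct transport** ts, int n,
                                       struct peerInfo* me, struct peerInfo* peer) {
  for (int i = 0; i < n; i++) {
    tvalue_t ret = 0;
    ts[i]->canConnect(&ret, me, peer);
    if (ret > 0) return ts[i];
  }
  return NULL;  // no usable transport between these two peers
}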

(Questions)

  • What is a proxy? What is the relationship between Transport and Proxy? → Look into it later. Why does only Net have a proxy? → (Hypothesis) It handles device⇔host memory transfers? P2P is GPUDirect, so the device driver and kernel module take care of D⇔H copies; SHM is a CUDA feature, so the CUDA Runtime handles it (possibly using NVLink behind the scenes). Only Net (TCP?) needs an explicit memory transfer?
  • With IB but no GPUDirect, does the transport become net or p2p? Without GPUDirect, who does the D⇔H transfer again?

NET transport

struct ncclTransport netTransport = {
  "NET",
  netCanConnect,
  netGetRings,
  { netSendSetup, netSendConnect, netSendFree, netSendProxy },
  { netRecvSetup, netRecvConnect, netRecvFree, netRecvProxy }
};

P2P transport

struct ncclTransport p2pTransport = {
  "P2P",
  p2pCanConnect,
  p2pGetRings,
  { p2pSendSetup, p2pSendConnect, p2pSendFree, NULL },
  { p2pRecvSetup, p2pRecvConnect, p2pRecvFree, NULL }
};

p2pCanConnect

https://sourcegraph.com/github.com/nccl-reader/nccl@7c72dee/-/blob/src/transport/p2p.cc#L58:1
Function that decides whether a p2p connection is possible. Looks like the core of intra-node GPU-to-GPU communication (NVLink).

Logic (a rough sketch follows this list):

  • If P2P is disabled by an option, it is disabled (*ret = 0)
  • If ncclParamP2pLevel is specified, it is used
  • If the nodes differ, unconditionally *ret = 0 (as you would expect): if (myInfo->hostHash != peerInfo->hostHash)
  • busIdToCudaDev() looks up the peer's CUDA device id
  • If my and peer are on the same device, *ret = 1 + PATH_SYS. As the code comment also says, this actually means "not supported" (1 + PATH_SYS == PATH_ARRAY_SIZE is a sentinel value, so it is not supported). You would think cudaMemcpy() should be enough if they are on the same device, but presumably there is some reason it cannot be supported
  • cudaDeviceCanAccessPeer() checks connectivity between the CUDA devices
  • If they cannot connect (if (p2p == 0)), *ret stays 0
  • getNvlinkGpu() checks whether NVLink is available? → if so, that value is returned
  • Finally, the PCI distance is counted: getCudaPath() obtains the path, and pciDistance() counts the distance along that path
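
A rough sketch of the decision order above; this is not the actual p2pCanConnect. pciDistanceBetween() is a made-up stand-in for the getCudaPath()/pciDistance() step, and the NVLink check (getNvlinkGpu()) is omitted.

#include <cuda_runtime.h>

enum { PATH_PIX = 0, PATH_PXB, PATH_PHB, PATH_NODE, PATH_SYS, PATH_ARRAY_SIZE };

// Returns 0 (cannot connect), 1 + PATH_SYS (same-device sentinel), or the PCI
// distance that the real code would then turn into the final *ret value.
int p2pCheckSketch(int myDev, int peerDev,
                   unsigned long myHostHash, unsigned long peerHostHash,
                   int (*pciDistanceBetween)(int, int)) {
  if (myHostHash != peerHostHash) return 0;       // different node: no P2P
  if (myDev == peerDev) return 1 + PATH_SYS;      // same device: sentinel, "not supported"

  int p2p = 0;
  cudaDeviceCanAccessPeer(&p2p, myDev, peerDev);  // can the two devices reach each other?
  if (p2p == 0) return 0;                         // no peer access: no P2P

  int dist = pciDistanceBetween(myDev, peerDev);  // PATH_PIX .. PATH_SYS
  // The real code compares this distance against ncclParamP2pLevel (and the
  // NVLink result) to produce the final value.
  return dist;
}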

p2pSendConnect

Takes an ncclConnect* and an ncclConnector* as arguments and performs the connection. Basically the connection is made with cudaIpcOpenMemHandle (standalone example after the structs below).

  • How do ncclConnect and ncclConnector differ? → ncclConnect is probably the transport-specific "connect info", whereas ncclConnector holds the finer-grained details such as the memory pointers used for the connection and is shared across the different transports.

ncclConnect()

struct ncclConnect {
  char data[CONNECT_SIZE];
};

😇


struct p2pConnectInfo {
 int direct;
 union {
   void* directPtr;
   cudaIpcMemHandle_t devIpc;
 };
};
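
A minimal standalone example of the CUDA IPC flow that p2pSendConnect builds on: the peer process exports a handle for its device buffer, and this process maps it (the handle would be carried in p2pConnectInfo::devIpc when direct == 0). Error handling is omitted for brevity.

#include <cuda_runtime.h>

// Exporter side (peer process): create an IPC handle for a device allocation.
cudaIpcMemHandle_t exportBuffer(void* devPtr) {
  cudaIpcMemHandle_t handle;
  cudaIpcGetMemHandle(&handle, devPtr);   // handle can be sent over any host IPC channel
  return handle;
}

// Importer side (this process): map the peer's allocation into our address space.
void* importBuffer(cudaIpcMemHandle_t handle) {
  void* remotePtr = NULL;
  cudaIpcOpenMemHandle(&remotePtr, handle, cudaIpcMemLazyEnablePeerAccess);
  return remotePtr;                       // usable in kernels and cudaMemcpy*Async
}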

SHM transport

struct ncclTransport shmTransport = {
  "SHM",
  shmCanConnect,
  shmGetRings,
  { shmSendSetup, shmSendConnect, shmSendFree, NULL },
  { shmRecvSetup, shmRecvConnect, shmRecvFree, NULL }
};


keisukefukuda commented Aug 3, 2019

Macros and constants

Notes on macros and constants that caught my eye

ncclPathDist

src/include/topo.h

enum ncclPathDist {
  PATH_PIX  = 0,
  PATH_PXB  = 1,
  PATH_PHB  = 2,
  PATH_NODE = 3,
  PATH_SYS  = 4,
  PATH_ARRAY_SIZE = 5
};

Constants representing distance over PCIe. You can see these with nvidia-smi topo -m:

$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx4_0	mlx4_1	CPU Affinity
GPU0	 X 	PIX	PIX	PIX	PHB	PHB	PHB	PHB	PIX	PHB	0-7,16-23
GPU1	PIX	 X 	PIX	PIX	PHB	PHB	PHB	PHB	PIX	PHB	0-7,16-23
GPU2	PIX	PIX	 X 	PIX	PHB	PHB	PHB	PHB	PIX	PHB	0-7,16-23
GPU3	PIX	PIX	PIX	 X 	PHB	PHB	PHB	PHB	PIX	PHB	0-7,16-23
GPU4	PHB	PHB	PHB	PHB	 X 	PIX	PIX	PIX	PHB	PIX	0-7,16-23
GPU5	PHB	PHB	PHB	PHB	PIX	 X 	PIX	PIX	PHB	PIX	0-7,16-23
GPU6	PHB	PHB	PHB	PHB	PIX	PIX	 X 	PIX	PHB	PIX	0-7,16-23
GPU7	PHB	PHB	PHB	PHB	PIX	PIX	PIX	 X 	PHB	PIX	0-7,16-23
mlx4_0	PIX	PIX	PIX	PIX	PHB	PHB	PHB	PHB	 X 	PHB
mlx4_1	PHB	PHB	PHB	PHB	PIX	PIX	PIX	PIX	PHB	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
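
For quick reference, a tiny program mapping the enum values above to the labels nvidia-smi prints (PATH_NODE and PATH_SYS correspond to NODE and SYS, which do not appear in this particular matrix):

#include <stdio.h>

enum ncclPathDist { PATH_PIX = 0, PATH_PXB = 1, PATH_PHB = 2,
                    PATH_NODE = 3, PATH_SYS = 4, PATH_ARRAY_SIZE = 5 };

static const char* pathName[PATH_ARRAY_SIZE] = { "PIX", "PXB", "PHB", "NODE", "SYS" };

int main(void) {
  for (int d = PATH_PIX; d < PATH_ARRAY_SIZE; d++)
    printf("%d = %s\n", d, pathName[d]);   // smaller value = closer on the PCIe topology
  return 0;
}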


keisukefukuda commented Aug 3, 2019

getCudaPath() and pciDistance()

cudaDeviceGetPCIBusId retrieves a "PCI bus id", a string of at most 13 characters (including the NUL terminator):

Returned identifier string for the device in the following format
[domain]:[bus]:[device].[function] where domain, bus, device,
and function are all hexadecimal values.
pciBusId should be large enough to store 13 characters including the NULL-terminator.

The format looks like 0000:00:00.0

getCudaPath() converts the string returned by cudaDeviceGetPCIBusId() into Linux's /sys PCI bus path notation.

There was an explanation of /sys/class around here.
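
A sketch of turning a CUDA device into a /sys PCI path, in the spirit of getCudaPath(). The real function in NCCL may build the path differently (e.g. via /sys/class/pci_bus); the path template below is just one valid sysfs layout.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

// Returns a malloc'd /sys path for the given CUDA device, or NULL on failure.
char* cudaDevToSysPath(int cudaDev) {
  char busId[16];                                         // "0000:00:00.0" plus NUL fits easily
  if (cudaDeviceGetPCIBusId(busId, sizeof(busId), cudaDev) != cudaSuccess) return NULL;
  for (char* p = busId; *p; p++) *p = (char)tolower(*p);  // sysfs uses lowercase hex digits

  char link[64];
  snprintf(link, sizeof(link), "/sys/bus/pci/devices/%s", busId);
  return realpath(link, NULL);                            // resolve to the full PCI tree path
}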


-✄-✄-✄-✄-✄-✄-✄-- cut here --✄-✄-✄-✄-✄-✄-✄--

From here: "2nd NCCL code-reading session", 2019/09/01


keisukefukuda commented Sep 1, 2019

Today, I will embed logging into the actual code and trace the flow of the event handlers.

First, dissect the persistentThread function.

Next, look inside netSendProxy, the actual body of proxyOp.


keisukefukuda commented Sep 1, 2019

Looking inside netSendProxy

struct ncclProxyArgs {
  proxyProgressFunc_t progress;
  struct ncclChannel* channel;
  struct ncclConnector* connector;
  int sliceSteps;
  int chunkSteps;
  int nsteps;
  uint64_t opCount;
  int llMode;
  int state;   // add component before this line -- it is left out during initialization
   
  // Internal state
  uint64_t head;
  uint64_t tail;
  uint64_t end;
  void* requests[NCCL_STEPS];
  int idle;
   
  // Element linking
  pthread_mutex_t mutex;
  struct ncclProxyArgs* next;
  struct ncclProxyArgs* nextPeer;
};
struct netSendResources {
  void* netSendComm;
  struct ncclSendMem* hostSendMem;
  struct ncclRecvMem* hostRecvMem;
  struct ncclSendMem* devHostSendMem;
  struct ncclRecvMem* devHostRecvMem;
  int netDev;
  int useGdr;
  int buffSize;
  void* mhandle;
  void* llMhandle;
  struct ncclRecvMem* devRecvMem;
  uint64_t step;
  uint64_t llLastCleaning;
};

Handling of args->state == ncclProxyOpProgress

Investigating with llMode = 0 (sketch of the progress pattern below).
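
A hedged sketch of the send-side progress pattern that the debug output below illustrates. NCCL_STEPS = 8 is assumed here (matching the 8-entry sizesFifo in the logs); netIsend/netTest are placeholders for the ncclNet plugin calls, and the real netSendProxy handles more than this (LL mode, sizesFifo, GDR, error handling).

#include <stdint.h>
#include <stddef.h>

#define NCCL_STEPS 8                    // assumption: 8 steps in flight

struct sendProxyState {
  uint64_t head;                        // oldest step whose send has not yet completed
  uint64_t tail;                        // next step to hand to the network
  uint64_t end;                         // total number of steps for this op (120 in the logs)
  int sliceSteps;                       // steps per network send (2 in the logs)
  void* requests[NCCL_STEPS];           // outstanding send requests
};

// One pass over a single op; persistentThread would keep calling this until head == end.
void progressOnce(struct sendProxyState* a, volatile uint64_t* recvTail) {
  // Post a new send when the GPU side has produced data (recvTail) beyond what
  // has already been posted (tail), and the request window is not full.
  if (a->tail < a->end && a->tail < a->head + NCCL_STEPS && *recvTail > a->tail) {
    int buffSlot = (int)(a->tail % NCCL_STEPS);   // matches the "buffSlot" lines in the logs
    a->requests[buffSlot] = NULL;                 // would hold the handle from netIsend(...)
    a->tail += a->sliceSteps;
  }
  // Retire the oldest posted send once the network reports it complete.
  if (a->head < a->tail) {
    // if (netTest(a->requests[a->head % NCCL_STEPS]) reports done)
    a->head += a->sliceSteps;
  }
}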

keisukefukuda@c18a429

↑ I embedded debug logs like this and ran it; the resulting output:

[1,15]<stdout>:KF: transport/net.cc:455 netSendProxy [Rank 15] ncclProxyOnProgress: -------------------
[1,15]<stdout>:KF: transport/net.cc:456 netSendProxy [Rank 15] ncclProxyOnProgress: args->head = 8
[1,15]<stdout>:KF: transport/net.cc:457 netSendProxy [Rank 15] ncclProxyOnProgress: args->tail = 10
[1,15]<stdout>:KF: transport/net.cc:458 netSendProxy [Rank 15] ncclProxyOnProgress: args->end = 120
[1,15]<stdout>:KF: transport/net.cc:459 netSendProxy [Rank 15] ncclProxyOnProgress: args->nsteps = 120
[1,15]<stdout>:KF: transport/net.cc:460 netSendProxy [Rank 15] ncclProxyOnProgress: args->llMode = 0
[1,15]<stdout>:KF: transport/net.cc:461 netSendProxy [Rank 15] ncclProxyOnProgress: args->sliceSteps = 2
[1,15]<stdout>:KF: transport/net.cc:462 netSendProxy [Rank 15] ncclProxyOnProgress: args->chunkSteps = 4
[1,15]<stdout>:KF: transport/net.cc:467 netSendProxy [Rank 15] ncclProxyOnProgress: *recvTail = 16
[1,15]<stdout>:KF: transport/net.cc:498 netSendProxy [Rank 15] ncclProxyOnProgress: buffSlot = 2
[1,15]<stdout>:KF: transport/net.cc:455 netSendProxy [Rank 15] ncclProxyOnProgress: -------------------
[1,15]<stdout>:KF: transport/net.cc:456 netSendProxy [Rank 15] ncclProxyOnProgress: args->head = 28
[1,15]<stdout>:KF: transport/net.cc:457 netSendProxy [Rank 15] ncclProxyOnProgress: args->tail = 30
[1,15]<stdout>:KF: transport/net.cc:458 netSendProxy [Rank 15] ncclProxyOnProgress: args->end = 120
[1,15]<stdout>:KF: transport/net.cc:459 netSendProxy [Rank 15] ncclProxyOnProgress: args->nsteps = 120
[1,15]<stdout>:KF: transport/net.cc:460 netSendProxy [Rank 15] ncclProxyOnProgress: args->llMode = 0
[1,15]<stdout>:KF: transport/net.cc:461 netSendProxy [Rank 15] ncclProxyOnProgress: args->sliceSteps = 2
[1,15]<stdout>:KF: transport/net.cc:462 netSendProxy [Rank 15] ncclProxyOnProgress: args->chunkSteps = 4
[1,15]<stdout>:KF: transport/net.cc:467 netSendProxy [Rank 15] ncclProxyOnProgress: *recvTail = 36
[1,15]<stdout>:KF: transport/net.cc:498 netSendProxy [Rank 15] ncclProxyOnProgress: buffSlot = 6
[1,15]<stdout>:KF: transport/net.cc:455 netSendProxy [Rank 15] ncclProxyOnProgress: -------------------
[1,15]<stdout>:KF: transport/net.cc:456 netSendProxy [Rank 15] ncclProxyOnProgress: args->head = 10
[1,15]<stdout>:KF: transport/net.cc:457 netSendProxy [Rank 15] ncclProxyOnProgress: args->tail = 12
[1,15]<stdout>:KF: transport/net.cc:458 netSendProxy [Rank 15] ncclProxyOnProgress: args->end = 120
[1,15]<stdout>:KF: transport/net.cc:459 netSendProxy [Rank 15] ncclProxyOnProgress: args->nsteps = 120
[1,15]<stdout>:KF: transport/net.cc:460 netSendProxy [Rank 15] ncclProxyOnProgress: args->llMode = 0
[1,15]<stdout>:KF: transport/net.cc:461 netSendProxy [Rank 15] ncclProxyOnProgress: args->sliceSteps = 2
[1,15]<stdout>:KF: transport/net.cc:462 netSendProxy [Rank 15] ncclProxyOnProgress: args->chunkSteps = 4
[1,15]<stdout>:KF: transport/net.cc:467 netSendProxy [Rank 15] ncclProxyOnProgress: *recvTail = 18
[1,15]<stdout>:KF: transport/net.cc:498 netSendProxy [Rank 15] ncclProxyOnProgress: buffSlot = 4
[1,15]<stdout>:KF: transport/net.cc:455 netSendProxy [Rank 15] ncclProxyOnProgress: -------------------
[1,15]<stdout>:KF: transport/net.cc:456 netSendProxy [Rank 15] ncclProxyOnProgress: args->head = 30
[1,15]<stdout>:KF: transport/net.cc:457 netSendProxy [Rank 15] ncclProxyOnProgress: args->tail = 32
[1,15]<stdout>:KF: transport/net.cc:458 netSendProxy [Rank 15] ncclProxyOnProgress: args->end = 120
[1,15]<stdout>:KF: transport/net.cc:459 netSendProxy [Rank 15] ncclProxyOnProgress: args->nsteps = 120
[1,15]<stdout>:KF: transport/net.cc:460 netSendProxy [Rank 15] ncclProxyOnProgress: args->llMode = 0
[1,15]<stdout>:KF: transport/net.cc:461 netSendProxy [Rank 15] ncclProxyOnProgress: args->sliceSteps = 2
[1,15]<stdout>:KF: transport/net.cc:462 netSendProxy [Rank 15] ncclProxyOnProgress: args->chunkSteps = 4
[1,15]<stdout>:KF: transport/net.cc:467 netSendProxy [Rank 15] ncclProxyOnProgress: *recvTail = 38
[1,15]<stdout>:KF: transport/net.cc:498 netSendProxy [Rank 15] ncclProxyOnProgress: buffSlot = 0


-✄-✄-✄-✄-✄-✄-✄-- cut here --✄-✄-✄-✄-✄-✄-✄--

From here: "3rd NCCL code-reading session", 2019/10/06


# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 0 iters: 1 validation: 1
#
# Using devices
#   Rank  0 Pid  32315 on GPCL-GPU131 device  0 [0x04] Tesla P100-PCIE-16GB
#   Rank  1 Pid  32317 on GPCL-GPU131 device  1 [0x05] Tesla P100-PCIE-16GB
#   Rank  2 Pid  32319 on GPCL-GPU131 device  2 [0x06] Tesla P100-PCIE-16GB
#   Rank  3 Pid  32321 on GPCL-GPU131 device  3 [0x07] Tesla P100-PCIE-16GB
#   Rank  4 Pid  32323 on GPCL-GPU131 device  4 [0x0b] Tesla P100-PCIE-16GB
#   Rank  5 Pid  32325 on GPCL-GPU131 device  5 [0x0c] Tesla P100-PCIE-16GB
#   Rank  6 Pid  32328 on GPCL-GPU131 device  6 [0x0d] Tesla P100-PCIE-16GB
#   Rank  7 Pid  32330 on GPCL-GPU131 device  7 [0x0e] Tesla P100-PCIE-16GB
#   Rank  8 Pid  16909 on GPCL-GPU132 device  0 [0x04] Tesla P100-PCIE-16GB
#   Rank  9 Pid  16911 on GPCL-GPU132 device  1 [0x05] Tesla P100-PCIE-16GB
#   Rank 10 Pid  16914 on GPCL-GPU132 device  2 [0x06] Tesla P100-PCIE-16GB
#   Rank 11 Pid  16916 on GPCL-GPU132 device  3 [0x07] Tesla P100-PCIE-16GB
#   Rank 12 Pid  16918 on GPCL-GPU132 device  4 [0x0b] Tesla P100-PCIE-16GB
#   Rank 13 Pid  16920 on GPCL-GPU132 device  5 [0x0c] Tesla P100-PCIE-16GB
#   Rank 14 Pid  16922 on GPCL-GPU132 device  6 [0x0d] Tesla P100-PCIE-16GB
#   Rank 15 Pid  16926 on GPCL-GPU132 device  7 [0x0e] Tesla P100-PCIE-16GB
[GPCL-GPU131:31907] 15 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[GPCL-GPU131:31907] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
KF: transport/net.cc:428 netSendProxy [Rank 15] ncclProxyOnReady: -------------------
KF: transport/net.cc:429 netSendProxy [Rank 15] ncclProxyOnReady: opCount = 0
KF: transport/net.cc:430 netSendProxy [Rank 15] ncclProxyOnReady: resources->netDev = 0
KF: transport/net.cc:431 netSendProxy [Rank 15] ncclProxyOnReady: resources->useGdr = 0
KF: transport/net.cc:432 netSendProxy [Rank 15] ncclProxyOnReady: resources->buffSize = 4194304
KF: transport/net.cc:433 netSendProxy [Rank 15] ncclProxyOnReady: resources->step = 0
KF: transport/net.cc:441 netSendProxy [Rank 15] ncclProxyOnReady: args->head = 0
KF: transport/net.cc:442 netSendProxy [Rank 15] ncclProxyOnReady: args->tail = 0
KF: transport/net.cc:443 netSendProxy [Rank 15] ncclProxyOnReady: args->end = 120
KF: transport/net.cc:444 netSendProxy [Rank 15] ncclProxyOnReady: args->nsteps = 120
KF: transport/net.cc:446 netSendProxy [Rank 15] ncclProxyOnReady: sizesFifo = -1 -1 -1 -1 -1 -1 -1 -1
KF: transport/net.cc:455 netSendProxy [Rank 15] ncclProxyOnProgress: -------------------
KF: transport/net.cc:456 netSendProxy [Rank 15] ncclProxyOnProgress: args->head = 0
KF: transport/net.cc:457 netSendProxy [Rank 15] ncclProxyOnProgress: args->tail = 0
KF: transport/net.cc:458 netSendProxy [Rank 15] ncclProxyOnProgress: args->end = 120
KF: transport/net.cc:459 netSendProxy [Rank 15] ncclProxyOnProgress: args->nsteps = 120
KF: transport/net.cc:460 netSendProxy [Rank 15] ncclProxyOnProgress: args->llMode = 0
KF: transport/net.cc:461 netSendProxy [Rank 15] ncclProxyOnProgress: args->sliceSteps = 2
KF: transport/net.cc:462 netSendProxy [Rank 15] ncclProxyOnProgress: args->chunkSteps = 4
KF: transport/net.cc:467 netSendProxy [Rank 15] ncclProxyOnProgress: *recvTail = 4
KF: transport/net.cc:498 netSendProxy [Rank 15] ncclProxyOnProgress: buffSlot = 0


Investigating the meaning of netSendResources.netDev

https://sourcegraph.com/github.com/keisukefukuda/nccl@code-reading/-/blob/src/transport/net.cc#L295

resources->netDev = getDev(cudaDev, channelId);

It picks the one with the highest score obtained from getScore() (I have not understood the details).

The device is obtained with the getDev function. It first gets the device ID normally with cudaGetDevice(), then additionally goes through getDev(). Why?

Inside getDev()

https://sourcegraph.com/github.com/keisukefukuda/nccl@code-reading/-/blob/src/transport/net.cc#L229:1

Computed using ncclNetTValues

ncclTValues[]

https://sourcegraph.com/github.com/keisukefukuda/nccl@code-reading/-/blob/src/transport/net.cc#L23:21

This is a score based on distance:

ncclTvalue_t score = 1 + PATH_SYS - distances[d];
tvalue |= ((score & NET_BITS_PER_IF_MASK)<<(NET_BITS_PER_IF*d));
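
A worked example of the packing above. I am assuming NET_BITS_PER_IF is 3 (so NET_BITS_PER_IF_MASK = 0x7), and ncclTvalue_t is typedef'd locally just for the sketch: each network interface d gets a small score derived from its PCI distance, packed into one value.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t ncclTvalue_t;                       // local typedef for the sketch
#define NET_BITS_PER_IF 3                            // assumption: 3 bits of score per interface
#define NET_BITS_PER_IF_MASK ((1 << NET_BITS_PER_IF) - 1)
enum { PATH_PIX = 0, PATH_PXB, PATH_PHB, PATH_NODE, PATH_SYS };

int main(void) {
  int distances[2] = { PATH_PIX, PATH_PHB };         // e.g. mlx4_0 is close, mlx4_1 is farther
  ncclTvalue_t tvalue = 0;
  for (int d = 0; d < 2; d++) {
    ncclTvalue_t score = 1 + PATH_SYS - distances[d];                // closer => higher (1..5)
    tvalue |= ((score & NET_BITS_PER_IF_MASK) << (NET_BITS_PER_IF * d));
  }
  printf("tvalue = 0x%llx\n", (unsigned long long)tvalue);           // 0x1d: if0 score 5, if1 score 3
  return 0;
}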


keisukefukuda commented Oct 6, 2019

Investigating resources->useGdr

Set in netGetGdrSupport:
https://sourcegraph.com/github.com/keisukefukuda/nccl@code-reading/-/blob/src/transport/net.cc#L251:7

From experiments, nvmlDev and cudaDev seem to hold the same value.

By default, gdrReadParam = -2, apparently:

NCCL_PARAM(NetGdrRead, "NET_GDR_READ", -2);

The meaning of the comment "// For reads (sends) only enable under certain conditions": when NCCL_NET_GDR_READ < 0 (which is the default), GDR cannot be used unless NVLink is enabled (decision sketched below).
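
A hypothetical helper capturing that rule (not the real netGetGdrSupport signature); "read" here means the send direction, as in the log lines below.

int useGdrSketch(int isRead, int gdrReadParam, int nvlink, int distance, int netGdrLevel) {
  if (isRead) {
    if (gdrReadParam == 0) return 0;                // reads explicitly disabled
    if (gdrReadParam < 0 && nvlink == 0) return 0;  // default (-2): reads require NVLink
  }
  return distance < netGdrLevel ? 1 : 0;            // e.g. "distance 0 < 2" in the log below
}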

KF: transport/net.cc:260 netGetGdrSupport [Rank 7] netGetGdrSupport: nvmlDev = 7
KF: transport/net.cc:261 netGetGdrSupport [Rank 7] netGetGdrSupport: read = 1
KF: transport/net.cc:262 netGetGdrSupport [Rank 7] netGetGdrSupport: HCA = 0
KF: transport/net.cc:266 netGetGdrSupport [Rank 7] netGetGdrSupport: gdrReadParam = -2
KF: transport/net.cc:271 netGetGdrSupport [Rank 7] netGetGdrSupport: netGdrLevel = 2
KF: transport/net.cc:304 netGetGdrSupport [Rank 8] netGetGdrSupport: flags = 3
KF: transport/net.cc:305 netGetGdrSupport [Rank 8] netGetGdrSupport: flags & NCCL_PTR_CUDA = 2
KF: transport/net.cc:304 netGetGdrSupport [Rank 4] netGetGdrSupport: flags = 3
KF: transport/net.cc:305 netGetGdrSupport [Rank 4] netGetGdrSupport: flags & NCCL_PTR_CUDA = 2
KF: transport/net.cc:280 netGetGdrSupport [Rank 7] netGetGdrSupport: nvlink = 0
KF: transport/net.cc:308 netGetGdrSupport [Rank 12] NET/IB : GPU Direct RDMA Enabled for GPU 4[4] / HCA 0 (distance 0 < 2), read 0

netDistance()

Its real body is --> pciDistance()

ncclNetPtrSupport()

https://sourcegraph.com/github.com/keisukefukuda/nccl@code-reading/-/blob/src/include/net.h#L20:21

  • The int dev argument refers to the HCA
  • Its real body is ncclNet->ptrSupport
  • The actual instance of ncclNet is ncclNetIb
  • --> ncclIbPtrSupport
  • And that, in turn, is mostly ncclIbGdrSupport (minimal check sketched after this list)
    • Checks whether "/sys/kernel/mm/memory_peers/nv_mem/version" is readable
    • Calls wrap_direct_ibv_reg_mr --> TODO
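
A minimal check in the same spirit as the readability test above (my own sketch, not the exact code in ncclIbGdrSupport):

#include <unistd.h>

// The sysfs file below is exposed by the nv_peer_mem (GPUDirect RDMA) kernel module,
// so being able to read it is a reasonable proxy for "GDR is available on this host".
int ibGdrModuleLoaded(void) {
  return access("/sys/kernel/mm/memory_peers/nv_mem/version", R_OK) == 0;
}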

getNvmlDevice()


keisukefukuda commented Oct 6, 2019

thanks to Ueno-san

(Sorting out the terminology; worked numbers in the sketch after this list)

  • Buffer = chunk * 2 (default 4 MiB)
  • Chunk = slice * 2 (2 in the case of Ring Allreduce)
  • Slice = step * 2 (allreduce_slice_steps = 2) (inside the primitives, the unit of send/receive for Allreduce)
  • Step = the smallest unit (so 1/8 of the buffer size) (for Allreduce, nothing concrete happens at this granularity)
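
Worked numbers for the terms above, assuming the default 4 MiB buffer:

#include <stdio.h>

int main(void) {
  long buffer = 4L << 20;     // default buffer size: 4 MiB
  long chunk  = buffer / 2;   // Buffer = chunk * 2   -> 2 MiB
  long slice  = chunk  / 2;   // Chunk  = slice * 2   -> 1 MiB
  long step   = slice  / 2;   // Slice  = step  * 2   -> 512 KiB
  printf("chunk=%ld slice=%ld step=%ld (step == buffer/8: %d)\n",
         chunk, slice, step, step == buffer / 8);
  return 0;
}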
