Straggler-Aware In-Network Aggregation (SAINA) mitigates the straggler problem in distributed deep learning while preventing accuracy degradation. It aggregates local gradients of the fastest
We proposed Switch-Friendly Convergence Detection (SFCD) algorithm which detects a convergence point and increases
The source code of SAINA was implemented on top of SwitchML.
Batch Size = 16 | Batch Size = 32 | Batch Size = 64 |
---|---|---|
Batch Size = 16 | Batch Size = 32 | Batch Size = 64 |
---|---|---|
Batch Size = 16 | Batch Size = 32 | Batch Size = 64 |
---|---|---|
For ResNet50 model, when the batch size is set to 64, all schemes could not reach the target accuracy. Although we need to investigate better hyperparameters for improving the performance of all schemes, SAINA can reach the target accuracy faster than other schemes for other batch sizes.
VGG-16 | SqueezeNet | |
---|---|---|
30% | 472.4s | 771.2s |
40% | 448.8s | 543.1s |
50% | 426.0s | 401.0s |
60% | 555.6s | 508.4s |
70% | 564.0s | 530.3s |
VGG-16 | SqueezeNet | |
---|---|---|
1 | 451.0s | 475.5s |
3 | 426.3s | 464.1s |
5 | 426.0s | 401.0s |
7 | 451.1s | 417.3s |
9 | 463.0s | 434.1s |
|
423.8s | 396.5s |
Scheme | VGG-16 Final Accuracy | VGG-16 TTA Reduction |
---|---|---|
SAINA ( |
73.9% | - |
SAINA ( |
76.3% | 1.51x |
SAINA ( |
78.0% | 1.70x |
Scheme | VGG-16 Final Accuracy | SqueezeNet TTA Reduction |
---|---|---|
SAINA ( |
70.1% | 2.28x |
SAINA ( |
72.2% | 2.15x |
SAINA ( |
72.5% | 2.84x |
The above tables are results when the batch size is set to 32. In these tables, Final Accuracy means the accuracy when all training is completed, and TTA Reduction means how much TTA is reduced compared to when