The following two tables compare the performance of LightSeq and Faster Transformer (FT), tested on a Tesla T4 with a Transformer-base model. We also provide a TensorFlow (TF) baseline whose code comes from Faster Transformer.
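For reference, per-batch latency on the LightSeq side can be measured with a loop like the one below. This is a minimal sketch, not the harness that produced the numbers in these tables; the model file name `transformer.pb` and the random token IDs are placeholders, and the `lightseq.inference` Python API is used as shown in the LightSeq examples.

```python
import time

import numpy as np
import lightseq.inference as lsi

# "transformer.pb" is a placeholder for a model exported to LightSeq's
# format; the second argument is the maximum batch size.
model = lsi.Transformer("transformer.pb", 128)

batch_size, seq_len = 8, 32
# Dummy source token IDs; a real benchmark feeds tokenized input sentences.
input_ids = np.random.randint(1, 30000, size=(batch_size, seq_len), dtype=np.int32)

model.infer(input_ids)  # warm-up run so CUDA initialization is excluded
start = time.perf_counter()
for _ in range(10):
    model.infer(input_ids)
elapsed_ms = (time.perf_counter() - start) / 10 * 1000
print(f"batch={batch_size} seq_len={seq_len}: {elapsed_ms:.2f} ms/batch")
```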
Beam search
batch_size | beam_size | seq_len | TF (ms) | FT (ms) | lightseq (ms) | PyTorch (ms) | FT speedup | lightseq speedup | PyTorch speedup |
---|---|---|---|---|---|---|---|---|---|
1 | 4 | 32 | 419.53 | 26.25 | 29.66 | 385.23 | 15.98 | 14.14 | 1.09 |
1 | 4 | 64 | 806.38 | 54.02 | 63.04 | 760.77 | 14.93 | 12.79 | 1.06 |
8 | 4 | 32 | 439.64 | 35.99 | 34.77 | 416.06 | 12.22 | 12.64 | 1.06 |
8 | 4 | 64 | 891.54 | 79.82 | 79.43 | 835.79 | 11.17 | 11.22 | 1.07 |
32 | 4 | 32 | 536 | 82.82 | 59.49 | 429.78 | 6.47 | 9.01 | 1.25 |
32 | 4 | 64 | 1116.74 | 198.95 | 155.08 | 929.97 | 5.61 | 7.20 | 1.20 |
64 | 4 | 32 | 668.45 | 144.53 | 101.54 | 520.66 | 4.62 | 6.58 | 1.28 |
64 | 4 | 64 | 1476.17 | 351.14 | 277.4 | 1237.79 | 4.20 | 5.32 | 1.19 |
128 | 4 | 32 | 996.88 | 271.8 | 200.49 | 721.66 | 3.67 | 4.97 | 1.38 |
128 | 4 | 64 | 2157.85 | 671.76 | 502.91 | 2158.81 | 3.21 | 4.29 | 1.00 |
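In all three speedup columns, speedup is the TF baseline latency divided by the corresponding engine's latency; for example, in the first row, 419.53 ms / 26.25 ms ≈ 15.98 for FT.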
Sampling

In the topk/topp column, a value of 0.75 means top-p (nucleus) sampling with p = 0.75, and a value of 32 means top-k sampling with k = 32.
batch_size | topk/topp | seq_len | FT(ms) | lightseq(ms) | lightseq speedup |
---|---|---|---|---|---|
1 | 0.75 | 32 | 34.4 | 29.66 | 1.16 |
1 | 0.75 | 64 | 71.45 | 59.72 | 1.20 |
32 | 0.75 | 32 | 56.61 | 40.40 | 1.40 |
32 | 0.75 | 64 | 120.39 | 100.36 | 1.20 |
128 | 0.75 | 32 | 111.4 | 94.68 | 1.18 |
128 | 0.75 | 64 | 246.97 | 270.55 | 0.91 |
1 | 32 | 32 | 34.35 | 28.06 | 1.22 |
1 | 32 | 64 | 72.48 | 56.4 | 1.29 |
32 | 32 | 32 | 40.15 | 39.23 | 1.02 |
32 | 32 | 64 | 87.46 | 98.62 | 0.89 |
128 | 32 | 32 | 99 | 90.83 | 1.09 |
128 | 32 | 64 | 222.62 | 262 | 0.85 |
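To make the topk/topp distinction concrete, here is a plain-NumPy sketch of the two sampling strategies being benchmarked. LightSeq and FT implement these as fused CUDA kernels; the vocabulary size and logits below are made up for illustration.

```python
import numpy as np

def sample_top_k(logits: np.ndarray, k: int) -> int:
    """Sample a token ID from the k highest-probability candidates."""
    top = np.argsort(logits)[-k:]                    # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())  # stable softmax over the top-k
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))

def sample_top_p(logits: np.ndarray, p: float) -> int:
    """Sample from the smallest candidate set whose probability mass >= p."""
    order = np.argsort(logits)[::-1]                 # token IDs, descending by logit
    probs = np.exp(logits[order] - logits.max())     # stable softmax over full vocab
    probs /= probs.sum()
    keep = np.searchsorted(np.cumsum(probs), p) + 1  # size of the nucleus
    nucleus, nucleus_probs = order[:keep], probs[:keep] / probs[:keep].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))

logits = np.random.randn(30000).astype(np.float32)   # fake per-token logits
print(sample_top_k(logits, 32), sample_top_p(logits, 0.75))
```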
The following table compares performance on a fr2en (French-to-English) translation model, a Transformer-big with a beam size of 4 and a target vocabulary of approximately 30k. FP32 models are tested on a Tesla P4, and FP16 models on a Tesla T4.
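Each speedup column is the ratio of the two latencies named in its header, so the lightseq-fp16/tf-fp32 column is the product of the other two; for example, in the first row, 303 ms / 27 ms ≈ 11.22, which (up to rounding) equals 6.44 × 1.74.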
batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32/tf-fp32 speedup | lightseq-fp16/lightseq-fp32 speedup | lightseq-fp16/tf-fp32 speedup |
---|---|---|---|---|---|---|---|
1 | 6 | 303 | 47 | 27 | 6.44 | 1.74 | 11.22 |
1 | 12 | 399 | 63 | 38 | 6.33 | 1.66 | 10.5 |
1 | 18 | 702 | 108 | 59 | 6.5 | 1.83 | 11.9 |
1 | 24 | 1071 | 167 | 82 | 6.41 | 2.04 | 13.06 |
1 | 36 | 1234 | 192 | 105 | 6.42 | 1.83 | 11.75 |
1 | 46 | 1445 | 227 | 110 | 6.36 | 2.06 | 13.14 |
1 | 58 | 1887 | 303 | 142 | 6.22 | 2.13 | 13.29 |
1 | 70 | 2771 | 428 | 197 | 6.47 | 2.17 | 14.07 |
2 | 6 | 317 | 57 | 32 | 5.56 | 1.78 | 9.91 |
2 | 12 | 418 | 73 | 39 | 5.72 | 1.87 | 10.72 |
2 | 18 | 723 | 131 | 66 | 5.51 | 1.98 | 10.95 |
2 | 24 | 1113 | 201 | 91 | 5.53 | 2.21 | 12.23 |
2 | 36 | 1276 | 234 | 104 | 5.45 | 2.25 | 12.27 |
2 | 46 | 1521 | 282 | 121 | 5.39 | 2.33 | 12.57 |
2 | 58 | 2004 | 371 | 159 | 5.4 | 2.33 | 12.6 |
2 | 70 | 2965 | 542 | 221 | 5.47 | 2.45 | 13.42 |
4 | 6 | 326 | 61 | 39 | 5.34 | 1.56 | 8.36 |
4 | 12 | 433 | 85 | 47 | 5.09 | 1.81 | 9.21 |
4 | 18 | 761 | 154 | 77 | 4.94 | 2 | 9.88 |
4 | 24 | 1195 | 245 | 113 | 4.87 | 2.17 | 10.58 |
4 | 36 | 1391 | 282 | 128 | 4.93 | 2.2 | 10.87 |
4 | 46 | 1679 | 339 | 153 | 4.95 | 2.22 | 10.97 |
4 | 58 | 2232 | 455 | 199 | 4.9 | 2.29 | 11.22 |
4 | 70 | 3406 | 673 | 285 | 5.06 | 2.36 | 11.95 |
8 | 6 | 364 | 76 | 43 | 4.78 | 1.77 | 8.47 |
8 | 12 | 470 | 110 | 56 | 4.27 | 1.96 | 8.39 |
8 | 18 | 854 | 205 | 91 | 4.16 | 2.25 | 9.38 |
8 | 24 | 1381 | 318 | 139 | 4.34 | 2.29 | 9.94 |
8 | 36 | 1628 | 378 | 156 | 4.3 | 2.42 | 10.44 |
8 | 46 | 1989 | 459 | 193 | 4.33 | 2.38 | 10.31 |
8 | 58 | 2683 | 617 | 254 | 4.34 | 2.43 | 10.56 |
8 | 70 | 4251 | 949 | 382 | 4.47 | 2.48 | 11.13 |
The following table compares performance on an en2zh (English-to-Chinese) translation model, a Transformer-deep (the same configuration as Transformer-big except for a 16-layer encoder) with a beam size of 4 and a target vocabulary of approximately 30k. FP32 models are tested on a Tesla P4, and FP16 models on a Tesla T4.
batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | lightseq-fp32/tf-fp32 speedup | lightseq-fp16/lightseq-fp32 speedup | lightseq-fp16/tf-fp32 speedup |
---|---|---|---|---|---|---|---|
1 | 12 | 544 | 86 | 43 | 6.32 | 2 | 12.65 |
1 | 24 | 914 | 131 | 66 | 6.97 | 1.98 | 13.85 |
1 | 36 | 1290 | 200 | 93 | 6.45 | 2.15 | 13.87 |
1 | 48 | 1836 | 233 | 106 | 7.89 | 2.2 | 17.32 |
1 | 72 | 3456 | 482 | 212 | 7.17 | 2.27 | 16.3 |
1 | 84 | 2626 | 431 | 193 | 6.09 | 2.23 | 13.61 |
2 | 12 | 566 | 100 | 50 | 5.66 | 2 | 11.32 |
2 | 24 | 842 | 158 | 70 | 5.32 | 2.26 | 12.03 |
2 | 36 | 1287 | 247 | 103 | 5.21 | 2.4 | 12.5 |
2 | 48 | 1504 | 288 | 118 | 5.22 | 2.44 | 12.75 |
2 | 72 | 3131 | 611 | 240 | 5.12 | 2.55 | 13.05 |
2 | 84 | 2789 | 546 | 217 | 5.1 | 2.52 | 12.85 |
4 | 12 | 590 | 118 | 58 | 5 | 2.03 | 10.17 |
4 | 24 | 885 | 187 | 89 | 4.73 | 2.1 | 9.94 |
4 | 36 | 1380 | 301 | 127 | 4.58 | 2.37 | 10.87 |
4 | 48 | 1622 | 352 | 149 | 4.6 | 2.36 | 10.89 |
4 | 72 | 3492 | 763 | 311 | 4.57 | 2.45 | 11.23 |
4 | 84 | 3145 | 687 | 282 | 4.57 | 2.44 | 11.15 |
8 | 12 | 631 | 150 | 66 | 4.2 | 2.27 | 9.56 |
8 | 24 | 979 | 248 | 103 | 3.94 | 2.41 | 9.5 |
8 | 36 | 1584 | 412 | 156 | 3.84 | 2.64 | 10.15 |
8 | 48 | 1880 | 477 | 186 | 3.94 | 2.56 | 10.11 |
8 | 72 | 4218 | 1069 | 404 | 3.94 | 2.65 | 10.44 |
8 | 84 | 3831 | 976 | 373 | 3.92 | 2.62 | 10.27 |