This document explains the details of each models, such as export configuration, and input/output argument for onnx model.
Export configuration
config name | type | default | detail |
---|---|---|---|
feats_dim | int | 80 | Dimension of the input feature |
model input
input name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
feats | Input feature of the speech. | (1, feats_length, feats_dim) |
float32 | 1 |
model output
output name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
encoder_out | Output feature of encoder. | (1, feats_length, encoder_feats_dim) |
float32 | 1 |
encoder_out_lens | Length of output feature. | (1,) |
int64 | - |
Xformer encoder supports the following models
models |
---|
ConformerEncoder |
TransformerEncoder |
Export configuration
config name | type | default | detail |
---|---|---|---|
feats_dim | int | 80 | Dimension of the input feature |
max_seq_len | int | 512 | Maximum sequence length. |
model input
input name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
feats | Input feature of the speech. | (1, feats_length, feats_dim) |
float32 | 1 |
model output
output name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
encoder_out | Output feature of encoder. | (1, feats_length, encoder_feats_dim) |
float32 | 1 |
encoder_out_lens | Length of output feature. | (1,) |
int64 | - |
ContextualBlockXformer supports the following models.
models |
---|
contextual_block_conformer |
contextual_block_transformer |
Export configuration
config name | type | default | detail |
---|---|---|---|
feats_dim | int | 80 | Dimension of the input feature |
model input
input name | detail | shape | dtype | dynamic dim | previous |
---|---|---|---|---|---|
xs_pad | Input feature of the speech. | * Check Note for detail | float32 | 1 | - |
mask | Mask for every encoders. | (1, 1, block_size + 2, block_size + 2) |
float32 | 2, 3 | - |
buffer_before_downsampling | Model cache. This will be concatenated before the subsampling. | (1, subsample * 2, feats_dim) |
float32 | 1 | ◯ |
buffer_after_downsampling | Model cache. This will be concatenated after the subsampling. | (1, overlap_size, embed_dim) |
float32 | 1 | ◯ |
prev_addin | Addin to append before computation of the encoders. | (1, 1, embed_dim) |
float32 | - | ◯ |
pos_enc_xs | Positional encoding for input feature. | (1, block_size, embed_dim) |
float32 | 1 | - |
pos_enc_addin | Positional encoding for input addin. | (1, 1, embed_dim) |
float32 | - | - |
past_encoder_ctx | Previous contexutal vector | (1, n_encoders, h_enc) |
float32 | - | ◯ |
is_first | Flag to check if the first iteration | (1,) |
int64 | - | - |
Note:
mask
should be created with the following process.
mask = np.zeros((1, 1, block_size+2, block_size+2), dtype=np.float32)
mask[..., 1:, :-1] = 1
-
Arguments with
previous == ◯
should be an output of the previous inference, and zeros for the first inference. -
overlap_size
is computed asblock_size - hop_size
. -
embed_dim
is output dimension of positional encoding. -
is_first
is 1 for the first iteration, and 0 for the second and later iterations -
The size of
xs_pad
should should be the following:- First iteration:
(1, (block_size + 2) * subsample, feats_dim)
- Second or later iteration:
(1, hop_size * subsample, feats_dim)
- First iteration:
model output
input name | detail | shape | dtype | dynamic dim | next |
---|---|---|---|---|---|
ys_pad | Output of the streaming encoder. | (1, hop_size, encoder_feats_dim) |
float32 | 1 | - |
next_buffer_before_downsampling | This output will be an input for the next inference as buffer_before_downsampling |
(1, subsample * 2, feats_dim) |
float32 | 1 | ◯ |
next_buffer_after_downsampling | This output will be an input for the next inference as buffer_after_downsampling |
(1, overlap_size, embed_dim) |
float32 | 1 | ◯ |
next_addin | This output will be an input for the next inference as prev_addin |
(1, 1, embed_dim) |
float32 | - | ◯ |
next_encoder_ctx | This output will be an input for the next inference as past_encoder_ctx |
(1, n_encoders, h_enc) |
float32 | - | ◯ |
Export configuration
There is no configuration available.
model input
input name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
vy | Input sequence | (1, seq_len) |
int64 | 1 |
z_prev_{i} | List of caches. The length equals to number of decoders. | List[(1, hidden_size) ] |
float32 | - |
c_prev_{i} | List of caches. The length equals to number of decoders. | List[(1, hidden_size) ] |
float32 | - |
a_prev_{i} | List of caches. The length equals to number of attentions. | *1 | float32 | *1 |
pceh_{i} | List of caches. pceh stands for pre_computed_enc_h |
*2 | float32 | 1 |
enc_h_{i} | List of caches. The length equals to number of attentions. | (1, feat_length, enc_size) |
float32 | 1 |
mask_{i} | List of mask. The length equals to number of attentions. *3 | List[(feat_length, feat_length) ] |
float32 | 0, 1 |
-
*1: The shape and dynamic axes of
a_prev_{i}
depends on the attention type.Attention type shape dynamic axes coverage
, orcoverage_location
(1, 1, feat_length)
2 others (1,feat_length)
1 -
*2: The shape pf
pceh_{i}
depends on the attention type.Attention type shape NoAtt (1, 1, 1)
others (1, feat_length, out_feature)
where
out_features
equals tomlp_enc.out_features
of each attention inatt_list
. -
*3: Each
mask
should be created as follows:from espnet_onnx.utils.function import make_pad_mask mask = make_pad_mask([feat_length]) * -float('inf')
model output
output name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
logp | Output feature of decoder. | (1, feats_length, decoder_feats_dim) |
float32 | 1 |
c_list_{i} | This argument should be an input of the next c_prev_{i} |
List[(1, hidden_size) ] |
float32 | - |
z_list_{i} | This argument should be an input of the next z_prev_{i} |
List[(1, hidden_size) ] |
float32 | - |
att_w *1 | This argument should be an input of the next a_prev_{i} |
*1 | float32 | - |
att_w_{i} | This argument should be an input of the next a_prev_{i} |
*1 | float32 | - |
- *1: When
num_enc == 1
, then output name isatt_w
, otherwiseatt_w_{i}
. The shape is as the same with model input.
Xformer decoder supports the following models
models |
---|
TransformerEncoder |
Export configuration
config name | type | default | detail |
---|---|---|---|
max_seq_len | int | 512 | Maximum sequence length. |
model input
input name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
tgt | Input token ids | (batch, maxlen_out) |
int64 | 0, 1 |
memory | encoded memory | (batch, maxlen_in, feat) |
float32 | 0, 1 |
cache | List of cached outputs. The length of list is the same as number of decoders. | List[(1, max_time_out-1, size) ] |
float32 | 0, 1 |
model output
output name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
y | Output feature of decoder. | (batch, feats_length, decoder_feats_dim) |
float32 | 0, 1 |
out_cache{i} | List of caches. The length of list is the same as number of decoders. This argument should be inputs for next cache |
List[(1, max_time_out-1, size) ] |
float32 | 0, 1 |
Export configuration
config name | type | default | detail |
---|---|---|---|
max_seq_len | int | 512 | Maximum sequence length. |
model input
input name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
labels | Label ID sequences | (batch, seq_len) |
int64 | 0, 1 |
h_cache | Cache for RNN | (dlayers, 1, dunits) |
float32 | 1 |
c_cache | Cache for RNN | (dlayers, 1, dunits) |
float32 | 1 |
model output
output name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
sequence | Output sequence of decoder. | (batch, seq_len, decoder_feats_dim) |
float32 | - |
out_h_cache | List of rnn cache. This argument should be inputs for next h_cache |
(dlayers, 1, dunits) |
float32 | 1 |
out_c_cache | List of rnn cache. This argument should be inputs for next c_cache |
(dlayers, 1, dunits) |
float32 | 1 |
Export configuration
There is no available configuration.
model input
input name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
x | Label ID sequences | (batch, seq_len) |
int64 | 0, 1 |
in_hidden1 | Cache for RNN | (dlayers, 1, dunits) |
float32 | 1 |
in_hidden2 | Cache for RNN. Not required if rnn_type is lstm. | (dlayers, 1, dunits) |
float32 | 1 |
model output
output name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
y | Output sequence of decoder. | (batch, seq_len, decoder_feats_dim) |
float32 | - |
out_hidden1 | List of rnn cache. This argument should be inputs for next in_hidden1 |
(dlayers, 1, dunits) |
float32 | 1 |
out_hidden2 | List of rnn cache. This argument should be inputs for next in_hidden2 |
(dlayers, 1, dunits) |
float32 | 1 |
Export configuration
config name | type | default | detail |
---|---|---|---|
max_seq_len | int | 512 | Maximum sequence length. |
model input
input name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
tgt | Label ID sequences | (batch, seq_len) |
int64 | 0, 1 |
cache_{i} | Cache for encoder. The length of list is same as the number of encoders | (batch, 1, enc_feats) |
float32 | 0, 1 |
model output
output name | detail | shape | dtype | dynamic dim |
---|---|---|---|---|
y | Output sequence of decoder. | (batch, seq_len, decoder_feats_dim) |
float32 | - |
cache_{i} | Cache for encoder. The length of list is same as the number of encoders | (batch, 1, enc_feats) |
float32 | 0, 1 |