Modeling Recurrence for Transformer #4

YeonwooSung opened this issue Aug 25, 2020 · 1 comment

Abstract

  • proposes adding an "Attentive Recurrent Network (ARN)" alongside the Transformer encoder to leverage the strengths of both attention and recurrent networks
  • experiments on WMT14 En→De and WMT17 Zh→En demonstrate the effectiveness of the approach
  • the study reveals that a shallow ARN with a short-cut bridge to the decoder outperforms its deeper counterpart

Details

Main Approach

  • add an additional recurrent encoder on the source side

fig2

  • the recurrent model can be either (a) a simple RNN/GRU/LSTM or (b) an Attentive Recurrent Network (ARN), where the context representation is generated by attending over the source representations with the previous hidden state (a minimal sketch follows below)
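
Based on the description above (a context vector obtained via attention with the previous hidden state, then a recurrent update), here is a minimal PyTorch sketch of such an attentive recurrent step. The class name `ARN`, the single-head attention, the `GRUCell` update, the zero initial state, and the fixed `num_steps` are my illustrative assumptions, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ARN(nn.Module):
    """Sketch of an Attentive Recurrent Network:
    c_t = Attention(h_{t-1}, H) over the source representations H,
    h_t = GRUCell(c_t, h_{t-1}); repeated for a fixed number of steps."""

    def __init__(self, d_model: int, num_steps: int = 8):
        super().__init__()
        self.num_steps = num_steps
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.cell = nn.GRUCell(d_model, d_model)

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        # src: (batch, src_len, d_model) -- source-side representations
        batch, _, d = src.shape
        h = src.new_zeros(batch, d)                      # initial recurrent state
        states = []
        for _ in range(self.num_steps):
            # the previous state queries the whole source sequence
            c, _ = self.attn(h.unsqueeze(1), src, src)   # (batch, 1, d)
            h = self.cell(c.squeeze(1), h)               # recurrent update
            states.append(h)
        # (batch, num_steps, d): a short recurrent summary of the source
        return torch.stack(states, dim=1)
```

For example, `ARN(d_model=512, num_steps=8)(encoder_states)` would return an 8-step recurrent summary the decoder can additionally attend to, 8 matching the step count the ablation below finds roughly optimal.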

Impact of Components

  • ablation study on the size of the additional recurrent encoder
    • a smaller (1-layer) BiARN encoder attached directly to the top of the decoder outperforms all other configurations

table1

  • ablation study on the number of recurrent steps in the ARN
    • around 8 steps appears to be optimal

fig5

  • ablation study on how to integrate the recurrent representation on the decoder side
    • the "stack on top" integration outperformed all other strategies (see the sketch after the figures below)

table2

fig4
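
A hedged sketch of one possible reading of the "stack" integration: the decoder layer keeps its usual self-attention and encoder attention, and an extra cross-attention over the recurrence-encoder (ARN) output is stacked on top before the feed-forward block. The layer structure, post-norm placement, and FFN size here are assumptions for illustration, not the paper's exact configuration; the short-cut variant mentioned above would presumably attach this extra attention only at the top decoder layer.

```python
import torch
import torch.nn as nn

class StackedDecoderLayer(nn.Module):
    """Decoder layer with an extra cross-attention over the ARN states,
    stacked above the usual encoder attention (illustrative sketch)."""

    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.enc_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.arn_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, enc_out, arn_out, tgt_mask=None):
        # 1) masked self-attention over the target prefix
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # 2) standard attention over the Transformer encoder output
        x = self.norms[1](x + self.enc_attn(x, enc_out, enc_out)[0])
        # 3) extra ("stacked") attention over the recurrence encoder output
        x = self.norms[2](x + self.arn_attn(x, arn_out, arn_out)[0])
        # 4) position-wise feed-forward
        return self.norms[3](x + self.ffn(x))
```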

Overall Result

  • with the additional ARN encoder, BLEU scores improve with statistical significance

table3

Linguistic Analysis

  • what linguistic characteristics are the models learning?
    • the 1-Layer BiARN performs better on all syntactic and some semantic probing tasks
  • List of Linguistic Characteristics (probing tasks; a minimal probing sketch follows this list)
    • SeLen: predict sentence length
    • WC: recover the original word given its source embedding
    • TrDep: check whether the encoder infers the hierarchical structure of sentences
    • ToCo: classify sentences in terms of the sequence of their top constituents
    • BShif: test whether two consecutive tokens have been inverted
    • Tense: predict the tense of the main-clause verb
    • SubN: predict the number (singular/plural) of the main-clause subject
    • ObjN: predict the number of the direct object of the main clause
    • SoMo: check whether a sentence has been modified by replacing a random noun or verb
    • CoIn: detect whether the order of two coordinate clauses has been inverted (half of the sentences are inverted)
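
For context on how such probing accuracies are typically produced, here is a sketch under my assumptions (not the paper's released code): freeze the trained encoder, mean-pool its states into a sentence vector, and train a small classifier to predict one property such as Tense. The `encoder(src)` interface and the linear probe are hypothetical.

```python
import torch
import torch.nn as nn

# Hedged probing sketch: the encoder interface (returning states of shape
# (batch, src_len, d_model)) and the linear probe are my assumptions.

D_MODEL = 512
probe = nn.Linear(D_MODEL, 2)                 # e.g. Tense: past vs. present
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def sentence_vector(encoder: nn.Module, src: torch.Tensor) -> torch.Tensor:
    """Mean-pool the frozen encoder's states into one vector per sentence."""
    with torch.no_grad():                     # the encoder is not fine-tuned
        states = encoder(src)                 # (batch, src_len, d_model)
    return states.mean(dim=1)                 # (batch, d_model)

def probe_step(encoder, src_batch, labels):
    feats = sentence_vector(encoder, src_batch)
    loss = loss_fn(probe(feats), labels)      # predict the linguistic label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```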

table5

Personal Thoughts

  • Translation requires a complicated encoding function on the source side. The strengths of attention, RNNs, and CNNs can complement each other to produce a richer representation
  • This paper shows that there is still a small room for improvement in which an RNN encoder can play a part alongside the Transformer encoder via the short-cut trick

Link: https://arxiv.org/pdf/1904.03092v1.pdf
Authors: Hao et al. 2019

@YeonwooSung

Link: https://arxiv.org/abs/2002.00937
Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
