Modeling Recurrence for Transformer #4

YeonwooSung opened this issue Aug 25, 2020 · 1 comment

Abstract

  • proposes adding an "Attentive Recurrent Network (ARN)" alongside the Transformer encoder to leverage the strengths of both attention and recurrent networks
  • experiments on WMT14 En→De and WMT17 Zh→En demonstrate the effectiveness of the approach
  • the study reveals that a shallow ARN with a short-cut bridge to the decoder outperforms its deeper counterpart

Details

Main Approach

  • add an additional recurrent encoder on the source side

fig2

  • the recurrent model can be either (a) a simple RNN/GRU/LSTM or (b) an Attentive Recurrent Network (ARN), where the context representation is generated by attending over the source representations with the previous hidden state (a minimal sketch follows below)
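
Based on the description above (a context vector obtained via attention with the previous hidden state, then a recurrent update), here is a minimal PyTorch sketch of such an attentive recurrent step. The class name `ARN`, the single-head attention, the `GRUCell` update, the zero initial state, and the fixed `num_steps` are my illustrative assumptions, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ARN(nn.Module):
    """Sketch of an Attentive Recurrent Network:
    c_t = Attention(h_{t-1}, H) over the source representations H,
    h_t = GRUCell(c_t, h_{t-1}); repeated for a fixed number of steps."""

    def __init__(self, d_model: int, num_steps: int = 8):
        super().__init__()
        self.num_steps = num_steps
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.cell = nn.GRUCell(d_model, d_model)

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        # src: (batch, src_len, d_model) -- source-side representations
        batch, _, d = src.shape
        h = src.new_zeros(batch, d)                      # initial recurrent state
        states = []
        for _ in range(self.num_steps):
            # the previous state queries the whole source sequence
            c, _ = self.attn(h.unsqueeze(1), src, src)   # (batch, 1, d)
            h = self.cell(c.squeeze(1), h)               # recurrent update
            states.append(h)
        # (batch, num_steps, d): a short recurrent summary of the source
        return torch.stack(states, dim=1)
```

For example, `ARN(d_model=512, num_steps=8)(encoder_states)` would return an 8-step recurrent summary the decoder can additionally attend to, 8 matching the step count the ablation below finds roughly optimal.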

Impact of Components

  • ablation study on the size of the additional recurrent encoder
    • a smaller (1-layer) BiARN encoder attached directly to the top of the decoder outperforms all other configurations

table1

  • ablation study on the number of recurrent steps in the ARN
    • around 8 steps appears to be optimal

fig5

  • ablation study on how to integrate the recurrent representation on the decoder side
    • the "stack on top" integration outperformed all other strategies (see the sketch after the figures below)

table2

fig4
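
A hedged sketch of one possible reading of the "stack" integration: the decoder layer keeps its usual self-attention and encoder attention, and an extra cross-attention over the recurrence-encoder (ARN) output is stacked on top before the feed-forward block. The layer structure, post-norm placement, and FFN size here are assumptions for illustration, not the paper's exact configuration; the short-cut variant mentioned above would presumably attach this extra attention only at the top decoder layer.

```python
import torch
import torch.nn as nn

class StackedDecoderLayer(nn.Module):
    """Decoder layer with an extra cross-attention over the ARN states,
    stacked above the usual encoder attention (illustrative sketch)."""

    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.enc_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.arn_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, enc_out, arn_out, tgt_mask=None):
        # 1) masked self-attention over the target prefix
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # 2) standard attention over the Transformer encoder output
        x = self.norms[1](x + self.enc_attn(x, enc_out, enc_out)[0])
        # 3) extra ("stacked") attention over the recurrence encoder output
        x = self.norms[2](x + self.arn_attn(x, arn_out, arn_out)[0])
        # 4) position-wise feed-forward
        return self.norms[3](x + self.ffn(x))
```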

Overall Result

  • with the additional ARN encoder, BLEU scores improve with statistical significance

table3

Linguistic Analysis

  • what linguistic characteristics are the models learning?
    • the 1-Layer BiARN performs better on all syntactic and some semantic probing tasks
  • List of Linguistic Characteristics (probing tasks; a minimal probing sketch follows this list)
    • SeLen: predict sentence length
    • WC: recover the original word given its source embedding
    • TrDep: check whether the encoder infers the hierarchical structure of sentences
    • ToCo: classify sentences in terms of the sequence of their top constituents
    • BShif: test whether two consecutive tokens have been inverted
    • Tense: predict the tense of the main-clause verb
    • SubN: predict the number (singular/plural) of the main-clause subject
    • ObjN: predict the number of the direct object of the main clause
    • SoMo: check whether a sentence has been modified by replacing a random noun or verb
    • CoIn: detect whether the order of two coordinate clauses has been inverted (half of the sentences are inverted)
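
For context on how such probing accuracies are typically produced, here is a sketch under my assumptions (not the paper's released code): freeze the trained encoder, mean-pool its states into a sentence vector, and train a small classifier to predict one property such as Tense. The `encoder(src)` interface and the linear probe are hypothetical.

```python
import torch
import torch.nn as nn

# Hedged probing sketch: the encoder interface (returning states of shape
# (batch, src_len, d_model)) and the linear probe are my assumptions.

D_MODEL = 512
probe = nn.Linear(D_MODEL, 2)                 # e.g. Tense: past vs. present
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def sentence_vector(encoder: nn.Module, src: torch.Tensor) -> torch.Tensor:
    """Mean-pool the frozen encoder's states into one vector per sentence."""
    with torch.no_grad():                     # the encoder is not fine-tuned
        states = encoder(src)                 # (batch, src_len, d_model)
    return states.mean(dim=1)                 # (batch, d_model)

def probe_step(encoder, src_batch, labels):
    feats = sentence_vector(encoder, src_batch)
    loss = loss_fn(probe(feats), labels)      # predict the linguistic label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```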

table5

Personal Thoughts

  • Translation requires a complicated encoding function on the source side. The strengths of attention, RNNs, and CNNs can complement each other to produce a richer representation
  • This paper shows that there is still a small room for improvement in which an RNN encoder can play a part alongside the Transformer encoder via the short-cut trick

Link: https://arxiv.org/pdf/1904.03092v1.pdf
Authors: Hao et al. 2019

@YeonwooSung

Link: https://arxiv.org/abs/2002.00937
Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
