Alleviating the Inequality of Attention Heads for Neural Machine Translation
Zewei Sun, Shujian Huang, Xin-Yu Dai, Jiajun Chen

Abstract
Recent studies show that the attention heads in the Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, applied in two specific ways. Experiments show that it improves translation quality on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.
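The abstract names the method but does not spell out how heads are masked. As a rough illustration only, masking a random subset of heads during one training step might look like the sketch below. The function names, the toy shapes, the `num_masked` parameter, and the choice to zero a head's output are all assumptions made for this example, not the authors' implementation. The importance-based variant listed in the benchmark table (Impt) presumably selects heads by some importance score instead and is not sketched here.

```python
import random


def sample_head_mask(num_heads, num_masked, rng):
    """Return a list of 1.0/0.0 values with exactly `num_masked` zeros.

    Hypothetical helper: the masked heads are chosen uniformly at random.
    """
    masked = set(rng.sample(range(num_heads), num_masked))
    return [0.0 if h in masked else 1.0 for h in range(num_heads)]


def apply_head_mask(head_outputs, mask):
    """Scale each head's output vector by its mask value (0.0 drops the head)."""
    return [[m * x for x in vec] for vec, m in zip(head_outputs, mask)]


# Toy setup (assumed for illustration): 8 heads, each with a 4-dim output.
rng = random.Random(0)
head_outputs = [[float(h + 1)] * 4 for h in range(8)]

# Draw a fresh mask, as one might do per training batch, and apply it.
mask = sample_head_mask(num_heads=8, num_masked=3, rng=rng)
masked_outputs = apply_head_mask(head_outputs, mask)
```

In a real model the mask would be applied to the per-head attention outputs before they are concatenated and projected, and only during training, so that all heads are available at inference time. That placement is also an assumption of this sketch.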
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| machine-translation-on-iwslt2015-vietnamese | HeadMask (Random-18) | BLEU: 26.85 |
| machine-translation-on-iwslt2015-vietnamese | HeadMask (Impt-18) | BLEU: 26.36 |
| machine-translation-on-wmt2016-romanian | HeadMask (Random-18) | BLEU: 32.85 |
| machine-translation-on-wmt2016-romanian | HeadMask (Impt-18) | BLEU: 32.95 |
| machine-translation-on-wmt2017-turkish | HeadMask (Impt-18) | BLEU: 17.48 |
| machine-translation-on-wmt2017-turkish | HeadMask (Random-18) | BLEU: 17.56 |