Dot-Product Attention
To the best of my knowledge, Luong's Dot-Product Attention (also called Multiplicative Attention) was the first dot-product attention to be introduced. It closely resembles Bahdanau's Additive Attention. Luong et al. proposed three alternative alignment functions:
| Score function \(a_t(h_t, \hat{h}_s)\) | Name |
| --- | --- |
| \(h_t^T \cdot \hat{h}_s\) | dot |
| \(h_t^T \cdot \textbf{W}_a\hat{h}_s\) | general |
| \(v_a^T\tanh(\textbf{W}_a[h_t;\hat{h}_s])\) | concat |
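As a concrete sketch of the three score functions (my own illustration, not code from the paper; the tensor names and sizes below are assumed), each score is a single scalar computed from one decoder state \(h_t\) and one encoder state \(\hat{h}_s\):

import torch

d = 8                                          # illustrative hidden size
h_t = torch.randn(d)                           # decoder hidden state at step t
h_s = torch.randn(d)                           # one encoder (source) hidden state
W_a_general = torch.randn(d, d)                # learnable matrix for the "general" score
W_a_concat = torch.randn(d, 2 * d)             # learnable matrix applied to [h_t; h_s]
v_a = torch.randn(d)                           # learnable vector for the "concat" score

score_dot = h_t @ h_s                                                 # dot
score_general = h_t @ (W_a_general @ h_s)                             # general
score_concat = v_a @ torch.tanh(W_a_concat @ torch.cat([h_t, h_s]))   # concat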
Compared to Bahdanau's Attention, Luong's Attention does not use the previous decoder state (\(s_{i-1}\)) and has fewer learnable parameters, which means it requires less computation. As Luong et al. stated in their paper, their version of attention also does not require a bidirectional RNN.
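To make the "fewer learnable parameters" point concrete, here is a rough parameter count (a sketch of my own, using an assumed hidden size of 256): the "general" score only needs one \(d \times d\) matrix, a concat/additive-style score needs a projection over the concatenated states plus the vector \(v_a\), and the plain "dot" score has no parameters at all.

import torch.nn as nn

d = 256                                          # assumed hidden size
general_Wa = nn.Linear(d, d, bias=False)         # W_a for the "general" score
concat_Wa = nn.Linear(2 * d, d, bias=False)      # W_a applied to [h_t; h_s]
concat_va = nn.Linear(d, 1, bias=False)          # v_a for the "concat" score

n_general = sum(p.numel() for p in general_Wa.parameters())
n_concat = sum(p.numel() for p in concat_Wa.parameters()) + sum(p.numel() for p in concat_va.parameters())
print(n_general, n_concat)                       # 65536 vs. 131328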
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeAttention(nn.Module):
    """
    Luong's multiplicative ("general") attention.
    PDF Link : https://arxiv.org/abs/1508.04025
    """
    def __init__(self, embedding_size):
        super(MultiplicativeAttention, self).__init__()
        # W_a from the "general" score: h_t^T W_a h_s
        self.Wa = nn.Linear(embedding_size, embedding_size)

    def forward(self, annotations, hidden_state):
        # annotations: (batch, seq_len, embedding_size) encoder states
        # hidden_state: (batch, embedding_size) current decoder state
        hidden_state = hidden_state.unsqueeze(1)                               # (batch, 1, embedding_size)
        scores = self.Wa(annotations) @ torch.transpose(hidden_state, -2, -1)  # (batch, seq_len, 1)
        weights = F.softmax(scores, dim=1)                                     # normalize over seq_len
        context = torch.sum(weights * annotations, dim=1)                      # (batch, embedding_size)
        return context
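A quick usage example with random tensors (the shapes are assumed: annotations are the encoder outputs, hidden_state is the current decoder state):

attention = MultiplicativeAttention(embedding_size=16)
annotations = torch.randn(4, 10, 16)      # (batch=4, seq_len=10, embedding_size=16)
hidden_state = torch.randn(4, 16)         # (batch=4, embedding_size=16)
context = attention(annotations, hidden_state)
print(context.shape)                      # torch.Size([4, 16])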