
Scaled Dot-Product Attention

This method comes from the paper Attention Is All You Need by Vaswani et al.

Here we arrive at one of the most famous equations in deep learning of the last five years.

\[ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Basically, Scaled Dot-Product Attention is Dot-Product Attention with the scores scaled by \(\frac{1}{\sqrt{d_k}}\), where \(d_k\) is the dimension of the queries and keys.

The queries, keys, values, and outputs are all vectors, but after simple linear projections they are packed together into the matrices Q, K, and V, so attention over a whole set of queries is computed at once. Dot-product attention is faster and more space-efficient than additive attention in practice, since it reduces to highly optimized matrix multiplications.
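A minimal sketch of how those projections might look in PyTorch (the dimensions 512 and 64 follow the paper's base configuration; the layer names, batch size, and sequence length are illustrative assumptions):

import torch
import torch.nn as nn

d_model, d_k = 512, 64                    # model width and projection size (base config)
batch, seq_len = 2, 10                    # illustrative batch and sequence length

W_q = nn.Linear(d_model, d_k)             # one projection per role; often fused in practice
W_k = nn.Linear(d_model, d_k)
W_v = nn.Linear(d_model, d_k)

x = torch.randn(batch, seq_len, d_model)  # token embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)          # each has shape (batch, seq_len, d_k)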

For small values of \(d_k\), dot-product and additive attention perform similarly, but for larger \(d_k\) additive attention outperforms unscaled dot-product attention; the \(\frac{1}{\sqrt{d_k}}\) scaling is what keeps dot-product attention competitive.
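As a small numerical sketch (an illustration under assumed random inputs, not a result from the paper): for vectors with unit-variance components, the raw dot product \(q \cdot k\) has variance \(d_k\), so its typical magnitude grows like \(\sqrt{d_k}\) and pushes the softmax into regions with tiny gradients, while dividing by \(\sqrt{d_k}\) keeps the scores at unit scale.

import torch

for d_k in (4, 64, 1024):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)          # unscaled dot products
    scaled = raw / (d_k ** 0.5)        # scaled dot products
    print(d_k, raw.std().item(), scaled.std().item())
# The unscaled standard deviation grows like sqrt(d_k); the scaled one stays near 1.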

import torch
import torch.nn.functional as F

def ScaleDotProduct(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, mask: torch.BoolTensor = None):
    """
    Scaled Dot-Product Attention (Vaswani et al., 2017).
    Paper: https://arxiv.org/abs/1706.03762
    """
    d_k = Q.size(-1)
    # Similarity scores between queries and keys, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # Positions where the mask is False are excluded from attention
        scores = scores.masked_fill(mask.logical_not(), float("-inf"))
    # Attention weights: softmax over the key dimension
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, V)
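
A quick usage sketch with random tensors and a causal (lower-triangular) mask, just to check the expected shapes; the sizes here are arbitrary assumptions:

batch, heads, seq_len, d_k = 2, 4, 6, 64
Q = torch.randn(batch, heads, seq_len, d_k)
K = torch.randn(batch, heads, seq_len, d_k)
V = torch.randn(batch, heads, seq_len, d_k)

# Causal mask: position i may attend only to positions <= i (broadcasts over batch and heads)
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

out = ScaleDotProduct(Q, K, V, mask=causal)
print(out.shape)   # torch.Size([2, 4, 6, 64])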