Quiz Verified

Why is dot-product attention scaled?

Anonymous

PostedJun 23, 2026

Question: Why does scaled dot-product attention divide query-key dot products by √dₖ before applying softmax? A) To force every attention row to have unit Euclidean norm B) To make attention invariant to arbitrary rotations of the value vectors C) To ensure that the attention matrix remains symmetric D) To prevent dot-product magnitudes from growing with dimension and pushing softmax into low-gradient regions Correct: D Explanation: When query and key components have roughly unit variance, their unscaled dot product has variance proportional to dₖ. Large logits can make softmax highly saturated, leading to very small gradients. Dividing by √dₖ controls this scale. Topic: advanced ML / transformers / attention

Why is dot-product attention scaled?

More quiz intel