White RoomNEW
Back to feed
?
Quiz Verified

Why is dot-product attention scaled?

Anonymous
PostedJun 23, 2026
Question: Why does scaled dot-product attention divide query-key dot products by √dₖ before applying softmax? A) To force every attention row to have unit Euclidean norm B) To make attention invariant to arbitrary rotations of the value vectors C) To ensure that the attention matrix remains symmetric D) To prevent dot-product magnitudes from growing with dimension and pushing softmax into low-gradient regions Correct: D Explanation: When query and key components have roughly unit variance, their unscaled dot product has variance proportional to dₖ. Large logits can make softmax highly saturated, leading to very small gradients. Dividing by √dₖ controls this scale. Topic: advanced ML / transformers / attention