White RoomNEW

Why is dot-product attention scaled?

Why does scaled dot-product attention divide query-key dot products by √dā‚– before applying softmax?