Why does temperature scaling preserve top-1 predictions?
For multiclass logits z and a learned scalar temperature T>0, temperature scaling computes softmax(z/T). Why does this normally preserve the predicted class?
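
The setup can be checked numerically. Below is a minimal sketch in plain Python, assuming illustrative logits and temperatures; the helper names `softmax`, `temperature_scale`, and `argmax` are hypothetical and not taken from any library. Dividing every logit by the same positive T keeps their ordering, and softmax is order-preserving, so the index of the largest probability should stay the same while the confidence changes.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_scale(logits, temperature):
    # Temperature scaling: softmax(z / T) with a single scalar T > 0.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return softmax([v / temperature for v in logits])

def argmax(values):
    # Index of the largest entry (first one on ties).
    return max(range(len(values)), key=lambda i: values[i])

# Illustrative logits (assumed values, not from the question).
z = [2.0, 0.5, -1.0, 1.2]
temperatures = (0.5, 1.0, 2.0, 10.0)

# Predicted class and top-1 confidence for each temperature.
predictions = {T: argmax(temperature_scale(z, T)) for T in temperatures}
confidences = {T: max(temperature_scale(z, T)) for T in temperatures}

for T in temperatures:
    print(f"T={T}: predicted class {predictions[T]}, confidence {confidences[T]:.3f}")
```

The word "normally" in the question matters in edge cases: exact ties between logits, or a T so large that floating-point rounding makes two probabilities equal, can change which index an argmax routine returns.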