Quiz Verified
How does AdamW differ from Adam with L2 regularization?
Posted Jun 23, 2026
Question: What is the defining distinction between AdamW and adding an L2 penalty to the objective optimized by Adam?
A) AdamW applies L2 regularization only to parameters whose second-moment estimate exceeds a threshold
B) AdamW incorporates the regularization gradient into Adam's adaptive moment normalization
C) AdamW removes the first-moment estimate when applying parameter shrinkage
D) AdamW applies weight decay separately from the adaptive loss-gradient update
Correct: D
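Explanation: In AdamW, parameter shrinkage is decoupled from the gradient of the training objective: the decay is applied directly to the parameters and never enters Adam's moment estimates. With an L2 penalty inside the loss, the regularization gradient (proportional to the parameter value) is added to the loss gradient and passes through Adam's coordinate-wise adaptive scaling. Each parameter's effective shrinkage then depends on its gradient history, so the result is generally not equivalent to ordinary weight decay.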
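Sketch: the difference can be seen in a single scalar update. The code below is a minimal illustration, assuming the standard Adam moment estimates with bias correction and an L2 penalty of the form (weight_decay / 2) * theta^2. The function names `adam_l2_step` and `adamw_step` and all hyperparameter values are hypothetical, chosen only for this example; real library implementations may differ in details such as learning-rate schedules.

```python
import math


def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """One scalar Adam step with the L2 penalty folded into the gradient."""
    g = grad + weight_decay * theta          # penalty gradient joins the loss gradient
    m = beta1 * m + (1 - beta1) * g          # first moment sees the penalty
    v = beta2 * v + (1 - beta2) * g * g      # second moment sees the penalty
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One scalar AdamW step: decay is applied outside the adaptive update."""
    m = beta1 * m + (1 - beta1) * grad       # moments use the loss gradient only
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * weight_decay * theta        # not rescaled by sqrt(v_hat)
    return theta - adaptive - decay, m, v


# Illustrative case: the loss gradient is zero, so only regularization acts.
theta0 = 2.0
theta_l2, _, _ = adam_l2_step(theta0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.01)
theta_adamw, _, _ = adamw_step(theta0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.01)

print("Adam + L2 shrinkage:", theta0 - theta_l2)
print("AdamW shrinkage:    ", theta0 - theta_adamw)
```

In this setup the Adam + L2 step normalizes the small penalty gradient to roughly unit size, so the parameter moves by about `lr` no matter how small `weight_decay` is. The AdamW step shrinks the parameter by exactly `lr * weight_decay * theta`, which is the behavior option D describes.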
Topic: advanced ML / optimization / AdamW