Why can ECE comparisons be unstable?

Two researchers evaluate the same classifier and obtain substantially different Expected Calibration Error values. Which explanation is most technically plausible?
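
One candidate explanation can be checked directly: the usual ECE estimate depends on how predictions are binned, so two evaluations of the same classifier can disagree simply because they used different bin counts. The sketch below is illustrative only. The data are synthetic and calibrated by construction, and the helper name `ece`, the sample size, and the bin counts are assumptions made for this example, not details from the question.

```python
import random


def ece(confidences, correct, n_bins):
    """Equal-width binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of exactly 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    total = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        total += (len(b) / n) * abs(acc - avg_conf)
    return total


rng = random.Random(0)

# Hypothetical data: each prediction is correct with probability equal to its
# confidence, so the underlying classifier is calibrated by construction.
confidences = [rng.uniform(0.5, 1.0) for _ in range(500)]
correct = [1 if rng.random() < c else 0 for c in confidences]

# Same predictions, same labels; only the number of bins changes.
ece_by_bins = {k: ece(confidences, correct, k) for k in (5, 10, 20, 100)}
for k, value in ece_by_bins.items():
    print(f"n_bins={k:3d}  ECE={value:.4f}")
```

Because each of these bin counts divides the next, every finer binning splits the bins of the coarser one, and by the triangle inequality the estimate cannot decrease as bins are refined. The reported ECE therefore grows with the bin count even though the predictions never change. Other choices, such as equal-mass versus equal-width bins or the size of the evaluation set, can plausibly shift the value in similar ways.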