Why can doubly robust off-policy evaluation outperform importance sampling?
A doubly robust off-policy estimator combines importance ratios with an estimated action-value model. Under the method's standard support and logging-policy assumptions, which description is most accurate?
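As a rough illustration of how the two ingredients fit together, here is a minimal sketch for a one-step (contextual bandit) setting. It assumes the logging probabilities are known and nonzero wherever the target policy puts mass. All names (`dr_estimate`, `is_estimate`, `target_probs`, `q_hat`, the toy policies) are hypothetical and chosen only for this example.

```python
import random

def is_estimate(logs, target_probs):
    """Plain importance sampling: average of ratio-weighted rewards."""
    return sum(target_probs(x)[a] / p_log * r for x, a, r, p_log in logs) / len(logs)

def dr_estimate(logs, target_probs, q_hat):
    """Doubly robust estimate: model value plus a ratio-weighted residual.

    logs         -- list of (context, action, reward, logging_prob) tuples
    target_probs -- function: context -> list of target-policy action probabilities
    q_hat        -- function: (context, action) -> estimated action value
    """
    total = 0.0
    for x, a, r, p_log in logs:
        pi = target_probs(x)
        # Model-based value of the target policy in this context.
        model_value = sum(pi[b] * q_hat(x, b) for b in range(len(pi)))
        # Importance ratio for the logged action.
        ratio = pi[a] / p_log
        # Correct the model using the observed reward on the logged action.
        total += model_value + ratio * (r - q_hat(x, a))
    return total / len(logs)

# Toy problem (an assumption for illustration): one context, two actions,
# deterministic rewards, uniform logging policy.
TRUE_Q = [1.0, 2.0]
LOGGING = [0.5, 0.5]
TARGET = [0.2, 0.8]

def target_probs(x):
    return TARGET

def q_exact(x, a):
    return TRUE_Q[a]

def q_zero(x, a):
    return 0.0

rng = random.Random(0)
logs = []
for _ in range(200):
    a = rng.choices([0, 1], weights=LOGGING)[0]
    logs.append((None, a, TRUE_Q[a], LOGGING[a]))

print("IS estimate:           ", is_estimate(logs, target_probs))
print("DR with exact q_hat:   ", dr_estimate(logs, target_probs, q_exact))
print("DR with all-zero q_hat:", dr_estimate(logs, target_probs, q_zero))
```

In this toy setup, an exact `q_hat` makes every residual zero, so the doubly robust estimate no longer depends on which actions happened to be logged, while the importance sampling estimate still fluctuates with the sample. With an all-zero `q_hat`, the doubly robust estimate reduces to plain importance sampling.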