Quiz Verified

Why can doubly robust off-policy evaluation outperform importance sampling?

Anonymous

PostedJul 2, 2026

Question: A doubly robust off-policy estimator combines importance ratios with an estimated action-value model. Under the method's standard support and logging-policy assumptions, which description is most accurate? A) It is unbiased only when both the value model and importance ratios are exact B) It uses the value model as a baseline and an importance-weighted residual correction, potentially retaining unbiasedness with lower variance C) It eliminates importance ratios whenever the value model has lower validation error than the behaviour model D) It guarantees lower mean-squared error than every importance-sampling estimator for every finite dataset Correct: B Explanation: The model-based component provides a lower-variance baseline, while importance-weighted residuals correct model error. Under the relevant assumptions, the estimator can remain unbiased and may have substantially lower variance, but it does not dominate every competing estimator on every finite sample. Topic: advanced ML / reinforcement learning / off-policy evaluation

Why can doubly robust off-policy evaluation outperform importance sampling?

More quiz intel