Return by Death

Hard · Machine Learning · Math · Basics

In the series Re:Zero - Starting Life in Another World, the main character Natsuki Subaru dies countless times, utilizing his "Return by Death" ability to return to the past. With each iteration, he uses knowledge of future events to adjust his actions and achieve an optimal outcome.

Similarly, gradient descent is an iterative optimization process. Like Subaru's repeated attempts, the algorithm starts with poor initial parameters and "dies" (produces high error). Through repeated iterations, it uses knowledge of past errors (gradients) to adjust its parameters, converging toward an optimal prediction model.

Implement PolynomialRegression, a polynomial regression model of degree d trained via gradient descent that iteratively learns from its errors to achieve optimal predictions.


Constructor

PolynomialRegression(degree, learning_rate, max_iterations, tolerance)

  • degree is the polynomial degree d. The model fits a polynomial of the form: \hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d
  • learning_rate is the step size used during gradient descent parameter updates.
  • max_iterations is the maximum number of gradient descent steps to perform.
  • tolerance is the early stopping threshold. If the change in loss between two consecutive iterations is less than tolerance, training stops.
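
A minimal sketch of how a constructor with these arguments might store its state (in Python; the attribute names are illustrative assumptions, not required by the problem):

```python
class PolynomialRegression:
    def __init__(self, degree, learning_rate, max_iterations, tolerance):
        # Hyperparameters controlling the model shape and the optimizer.
        self.degree = degree                  # polynomial degree d
        self.learning_rate = learning_rate    # gradient descent step size
        self.max_iterations = max_iterations  # hard cap on gradient descent steps
        self.tolerance = tolerance            # early-stopping threshold on the loss change

        # Learned state, filled in by fit().
        self.theta = [0.0] * (degree + 1)     # parameters theta_0 .. theta_d, initialized to 0
        self.mu = 0.0                         # training-set mean of x
        self.sigma = 1.0                      # training-set standard deviation of x
        self._losses = []                     # MSE recorded at each iteration
```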

Methods

fit(X, y)

  • X is a container of training input values (scalars).
    • X[i] is the input value for sample i.
  • y is a container of target values (continuous).
    • y[i] is the target for sample i.
  • Before training, normalize the input feature x to have mean 0 and standard deviation 1 across the training set. Store the mean and standard deviation so that future predictions can be transformed into the same space.
  • Construct polynomial features internally: for each training sample with normalized input x', generate features [1, x', (x')^2, ..., (x')^d].
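
The preprocessing inside fit could be sketched with standalone helpers like these (a sketch using plain Python lists; the helper names are made up for illustration):

```python
import math

def normalize_fit(X):
    """Compute the training mean/std and return (normalized inputs, mu, sigma)."""
    n = len(X)
    mu = sum(X) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in X) / n)
    if sigma == 0:
        # All inputs identical: map everything to 0, as the normalization rules require.
        return [0.0] * n, mu, sigma
    return [(x - mu) / sigma for x in X], mu, sigma

def poly_features(x_norm, degree):
    """Build [1, x', x'^2, ..., x'^d] for one normalized input x'."""
    return [x_norm ** j for j in range(degree + 1)]
```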

predict(x)

  • x is a scalar input value representing one sample.
  • Normalize x using the mean and standard deviation computed during fit.
  • Construct polynomial features [1, x', (x')^2, ..., (x')^d] where x' is the normalized input.
  • Return the predicted value: \hat{y} = \theta_0 + \theta_1 x' + \theta_2 (x')^2 + \cdots + \theta_d (x')^d
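
Under the same assumptions, prediction is the normalization followed by a dot product with θ (sketch; names are illustrative):

```python
def predict_one(x, mu, sigma, theta):
    """Predict for one raw input using the mean/std and parameters learned in fit()."""
    x_norm = 0.0 if sigma == 0 else (x - mu) / sigma   # same transform as during training
    return sum(t * (x_norm ** j) for j, t in enumerate(theta))
```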

loss_history()

  • Returns the list of MSE loss values, one per iteration performed during training (including the initial loss before any update).

Training Rules

The model learns a parameter vector θ = [θ₀, θ₁, θ₂, ..., θ_d], for a total of d+1 parameters.

Initialize all parameters to 0.

At each iteration:

  1. Compute predictions for all training samples using the current parameters.
  2. Calculate the MSE loss.
  3. Compute gradients for all parameters with respect to the MSE loss.
  4. Update all parameters simultaneously using gradient descent.

Do not update parameters one at a time — all updates in a single iteration use the gradients computed from the same set of predictions.

Training stops when either max_iterations is reached or the absolute change in loss between two consecutive iterations is less than tolerance.
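
Putting these rules together, one possible training loop is sketched below; it assumes the inputs have already been normalized and expanded into polynomial feature rows, and it treats the exact moment the stopping check fires as one reasonable reading of the statement:

```python
def train(features, y, degree, learning_rate, max_iterations, tolerance):
    """Gradient descent on MSE; features[i] is [1, x'_i, (x'_i)^2, ..., (x'_i)^d]."""
    n = len(y)
    theta = [0.0] * (degree + 1)          # all parameters start at 0
    losses = []
    for _ in range(max_iterations):
        # 1. Predictions for every sample with the current parameters.
        preds = [sum(t * f for t, f in zip(theta, row)) for row in features]
        # 2. MSE loss for this iteration (the first entry is the loss before any update).
        loss = sum((p - t) ** 2 for p, t in zip(preds, y)) / n
        losses.append(loss)
        # Early stop when the loss change between consecutive iterations is below tolerance.
        if len(losses) >= 2 and abs(losses[-2] - losses[-1]) < tolerance:
            break
        # 3. Gradients for all parameters, computed from the same set of predictions.
        grads = [(2.0 / n) * sum((p - t) * row[j] for p, t, row in zip(preds, y, features))
                 for j in range(degree + 1)]
        # 4. Simultaneous update of every parameter.
        theta = [t - learning_rate * g for t, g in zip(theta, grads)]
    return theta, losses
```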


Gradient Descent Updates

For each parameter θⱼ where j ∈ {0, 1, 2, ..., d}, the gradient is:

\frac{\partial \text{MSE}}{\partial \theta_j} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) \cdot (x'_i)^j

where x'ᵢ is the normalized input for sample i, and (x'ᵢ)⁰ = 1.

The update rule is:

\theta_j \leftarrow \theta_j - \text{learning\_rate} \cdot \frac{\partial \text{MSE}}{\partial \theta_j}
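
A direct per-parameter transcription of these two formulas might look like this (a sketch; the training loop above applies the same computation to every j at once):

```python
def gradient_j(preds, y, x_norm, j):
    """d(MSE)/d(theta_j) = (2/N) * sum_i (pred_i - y_i) * (x'_i)^j."""
    n = len(y)
    return (2.0 / n) * sum((p - t) * (x ** j) for p, t, x in zip(preds, y, x_norm))

def update_theta_j(theta_j, grad_j, learning_rate):
    """theta_j <- theta_j - learning_rate * gradient."""
    return theta_j - learning_rate * grad_j
```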


Loss Function

For predictions ŷᵢ and targets yᵢ over N samples:

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
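
In code this is a one-liner (sketch):

```python
def mse(preds, targets):
    """Mean squared error over all N samples."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)
```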


Input Normalization

Compute across the training set:

\mu = \frac{1}{N} \sum_{i=1}^{N} X[i]

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X[i] - \mu)^2}

Replace each input value with:

x'_i = \frac{X[i] - \mu}{\sigma}

If σ = 0 (i.e., all inputs are identical), set all normalized values to 0.
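
As a concrete check, the x values 1 through 5 used in both examples below normalize as follows:

```python
X = [1.0, 2.0, 3.0, 4.0, 5.0]
mu = sum(X) / len(X)                                       # 3.0
sigma = (sum((x - mu) ** 2 for x in X) / len(X)) ** 0.5    # sqrt(2), about 1.4142
x_norm = [(x - mu) / sigma for x in X]
# x_norm is approximately [-1.4142, -0.7071, 0.0, 0.7071, 1.4142]
```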


Notes

  • You may not use any polynomial regression libraries or utilities. Compute all gradients and updates manually.
  • You may not use the closed-form (normal equation) solution.
  • Polynomial features can grow rapidly in magnitude. Normalization is critical to prevent numerical instability.
  • Answers within ±0.01 of the expected value will be marked as correct.

Input Format

The first line contains 5 values, in order:

  • N: the number of training samples
  • d: the polynomial degree
  • learning_rate
  • max_iterations
  • tolerance

The next N lines each contain two floating-point numbers x y, representing one training sample.

The next line contains a single integer Q, the number of prediction queries.

The final Q lines each contain a single floating-point number: the input value to predict on.


Output Format

Output Q floating-point numbers, one per line, each rounded to 4 decimal places. These are the predicted target values for each query sample.
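
A sketch of a driver that reads the input and prints the predictions, assuming a PolynomialRegression class with the interface described above (reading all whitespace-separated tokens at once keeps the parsing simple):

```python
import sys

def main():
    data = sys.stdin.read().split()
    n, d = int(data[0]), int(data[1])
    learning_rate, max_iterations, tolerance = float(data[2]), int(data[3]), float(data[4])

    # N training samples: x y pairs.
    X = [float(data[5 + 2 * i]) for i in range(n)]
    y = [float(data[6 + 2 * i]) for i in range(n)]

    # Q query inputs.
    pos = 5 + 2 * n
    q = int(data[pos])
    queries = [float(data[pos + 1 + i]) for i in range(q)]

    model = PolynomialRegression(d, learning_rate, max_iterations, tolerance)
    model.fit(X, y)
    for x in queries:
        print(f"{model.predict(x):.4f}")

if __name__ == "__main__":
    main()
```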

Examples

Example 1
Input:
5 2 0.01 10000 1e-6
1.0 2.5
2.0 7.0
3.0 14.5
4.0 25.0
5.0 38.5
3
1.5
3.5
6.0

Output:
4.2500
19.2500
59.5000
Explanation: The training data follows the pattern `y = 1 + 1.5x²`. After normalization and training, the model learns to approximate this relationship and makes predictions on the three query points.
Sample Input:
5 1 0.01 10000 1e-6
1.0 3.0
2.0 5.0
3.0 7.0
4.0 9.0
5.0 11.0
3
1.5
3.5
6.0

Expected Output:
3.9974
7.9949
12.9917