The NOexistenceN of you AND me


The NOexistenceN of you AND me is a visual novel developed by Fontainebleau and published by Nino Games. The game revolves around navigating through hundreds of decisions to respond to the "existence" of a mysterious girl who suddenly appeared in your room.

Implement DecisionTree, a binary decision tree classifier.

"As long as you trust, hope, and need her, Lilith will always 'exist' here. Forever... only 'existing' for you."

The NOexistenceN of you AND me


Constructor

DecisionTree(max_depth, min_samples)

  • max_depth is the maximum depth of the decision tree. The root node has depth 0, and each split increases the depth by 1. Once a node's depth == max_depth, it becomes a leaf.
  • min_samples is the minimum number of training samples required to split a node. When a node contains fewer than min_samples samples, it becomes a leaf.
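As a rough sketch, the constructor only needs to record these two hyperparameters; the actual tree is built later by fit. The root field name here is an illustrative assumption, not part of the required interface:

```python
class DecisionTree:
    """Binary decision tree classifier (skeleton)."""

    def __init__(self, max_depth, min_samples):
        # Store the stopping hyperparameters; the tree is built in fit().
        self.max_depth = max_depth
        self.min_samples = min_samples
        self.root = None  # illustrative: set to the root node by fit()
```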

Methods

fit(X, y)

  • X is a 2-dimensional container of training samples. You may assume it is in the format of the most conventional container in your programming language of choice.
    • X[i] is the i-th training sample.
    • X[i][j] is the outcome of feature j for sample i.
  • y is a container of binary labels. You may assume it is in the format of the most conventional container in your programming language of choice.
    • y[i] is the label for sample i.
    • Each label is 0 or 1.

predict(x)

  • x is a feature vector representing one sample.
    • x[j] corresponds to feature j.
    • This will have the same dimensions as a single training sample.

Training Rules

Your tree must be built greedily, top-down, using binary splits on continuous features only.

At each internal node, choose a feature index j and a threshold t, and split the node's samples into:

  • left: X[i][j] <= t
  • right: X[i][j] > t

Both sides must be non-empty.

Splits are chosen to minimize weighted Gini impurity.


Gini Impurity

For a set of samples S, let:

  • p0 be the fraction of samples in S with label 0
  • p1 be the fraction of samples in S with label 1

Then Gini(S) = 1 - (p_0^2 + p_1^2).

For a candidate split into L and R, the split score is (|L| / |S|) * Gini(L) + (|R| / |S|) * Gini(R).

Choose the split with the smallest score.
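A minimal sketch of these two formulas in Python (the names gini and split_score are my own, not mandated by the problem):

```python
def gini(labels):
    # Gini impurity of a set of 0/1 labels: 1 - (p0^2 + p1^2).
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    p0 = 1.0 - p1
    return 1.0 - (p0 * p0 + p1 * p1)

def split_score(left_labels, right_labels):
    # Weighted Gini impurity of a candidate split into L and R.
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)
```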


Candidate Thresholds

For a fixed feature j at a node, consider thresholds t of the form:

  • t = (a + b) / 2, where a < b are adjacent distinct values after sorting the samples by X[i][j]

(Equivalently: try thresholds halfway between adjacent distinct feature values. Note: t is the exact arithmetic mean between these adjacent values.)
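One way to enumerate these midpoints (a sketch; the helper name candidate_thresholds is hypothetical):

```python
def candidate_thresholds(values):
    # Sort the distinct feature values, then take the midpoint of each
    # adjacent pair; duplicate values contribute no extra thresholds.
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]
```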


Stopping Conditions

A node becomes a leaf if any of the following holds:

  • all labels in the node are identical

  • depth == max_depth

  • number of samples < min_samples

  • no valid split exists (i.e., every possible split makes one side empty)

A leaf predicts the majority label of its samples.

If the leaf has a tie (equal number of 0 and 1), predict 0.
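The leaf rule above, including the tie-goes-to-0 case, can be sketched as (helper name is my own):

```python
def leaf_label(labels):
    # Majority vote over the leaf's training labels.
    # On a tie (ones == zeros) the strict '>' falls back to 0.
    ones = sum(labels)
    return 1 if ones > len(labels) - ones else 0
```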


Tie-Breaking (Determinism)

If multiple splits have the same best score, choose the split with:

  • smallest feature index j

  • then smallest threshold t
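Representing each candidate as a tuple (score, j, t) and taking the lexicographic minimum implements exactly this ordering. A sketch, assuming that tied scores compare exactly equal in floating point (with integer inputs the midpoints and impurity fractions are often exact, but an epsilon comparison is the safer general choice):

```python
def best_split(X, y):
    # Try every (feature index j, midpoint threshold t) pair and return the
    # tuple (score, j, t) with the smallest weighted Gini score; Python's
    # tuple ordering breaks ties by smallest j, then smallest t.
    def gini(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        p1 = sum(labels) / n
        return 1.0 - ((1.0 - p1) ** 2 + p1 ** 2)

    best = None  # (score, j, t); stays None if no valid split exists
    for j in range(len(X[0])):
        vals = sorted(set(row[j] for row in X))
        for a, b in zip(vals, vals[1:]):
            t = (a + b) / 2  # both sides are guaranteed non-empty
            left = [y[i] for i, row in enumerate(X) if row[j] <= t]
            right = [y[i] for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            cand = (score, j, t)
            if best is None or cand < best:
                best = cand
    return best
```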


Notes

  • You may NOT use any external libraries that implement decision trees or related training logic.

  • You must explicitly create the tree nodes and compute split scores from scratch.


Input Format

The first line of the input will contain 4 integers, in order:

  • N: the number of samples in the training data
  • D: the number of features in each sample
  • max_depth
  • min_samples

The next N lines will contain D integers. Each line represents one training sample.

The next line will contain N integers, each of which is either 0 or 1.

The next line will contain a single integer Q, the number of test prediction queries you will perform.

The final Q lines will contain D integers each. Each line represents a test vector to classify based on the training data. Note that these queries are independent: the result of one prediction should not affect another.
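Since the input is just whitespace-separated integer tokens, a token-stream parser is simplest; read_case is an illustrative helper name, not a required interface:

```python
def read_case(text):
    # Consume whitespace-separated integer tokens in the order the format
    # specifies: header, N samples of D features, N labels, Q, Q queries.
    it = iter(text.split())

    def nxt():
        return int(next(it))

    N, D, max_depth, min_samples = nxt(), nxt(), nxt(), nxt()
    X = [[nxt() for _ in range(D)] for _ in range(N)]
    y = [nxt() for _ in range(N)]
    Q = nxt()
    queries = [[nxt() for _ in range(D)] for _ in range(Q)]
    return (N, D, max_depth, min_samples), X, y, queries
```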


Output Format

Output Q integers, all of which are either 0 or 1. These are the predicted labels for the query samples.

Constraints

  • 1 ≤ N ≤ 2000
  • 1 ≤ D ≤ 30
  • 0 ≤ max_depth ≤ 20
  • 1 ≤ min_samples ≤ N + 1
  • -10^9 ≤ X[i][j] ≤ 10^9
  • y[i] ∈ {0, 1}
  • 1 ≤ Q ≤ 2000
Sample Input:
4 2 2 1
0 0
0 1
1 0
1 1
0 0 1 1
4
0 0
1 0
0 2
2 0
Expected Output:
0
1
0
1