You're naive.


Implement a Naive Bayes text classifier from scratch.

You are given a labeled training set of documents and a set of unlabeled test documents. Your task is to train a Naive Bayes classifier on the training data and predict the label of each test document.


Input Format

num_classes num_train num_test
label_0 label_1 ... label_{num_classes-1}
<training documents>
<test documents>

Training documents

Each training document is two lines:

<label>
<text>

  • label is one of the class labels defined in the header.
  • text is a single line of raw text.

There are exactly num_train training documents.

Test documents

Each test document is a single line of raw text. There are exactly num_test test documents.
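
A minimal parsing sketch in Python (the function name read_input and the variable names are illustrative, not part of the problem statement):

import sys

def read_input(stream=sys.stdin):
    """Parse the header, the training documents, and the test documents."""
    lines = stream.read().splitlines()
    num_classes, num_train, num_test = map(int, lines[0].split())
    labels = lines[1].split()                  # class labels, in header order (len(labels) == num_classes)
    train = []                                 # list of (label, raw_text) pairs
    pos = 2
    for _ in range(num_train):
        train.append((lines[pos], lines[pos + 1]))
        pos += 2
    tests = lines[pos:pos + num_test]          # one raw line per test document
    return labels, train, tests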


Output Format

One line per test document, in input order:

predicted_label


Tokenization

Tokenization rules, applied in this order:

  1. Lowercase the entire text.
  2. Replace every character that is not a lowercase letter (a-z) or a space with a space.
  3. Split on whitespace. Discard any empty tokens that result.

Example: "Hello, World! 123"["hello", "world"]

These rules must be applied identically to both training and test documents.
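
One possible implementation of these rules, as a sketch (the regular expression mirrors step 2):

import re

def tokenize(text):
    """Apply the three tokenization rules in order."""
    text = text.lower()                     # 1. lowercase
    text = re.sub(r"[^a-z ]", " ", text)    # 2. non-letters and non-spaces become spaces
    return text.split()                     # 3. split on whitespace, dropping empty tokens

# tokenize("Hello, World! 123") == ["hello", "world"]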


Model

Prior

The prior probability of class c is:

P(c) = (number of training documents with label c) / num_train
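
Computed directly in log space, the prior could look like the following sketch (the helper name log_priors is illustrative; the -inf convention for empty classes follows the Edge Cases section below):

import math
from collections import Counter

def log_priors(train, labels):
    """log P(c) from raw document counts; -inf for a class with no training documents."""
    doc_count = Counter(label for label, _ in train)   # documents per class
    n = len(train)                                     # num_train
    return {c: math.log(doc_count[c] / n) if doc_count[c] > 0 else float("-inf")
            for c in labels}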

Likelihood

For each class c and each word w in the vocabulary:

P(w | c) = (count of w in class c + 1) / (total tokens in class c + |V|)

where |V| is the total number of unique words across the entire training set (all classes combined). This is Laplace (additive) smoothing.

  • "count of w in class c" is the number of times w appears across all training documents labeled c.
  • "total tokens in class c" is the total number of tokens (with repetition) across all training documents labeled c.

Classification

For a test document with tokens w_1, w_2, ..., w_t, the predicted class is:

argmax over c of:  log P(c) + sum_{i=1}^{t} log P(w_i | c)

  • Computation must be done in log space to avoid underflow. Do not multiply raw probabilities.
  • If a token in the test document was not seen during training (i.e., it is not in the vocabulary V), skip it; do not include it in the sum.
  • If two or more classes are tied (same log-probability), predict the class that appears first in the label list provided in the header.
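
Putting this together, a scoring sketch that also covers the unknown-word rule, the tie-breaking rule, and the zero-token fallback to the prior (it assumes the helpers sketched above):

import math

def predict(text, labels, priors, word_count, token_total, vocab, tokenize):
    """Score every class in log space and return the best label."""
    best_label, best_score = None, None
    for c in labels:                       # header order, so a strict ">" keeps earlier labels on ties
        score = priors[c]                  # log P(c); -inf for a class with no training documents
        for w in tokenize(text):
            if w not in vocab:             # word never seen in training: skip it
                continue
            score += math.log((word_count[c][w] + 1) / (token_total[c] + len(vocab)))
        if best_score is None or score > best_score:
            best_label, best_score = c, score
    return best_label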

Behavioral Requirements

  • You must not use any ML, NLP, or statistics library. Basic math (log, etc.) is permitted.
  • The vocabulary V is built from training data only. Test-only words do not expand the vocabulary.
  • Laplace smoothing must be applied uniformly: every word in V gets +1 in every class, regardless of whether it appears in that class.
  • Priors must use raw document counts, not token counts.

Edge Cases

  • A test document may tokenize to zero tokens (e.g., it contains only punctuation and digits). In this case, the prediction is based on priors alone: predict the class with the highest log P(c), breaking ties by header order.
  • A class may have zero training documents. Its prior is 0, so log P(c) is treated as negative infinity (-inf); such a class can never be predicted.

Notes

  • All arithmetic must use double (or equivalent 64-bit float).
  • The classifier is a standard Multinomial Naive Bayes with Laplace smoothing. There is no stemming, stop-word removal, or TF-IDF — only raw token counts.
Sample Input:

2 4 2
spam ham
spam
buy now free money
spam
click here free offer
ham
hello how are you
ham
meeting at three today
free money click now
how are you doing

Expected Output:

spam
ham