You're naive.
Implement a Naive Bayes text classifier from scratch.
You are given a labeled training set of documents and a set of unlabeled test documents. Your task is to train a Naive Bayes classifier on the training data and predict the label of each test document.
Input Format
num_classes num_train num_test
label_0 label_1 ... label_{num_classes-1}
<training documents>
<test documents>
Training documents
Each training document is two lines:
<label>
<text>
- label is one of the class labels defined in the header.
- text is a single line of raw text.
There are exactly num_train training documents.
Test documents
Each test document is a single line of raw text. There are exactly num_test test documents.
Output Format
One line per test document, in input order:
predicted_label
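As a concrete illustration, here is a minimal Python parsing sketch for this format, assuming the whole input arrives on standard input. The function name read_input and the tuple it returns are illustrative choices, not part of the specification.

import sys

def read_input(stream=sys.stdin):
    # Header: number of classes, training documents, and test documents.
    num_classes, num_train, num_test = map(int, stream.readline().split())
    labels = stream.readline().split()  # class labels, in header order

    # Each training document is a label line followed by a text line.
    train_docs = []
    for _ in range(num_train):
        label = stream.readline().strip()
        text = stream.readline().rstrip("\n")
        train_docs.append((label, text))

    # Each test document is a single line of raw text.
    test_docs = [stream.readline().rstrip("\n") for _ in range(num_test)]
    return labels, train_docs, test_docs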
Tokenization
Tokenization rules, applied in this order:
- Lowercase the entire text.
- Replace every character that is not a lowercase letter (a–z) or a space with a space.
- Split on whitespace. Discard any empty tokens that result.
Example: "Hello, World! 123" → ["hello", "world"]
These rules must be applied identically to both training and test documents.
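One possible Python implementation of these rules (the name tokenize is illustrative, not mandated by the problem):

def tokenize(text):
    # Lowercase first, then map every character outside a-z (and space) to a space.
    lowered = text.lower()
    cleaned = "".join(ch if ("a" <= ch <= "z" or ch == " ") else " " for ch in lowered)
    # split() with no argument splits on runs of whitespace and drops empty tokens.
    return cleaned.split()

# tokenize("Hello, World! 123") -> ["hello", "world"]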
Model
Prior
The prior probability of class c is:
P(c) = (number of training documents with label c) / num_train
Likelihood
For each class c and each word w in the vocabulary:
P(w | c) = (count of w in class c + 1) / (total tokens in class c + |V|)
where |V| is the total number of unique words across the entire training set (all classes combined). This is Laplace (additive) smoothing.
- "count of w in class c" is the number of times
wappears across all training documents labeledc. - "total tokens in class c" is the total number of tokens (with repetition) across all training documents labeled
c.
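A minimal training sketch, under the assumption that the tokenize helper above is available; it accumulates the per-class document counts used by the prior together with the word counts, token totals, and vocabulary used by the smoothed likelihood.

import math
from collections import Counter

def train(labels, train_docs):
    word_counts = {c: Counter() for c in labels}   # count of each word per class
    token_totals = {c: 0 for c in labels}          # total tokens per class (with repetition)
    doc_counts = {c: 0 for c in labels}            # training documents per class (for the prior)
    vocab = set()                                  # |V|: unique words across all classes

    for label, text in train_docs:
        tokens = tokenize(text)
        doc_counts[label] += 1
        token_totals[label] += len(tokens)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    return word_counts, token_totals, doc_counts, vocab

def log_likelihood(word, cls, word_counts, token_totals, vocab):
    # Laplace smoothing: +1 in the numerator, +|V| in the denominator.
    return math.log((word_counts[cls][word] + 1) / (token_totals[cls] + len(vocab)))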
Classification
For a test document with tokens w_1, w_2, ..., w_t, the predicted class is:
argmax over c of: log P(c) + sum_{i=1}^{t} log P(w_i | c)
- Computation must be done in log space to avoid underflow. Do not multiply raw probabilities.
- If a token in the test document was not seen during training (i.e., it is not in
|V|), skip it — do not include it in the sum. - If two or more classes are tied (same log-probability), predict the class that appears first in the label list provided in the header.
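A scoring sketch that follows these rules, assuming the tokenize and log_likelihood helpers sketched earlier. Iterating over the labels in header order and comparing with a strict > keeps the earlier class on ties.

import math

def classify(text, labels, word_counts, token_totals, doc_counts, vocab, num_train):
    tokens = tokenize(text)
    best_label, best_score = None, float("-inf")

    for cls in labels:                      # header order, so ties favor the earlier class
        if doc_counts[cls] == 0:
            continue                        # prior 0: log P(c) is -inf, never predicted
        score = math.log(doc_counts[cls] / num_train)
        for w in tokens:
            if w in vocab:                  # tokens unseen in training are skipped
                score += log_likelihood(w, cls, word_counts, token_totals, vocab)
        if score > best_score:              # strict '>' keeps the first (header-order) winner
            best_label, best_score = cls, score
    return best_label

Printing classify(...) for each test document, one result per line, then produces output in the required format.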
Behavioral Requirements
- You must not use any ML, NLP, or statistics library. Basic math (log, etc.) is permitted.
- The vocabulary |V| is built from training data only. Test-only words do not expand the vocabulary.
- Laplace smoothing must be applied uniformly: every word in |V| gets +1 in every class, regardless of whether it appears in that class.
- Priors must use raw document counts, not token counts.
Edge Cases
- A test document may tokenize to zero tokens (e.g., it contains only punctuation and digits). In this case, the prediction is based on priors alone: predict the class with the highest log P(c), breaking ties by header order.
- A class may have zero training documents. Its prior is 0, so log P(c) is undefined (-inf), and such a class can never be predicted (see the sketch below).
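Neither case needs special handling in the tokenization or likelihood code: an empty token list simply leaves the log prior as the whole score, and a zero prior can be encoded as float("-inf") so that class never wins the argmax. A small sketch of one such encoding, assuming Python:

import math

def log_priors(labels, doc_counts, num_train):
    # A class with zero training documents gets -inf and can never be the argmax.
    return {c: (math.log(doc_counts[c] / num_train) if doc_counts[c] else float("-inf"))
            for c in labels}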
Notes
- All arithmetic must use double (or an equivalent 64-bit float).
- The classifier is a standard multinomial Naive Bayes with Laplace smoothing. There is no stemming, stop-word removal, or TF-IDF; only raw token counts are used.
Example Input
2 4 2
spam ham
spam
buy now free money
spam
click here free offer
ham
hello how are you
ham
meeting at three today
free money click now
how are you doing
Example Output
spam
ham