You're naive.
Implement a Naive Bayes text classifier from scratch.
You are given a labeled training set of documents and a set of unlabeled test documents. Your task is to train a Naive Bayes classifier on the training data and predict the label of each test document.
Input Format
num_classes num_train num_test
label_0 label_1 ... label_{num_classes-1}
<training documents>
<test documents>
Training documents
Each training document is two lines:
<label>
<text>
- label is one of the class labels defined in the header.
- text is a single line of raw text.
There are exactly num_train training documents.
Test documents
Each test document is a single line of raw text. There are exactly num_test test documents.
Output Format
One line per test document, in input order:
predicted_label
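As a concrete illustration, here is a minimal Python parsing sketch for this format, assuming the whole input arrives on standard input. The function name read_input and the tuple it returns are illustrative choices, not part of the specification.

import sys

def read_input(stream=sys.stdin):
    # Header: number of classes, training documents, and test documents.
    num_classes, num_train, num_test = map(int, stream.readline().split())
    labels = stream.readline().split()  # class labels, in header order

    # Each training document is a label line followed by a text line.
    train_docs = []
    for _ in range(num_train):
        label = stream.readline().strip()
        text = stream.readline().rstrip("\n")
        train_docs.append((label, text))

    # Each test document is a single line of raw text.
    test_docs = [stream.readline().rstrip("\n") for _ in range(num_test)]
    return labels, train_docs, test_docs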
Tokenization
Tokenization rules, applied in this order:
- Lowercase the entire text.
- Replace every character that is not a lowercase letter (a–z) or a space with a space.
- Split on whitespace. Discard any empty tokens that result.
Example: "Hello, World! 123" → ["hello", "world"]
These rules must be applied identically to both training and test documents.
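One possible Python implementation of these rules (the name tokenize is illustrative, not mandated by the problem):

def tokenize(text):
    # Lowercase first, then map every character outside a-z (and space) to a space.
    lowered = text.lower()
    cleaned = "".join(ch if ("a" <= ch <= "z" or ch == " ") else " " for ch in lowered)
    # split() with no argument splits on runs of whitespace and drops empty tokens.
    return cleaned.split()

# tokenize("Hello, World! 123") -> ["hello", "world"]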
Model
Prior
The prior probability of class c is:
P(c) = (number of training documents with label c) / num_train
Likelihood
For each class c and each word w in the vocabulary:
P(w | c) = (count of w in class c + 1) / (total tokens in class c + |V|)
where |V| is the total number of unique words across the entire training set (all classes combined). This is Laplace (additive) smoothing.
- "count of w in class c" is the number of times
wappears across all training documents labeledc. - "total tokens in class c" is the total number of tokens (with repetition) across all training documents labeled
c.
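A minimal training sketch, under the assumption that the tokenize helper above is available; it accumulates the per-class document counts used by the prior together with the word counts, token totals, and vocabulary used by the smoothed likelihood.

import math
from collections import Counter

def train(labels, train_docs):
    word_counts = {c: Counter() for c in labels}   # count of each word per class
    token_totals = {c: 0 for c in labels}          # total tokens per class (with repetition)
    doc_counts = {c: 0 for c in labels}            # training documents per class (for the prior)
    vocab = set()                                  # |V|: unique words across all classes

    for label, text in train_docs:
        tokens = tokenize(text)
        doc_counts[label] += 1
        token_totals[label] += len(tokens)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    return word_counts, token_totals, doc_counts, vocab

def log_likelihood(word, cls, word_counts, token_totals, vocab):
    # Laplace smoothing: +1 in the numerator, +|V| in the denominator.
    return math.log((word_counts[cls][word] + 1) / (token_totals[cls] + len(vocab)))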
Classification
For a test document with tokens w_1, w_2, ..., w_t, the predicted class is:
argmax over c of: log P(c) + sum_{i=1}^{t} log P(w_i | c)
- Computation must be done in log space to avoid underflow. Do not multiply raw probabilities.
- If a token in the test document was not seen during training (i.e., it is not in
|V|), skip it — do not include it in the sum. - If two or more classes are tied (same log-probability), predict the class that appears first in the label list provided in the header.
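A scoring sketch that follows these rules, assuming the tokenize and log_likelihood helpers sketched earlier. Iterating over the labels in header order and comparing with a strict > keeps the earlier class on ties.

import math

def classify(text, labels, word_counts, token_totals, doc_counts, vocab, num_train):
    tokens = tokenize(text)
    best_label, best_score = None, float("-inf")

    for cls in labels:                      # header order, so ties favor the earlier class
        if doc_counts[cls] == 0:
            continue                        # prior 0: log P(c) is -inf, never predicted
        score = math.log(doc_counts[cls] / num_train)
        for w in tokens:
            if w in vocab:                  # tokens unseen in training are skipped
                score += log_likelihood(w, cls, word_counts, token_totals, vocab)
        if score > best_score:              # strict '>' keeps the first (header-order) winner
            best_label, best_score = cls, score
    return best_label

Printing classify(...) for each test document, one result per line, then produces output in the required format.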
Behavioral Requirements
- You must not use any ML, NLP, or statistics library. Basic math (log, etc.) is permitted.
- The vocabulary |V| is built from training data only. Test-only words do not expand the vocabulary.
- Laplace smoothing must be applied uniformly: every word in |V| gets +1 in every class, regardless of whether it appears in that class.
- Priors must use raw document counts, not token counts.
Edge Cases
- A test document may tokenize to zero tokens (e.g., it contains only punctuation and digits). In this case, the prediction is based on priors alone: predict the class with the highest log P(c), breaking ties by header order.
- A class may have zero training documents. Its prior is 0, so log P(c) is undefined (-inf), and such a class can never be predicted (see the sketch below).
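Neither case needs special handling in the tokenization or likelihood code: an empty token list simply leaves the log prior as the whole score, and a zero prior can be encoded as float("-inf") so that class never wins the argmax. A small sketch of one such encoding, assuming Python:

import math

def log_priors(labels, doc_counts, num_train):
    # A class with zero training documents gets -inf and can never be the argmax.
    return {c: (math.log(doc_counts[c] / num_train) if doc_counts[c] else float("-inf"))
            for c in labels}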
Notes
- All arithmetic must use double (or an equivalent 64-bit float).
- The classifier is a standard multinomial Naive Bayes with Laplace smoothing. There is no stemming, stop-word removal, or TF-IDF; only raw token counts are used.
Example Input
2 4 2
spam ham
spam
buy now free money
spam
click here free offer
ham
hello how are you
ham
meeting at three today
free money click now
how are you doing
Example Output
spam
ham