## How to apply Naive Bayes Classifiers to document classification problems.

Within the last decades it turned out that it is often much easier to tell a computer how to learn to do a specific task rather then telling it exactly how to do it. One of the generic terms for this could be Machine Learning, which is basically summarized by Wikipedia as:

Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases.

In this blog post I will introduce a simple theory-meets-practice example showing how to classify documents using a Naive Bayes Classifier together with a supervised learning strategy.

## The example problem

Imagine you are working at the very cool new startup VeryCool Inc. You already have a kickass product and would like to make your customers as happy as possible. Apart from the product performing awesome you would like to offer good support to it as well to make people happy. Let’s assume your product is sold in Germany, The United Kingdom, France, Italy and Spain. You would like to have a single eMail address where support requests should be adressed to, so what you need is a classifier that can make a good guess in which of the five languages the eMail is written and redirect it to the corresponding operator. So, the problem is: “Tell me in which language an incoming eMail is written.”

## Some theory

I will start with the maximum a posteriori decision rule from the Naive Bayes Classifier Wikipedia article. For it’s derivation just read the Wikipedia article from beginning.

$\text{classify}(f_1,\dots,f_n) = \text{argmax}_c \ p(C=c) \prod_{i=1}^n p(F_i=f_i\vert C=c)$

Where the meaning of the symbols is the following:

• C (upper case) represents the document class and c (lower case)  is one of the possible classes. In our case the possible classes c are the five languages (de, en, it, fr, es) we would like to classify each document with.
• $F_i, \dots, F_n$ are called features and $f_i, \dots, f_n$ are the values of the corresponding features. We will use a simple bag of words model which means our features are the words occuring in the document and their value is the number of occurences.
• $\text{argmax}_c$ is the argmax function which means something like “The class c with the highest value for the following function.”
• $p(C = c)$ is the probability that a given document belongs to class c. Let’s assume this is the same for all our languages and hence ignore it as it is just a constant.
• $p(F_i = f_i \vert C = c)$ is the probability that the word $F_i$ occurs in a document of class c.

But how do we compute the lastly mentioned probability? Assume for now we already trained the classifier though we yet don’t know exactly how. Let’s assume the classifier knows how many times he has seen the word $F_i$ in a document of class c and call this value $f_{ic}$ and that he knows how many times $f_{i total}$ he has seen the word at all. We can compute this probability now as following:

$p(F_i = f_i \vert C = c) = \frac{f_{ic}}{f_{i total}}$

We have not yet taken into account the actual number $f_i$ of the occurences. The afore mentioned equation is suitable if we process the document word by word, but we assume we are getting the number of occurences for each word as input so we just put  $f_i$ as an exponent to the probability.

Taking into account all the assumptions we made so far the final decision rule will look like:

$\text{classify}(f_1,\dots,f_n) = \text{argmax}_c \prod_{i=1}^n \left( \frac{f_{ic}}{f_{i total}}\right)^{f_i}$

The thoughtful reader may have noticed that this goes totally wrong if $f_{ic} = 0$ (word never occured in class c) or $f_{i total} = 0$  (word never occured in any class). But this problem could be solved with some additive smoothing:

$\text{classify}(f_1,\dots,f_n) = \text{argmax}_c \prod_{i=1}^n \left(\frac{f_{ic} + \alpha}{f_{i total} + k\cdot \alpha}\right)^{f_i}$

We will use $\alpha = 1$ and $k = 5$ ( because we are adding 1 for each language ).

Before we start coding there is one final transformation to be made. As a product of lots of numbers x with $0 < x < 1$ tends to become a number so close to zero a computer cannot represent it anymore ( arithmetic underflow ) we will apply the logarithm to our decision rule (it is left to the decent reader to figure out which logarithmic identities have been used):

$\text{classify}(f_1,\dots,f_n) = \text{argmax}_c \sum_{i=1}^n f_i \cdot \ln\frac{f_{ic} + \alpha}{f_{i total} + k\cdot \alpha}$

## Used dataset for training and tests

Well obviously we need some kind of dataset to train the classifier and to test how well it works. I decided to choose some articles from Wikipedia as there is lots of free text in the desired languages. So I downloaded all the featured articles for de, en, es, fr, it and extracted the plain text as explained in my last article. I turned the plain text into bags of words by simply splitting the string at whitespa﻿ces and putting the words as keys and their frequency as values into a hash with something like this:

def bag_of_words(text) bag = Hash.new 0 text.split(/\s/).reject{|w| w.empty?}.each do |w| bag[w] += 1 end bag end

This procedure has been repeated for alle the articles and the resulting hashes have been marshalled ( kind of serialized ) into files to speed things up a little. The training data is obtained by reading the bags of words for a language and fuse them together into a single hash, along with fusing the bags of word from all languages to get the hash containing the total occurences of each word.

I leave the code for this out, because it is an annoying script that has just been executed once to preprocess the data. Maybe I will upload the dataset somewhere if it is requested, but I did not do it yet because it is serveral hundreds of MB in size.

## Gimme teh code!!!

Thanks to the awesome Ruby programming language the code should be pretty self explanatory. Anyway I put in some comments for further clarification.

class NaiveBayesClassifier # Initialize the classifier with the training data. def initialize(trained_classes, total_feature_occurences) # The keys of this hash have to be the names of the # trained classes. The values are the bag-of-word-hashes # for the corresponding class. @trained_classes = trained_classes   # The total-occurence hash with words as keys and # the number of their total occurences in all classes # as data. @total_feature_occurences = total_feature_occurences end   # Input is the bag of words generated from the # document to be classified. def classify(feature_set) result = Hash.new 0 # create Hash with default 0 feature_set.each do |word, wordcount| # process each word and it's count @trained_classes.each_key do |klass| # for each trained language # compute the probability and sum them up result[klass] += wordcount * Math.log(probability(word, klass)) end end # return the language with the highest result result.keys.sort_by {|key| result[key]}.last end   # Calculate the probability of feature_name to occur # in a document of class klass including the additive # smoothing. def probability(feature_name, klass) # Could be in one line, but won't fit into the blog theme, so # I split it up a little. numerator = @trained_classes[klass][feature_name] + 1.0 denominator = @total_feature_occurences[feature_name] + @trained_classes.size numerator / denominator end end

Isn’t this awesome? A simple Naive Bayes Classifier in less than 50 lines. But how does it perform?

## The performance investigation

I used different numbers of articles for training the classifier to get an estimate of how much training is necessary for the classifier to work properly. All remaining documents not used for training have been used as test documents.

#### Trained with 500 articles

----------------------------------------------------------------
Language   | Total      | Right      | Wrong      | Success rate
----------------------------------------------------------------
it         | 112        | 112        | 0          | 1.000000
fr         | 335        | 335        | 0          | 1.000000
de         | 1436       | 1436       | 0          | 1.000000
es         | 473        | 473        | 0          | 1.000000
en         | 2700       | 2694       | 6          | 0.997778
----------------------------------------------------------------

Pretty awesome result. Seems like it works very well. But what happens if we use the smallest possible amount of articles for training?

#### Trained with just one document per language

----------------------------------------------------------------
Language   | Total      | Right      | Wrong      | Success rate
----------------------------------------------------------------
it         | 611        | 610        | 1          | 0.998363
fr         | 834        | 833        | 1          | 0.998801
de         | 1935       | 1935       | 0          | 1.000000
es         | 972        | 970        | 2          | 0.997942
en         | 3199       | 3192       | 7          | 0.997812
----------------------------------------------------------------

Awesome. Even if the classifier has been trained with just one article it works very well. I couldn’t believe this the first time I saw it, so I injected some articles that have been labeled with the wrong language to see if they are classified wrong.

#### Trained with one article and 100 fault injections into the test data

----------------------------------------------------------------
Language   | Total      | Right      | Wrong      | Success rate
----------------------------------------------------------------
it         | 711        | 610        | 101        | 0.857947
fr         | 934        | 833        | 101        | 0.891863
de         | 2035       | 1935       | 100        | 0.950860
es         | 1072       | 970        | 102        | 0.904851
en         | 3299       | 3197       | 102        | 0.969082
----------------------------------------------------------------

Okay, seems like the fault injections have been recognized correctly. So, even with small amounts of training data it seems to work pretty well.

## Conclusion

We all should praise Mr. Thomas Bayes for his awesome theorem. Even with the naive assumptions the classifier performs really awesome. Well, I have to admit the problem of classifying the language of a text is very simple, but once you understood the underlying theory you are able to harness this powerful tool for other classification problems as well. You can e.g. try to classify documents not by language, but by topic. There are plenty of possibilities.

Any Questions? Feel free to post them as comment.

Very interesting write up and pretty surprising results in my opinion. Have you tried feeding in articles in languages other than the “designated” five? Would be interesting to see how well the classifier recognizes “proximity” of languages he does not know; especially when you think about classifying documents by topic, where there is no such clear distinction between the classes.

• Nicolas

So, so… Aber bei der länge der Wikipedia-Artikel ist die sehr gute Performance auch kein Wunder. Interessanter wäre es mal nur den ersten Satz aus jedem Artikel zu nehmen. Sowohl zum Klassifizieren als auch zum Lernen. Schließlich schreiben die meisten Support-Anfragenden eher etwas wie “Produkt XY funktioniert nicht mehr. Was soll ich machen?!” und keine 12-seitige Abhandlung darüber, warum es nicht mehr funktioniert .

• Lukx

What I am wondering

• Lukx

I am wondering whether one could adopt the algorithm to not only detect native languages, but much rather programming languages? Like, I give any type of code to my script and it figures out what language it is in?

• Anonymous

Yes, of course. You do not even have to change the algorithm. Only the feature selection and the training data has to be changed. Instead of using words and their occurence count you can split code at whitespaces and use whatever you get then as features. You can use some open source projects as training data. Just make sure you are using equal amounts of training data for each programming language.

Let me know how it works if you try it!

• Anonymous

@Julian, @Nicolas

This is of course far away from production usable. The intention of the post is to introduce the basic principles of Naive Bayes Classification. In reality you would deal with things like character encodings or improve feature selection by using n-grams ( http://en.wikipedia.org/wiki/N-gram ) or do a more sophisticated preprocessing, e.g. stripping out punctuation and similar things.

And of course you need some mechanism to deal with unknown languages. Perhaps you would use a more complex decision rule to get the explicit probability of your guess to be able to decide if it makes sense.

Testing single sentences is a good idea. Will try that when I have some time again. Trying out how small the training data could be for the classifier to work properly is interesting as well. But as this method is a data driven approach you would like to use as much training data as possible to get best results.

Thanks for the great post!

If anyone is interested, Yelp has an implementation in Python using map-reduce framework (mrjob library):https://github.com/Yelp/dataset-examples

• nick reardon

Can you put this into Python and what does it all mean to me. I’m a linguist, I came up with a theory of grammar that seems to be merging towards naive bayes classifiers, but I don’t think it’s quite there, partially because I don’t understand the programming, but also the terminology. “Fault injections,” The languages are di-alphabetic and have no knowledge known to me behind the word. I think we should merge computer knowledge with human knowledge, and the best way to do it is through the use of pattern recognition, but I don’t think we need to classify things, we just need to find out how to determine the parameters of an item (being any symbol/combination of symbols on any level[letter, morphemes, words, phrases, sentences, input/output, conversational, etc]) and how to store knowledge within that environment. The type of patterns it recognizes would be hardly conceivable to a human. One of which would be how many calculations it has made, where it is stored, and how to use it to prompt a response, or to make presumptions and assumptions from the input it receives. Take this example.

“This is how you search for something.”
and
“This is how you make a search.”

The item in this case seems to be a sort of template, where the common denominator is “This+is+how+you+search.” We need to get it to recognize that as an item and store operational data within it. Then we can make it do things for us.

• Alex

hi nils, have you written a scientific paper of naive bayes too, that I could quote for a thesis?

• Anonymous

Hi Alex,

I have not published anything yet, still working on my M.Sc.

There are some references at the bottom of the wikipedia articles. To quickly find some more papers just type “naive bayes” into google scholar and you will find LOTS of results. Maybe add some keywords matching the topic of your thesis,…

• Jamiu

Thanks. Great article.

• Maxim

One note regarding the smoothing parameters in the algorithm you presented. When you evaluate results for each of classes, you calculate resulting probability as a sum over the feature set. Therefore would it be more accurate to use the size of feature set (number of different words) instead of number of classes?

• Maxim

I meant “use this number as k” in the previous comment. Sorry for confusing)

• http://yawar.blogspot.com/ Yawar Amin

Nils, where you have (in the ‘Some theory’ section):

P(F_i = f_i | C = c) = f_{ic} / f_{itotal}

Shouldn’t that be:

P(F_i = f_i | C = c) = f_{ic} / f_c

?

In other words, since the condition is C = c we want to calculate based on the occurrence frequency of language c?

Thanks for the write-up, I learned the additive smoothing trick and you reminded
me to take the natural log.