How to apply Naive Bayes Classifiers to document classification problems.
Within the last decades it turned out that it is often much easier to tell a computer how to learn to do a specific task rather then telling it exactly how to do it. One of the generic terms for this could be Machine Learning, which is basically summarized by Wikipedia as:
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases.
In this blog post I will introduce a simple theory-meets-practice example showing how to classify documents using a Naive Bayes Classifier together with a supervised learning strategy.
The example problem
Imagine you are working at the very cool new startup VeryCool Inc. You already have a kickass product and would like to make your customers as happy as possible. Apart from the product performing awesome you would like to offer good support to it as well to make people happy. Let’s assume your product is sold in Germany, The United Kingdom, France, Italy and Spain. You would like to have a single eMail address where support requests should be adressed to, so what you need is a classifier that can make a good guess in which of the five languages the eMail is written and redirect it to the corresponding operator. So, the problem is: “Tell me in which language an incoming eMail is written.”
Some theory
I will start with the maximum a posteriori decision rule from the Naive Bayes Classifier Wikipedia article. For it’s derivation just read the Wikipedia article from beginning.
Where the meaning of the symbols is the following:
- C (upper case) represents the document class and c (lower case) is one of the possible classes. In our case the possible classes c are the five languages (de, en, it, fr, es) we would like to classify each document with.
are called features and
are the values of the corresponding features. We will use a simple bag of words model which means our features are the words occuring in the document and their value is the number of occurences.
is the argmax function which means something like “The class c with the highest value for the following function.”
is the probability that a given document belongs to class c. Let’s assume this is the same for all our languages and hence ignore it as it is just a constant.
is the probability that the word
occurs in a document of class c.
But how do we compute the lastly mentioned probability? Assume for now we already trained the classifier though we yet don’t know exactly how. Let’s assume the classifier knows how many times he has seen the word in a document of class c and call this value
and that he knows how many times
he has seen the word at all. We can compute this probability now as following:
We have not yet taken into account the actual number of the occurences. The afore mentioned equation is suitable if we process the document word by word, but we assume we are getting the number of occurences for each word as input so we just put
as an exponent to the probability.
Taking into account all the assumptions we made so far the final decision rule will look like:
The thoughtful reader may have noticed that this goes totally wrong if (word never occured in class c) or
(word never occured in any class). But this problem could be solved with some additive smoothing:
We will use and
( because we are adding 1 for each language ).
Before we start coding there is one final transformation to be made. As a product of lots of numbers x with tends to become a number so close to zero a computer cannot represent it anymore ( arithmetic underflow ) we will apply the logarithm to our decision rule (it is left to the decent reader to figure out which logarithmic identities have been used):
Used dataset for training and tests
Well obviously we need some kind of dataset to train the classifier and to test how well it works. I decided to choose some articles from Wikipedia as there is lots of free text in the desired languages. So I downloaded all the featured articles for de, en, es, fr, it and extracted the plain text as explained in my last article. I turned the plain text into bags of words by simply splitting the string at whitespaces and putting the words as keys and their frequency as values into a hash with something like this:
def bag_of_words(text) bag = Hash.new 0 text.split(/\s/).reject{|w| w.empty?}.each do |w| bag[w] += 1 end bag end
This procedure has been repeated for alle the articles and the resulting hashes have been marshalled ( kind of serialized ) into files to speed things up a little. The training data is obtained by reading the bags of words for a language and fuse them together into a single hash, along with fusing the bags of word from all languages to get the hash containing the total occurences of each word.
I leave the code for this out, because it is an annoying script that has just been executed once to preprocess the data. Maybe I will upload the dataset somewhere if it is requested, but I did not do it yet because it is serveral hundreds of MB in size.
Gimme teh code!!!
Thanks to the awesome Ruby programming language the code should be pretty self explanatory. Anyway I put in some comments for further clarification.
class NaiveBayesClassifier # Initialize the classifier with the training data. def initialize(trained_classes, total_feature_occurences) # The keys of this hash have to be the names of the # trained classes. The values are the bag-of-word-hashes # for the corresponding class. @trained_classes = trained_classes # The total-occurence hash with words as keys and # the number of their total occurences in all classes # as data. @total_feature_occurences = total_feature_occurences end # Input is the bag of words generated from the # document to be classified. def classify(feature_set) result = Hash.new 0 # create Hash with default 0 feature_set.each do |word, wordcount| # process each word and it's count @trained_classes.each_key do |klass| # for each trained language # compute the probability and sum them up result[klass] += wordcount * Math.log(probability(word, klass)) end end # return the language with the highest result result.keys.sort_by {|key| result[key]}.last end # Calculate the probability of feature_name to occur # in a document of class klass including the additive # smoothing. def probability(feature_name, klass) # Could be in one line, but won't fit into the blog theme, so # I split it up a little. numerator = @trained_classes[klass][feature_name] + 1.0 denominator = @total_feature_occurences[feature_name] + @trained_classes.size numerator / denominator end end
Isn’t this awesome? A simple Naive Bayes Classifier in less than 50 lines. But how does it perform?
The performance investigation
I used different numbers of articles for training the classifier to get an estimate of how much training is necessary for the classifier to work properly. All remaining documents not used for training have been used as test documents.
Trained with 500 articles
---------------------------------------------------------------- Language | Total | Right | Wrong | Success rate ---------------------------------------------------------------- it | 112 | 112 | 0 | 1.000000 fr | 335 | 335 | 0 | 1.000000 de | 1436 | 1436 | 0 | 1.000000 es | 473 | 473 | 0 | 1.000000 en | 2700 | 2694 | 6 | 0.997778 ----------------------------------------------------------------
Pretty awesome result. Seems like it works very well. But what happens if we use the smallest possible amount of articles for training?
Trained with just one document per language
---------------------------------------------------------------- Language | Total | Right | Wrong | Success rate ---------------------------------------------------------------- it | 611 | 610 | 1 | 0.998363 fr | 834 | 833 | 1 | 0.998801 de | 1935 | 1935 | 0 | 1.000000 es | 972 | 970 | 2 | 0.997942 en | 3199 | 3192 | 7 | 0.997812 ----------------------------------------------------------------
Awesome. Even if the classifier has been trained with just one article it works very well. I couldn’t believe this the first time I saw it, so I injected some articles that have been labeled with the wrong language to see if they are classified wrong.
Trained with one article and 100 fault injections into the test data
---------------------------------------------------------------- Language | Total | Right | Wrong | Success rate ---------------------------------------------------------------- it | 711 | 610 | 101 | 0.857947 fr | 934 | 833 | 101 | 0.891863 de | 2035 | 1935 | 100 | 0.950860 es | 1072 | 970 | 102 | 0.904851 en | 3299 | 3197 | 102 | 0.969082 ----------------------------------------------------------------
Okay, seems like the fault injections have been recognized correctly. So, even with small amounts of training data it seems to work pretty well.
Conclusion
We all should praise Mr. Thomas Bayes for his awesome theorem. Even with the naive assumptions the classifier performs really awesome. Well, I have to admit the problem of classifying the language of a text is very simple, but once you understood the underlying theory you are able to harness this powerful tool for other classification problems as well. You can e.g. try to classify documents not by language, but by topic. There are plenty of possibilities.
Any Questions? Feel free to post them as comment.
-
http://twitter.com/jooles_k Julian Kniephoff
-
Nicolas
-
Lukx
-
Lukx
-
Anonymous
-
Anonymous
-
http://twitter.com/Ssahmadian Shaun ahmadian
-
nick reardon
-
Alex
-
Anonymous
-
Jamiu
-
Maxim
-
Maxim


Hi,