Here I will try to provide a small illustration of Harry Zhang's article "The Optimality of Naive Bayes". This example explains how a Naive Bayes classifier can be optimal even when there are dependencies between attributes. If you need more background on the subject, read the article first.
Let's consider a document classification problem; for simplicity we will search for only two words in each document: W1 and W2. We will use boolean values (T and F) to encode the possible variable values: W1=T means a document contains W1, and W1=F means it does not. Each document can be classified into two categories: more important (C=T) and less important (C=F).
Let's suppose we have the probabilities for the Naive Bayes classifier as in the tables below.
Category probabilities:

|   | C=T | C=F |
|---|-----|-----|
| P | 0.6 | 0.4 |
Conditional probabilities for W1:
|     | W1=T | W1=F |
|-----|------|------|
| C=T | 0.1  | 0.9  |
| C=F | 0.3  | 0.7  |
Conditional probabilities for W2:
|     | W2=T | W2=F |
|-----|------|------|
| C=T | 0.3  | 0.7  |
| C=F | 0.4  | 0.6  |
Now, let's calculate the scores for the NB (Naive Bayes) classifier. For example, which category will the NB classifier select for a document with W1=T, W2=T? According to Bayes' rule and the independence assumption we can calculate:
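$$P(C \mid W1, W2) = \frac{P(C)\,P(W1 \mid C)\,P(W2 \mid C)}{P(W1, W2)}$$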
By using the values from the tables above and removing the denominator, because it is the same for both categories, we get:
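$$P(C{=}T)\,P(W1{=}T \mid C{=}T)\,P(W2{=}T \mid C{=}T) = 0.6 \cdot 0.1 \cdot 0.3 = 0.018$$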
and similarly for C=F:
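$$P(C{=}F)\,P(W1{=}T \mid C{=}F)\,P(W2{=}T \mid C{=}F) = 0.4 \cdot 0.3 \cdot 0.4 = 0.048$$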
Since 0.048 > 0.018, the NB classifier tells us to select the C=F category.
Now, let's assume that our attribute independence assumption is wrong: W2 actually depends on W1, and the dependency can be described by the following conditional probability table for P(W2 | W1, C):
| C   | W1   | W2=T | W2=F |
|-----|------|------|------|
| C=T | W1=T | 0.4  | 0.6  |
| C=T | W1=F | 0.3  | 0.7  |
| C=F | W1=T | 0.5  | 0.5  |
| C=F | W1=F | 0.4  | 0.6  |
We can calculate the joint distribution according to the Bayesian network chain rule (see the details on Wikipedia):
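$$P(C, W1, W2) = P(C)\,P(W1 \mid C)\,P(W2 \mid W1, C)$$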
| C | W1 | W2 | P(C,W1,W2) |
|---|----|----|------------|
| T | T  | T  | 0.024      |
| T | T  | F  | 0.036      |
| T | F  | T  | 0.162      |
| T | F  | F  | 0.378      |
| F | T  | T  | 0.06       |
| F | T  | F  | 0.06       |
| F | F  | T  | 0.112      |
| F | F  | F  | 0.168      |
The first value was calculated as:
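$$P(C{=}T, W1{=}T, W2{=}T) = P(C{=}T)\,P(W1{=}T \mid C{=}T)\,P(W2{=}T \mid W1{=}T, C{=}T) = 0.6 \cdot 0.1 \cdot 0.4 = 0.024$$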
Now let's calculate the conditional probabilities P(W2 | C), taking the local dependency into account (W1 is marginalized out):
|              | P    |
|--------------|------|
| P(W2=T\|C=T) | 0.31 |
| P(W2=F\|C=T) | 0.69 |
| P(W2=T\|C=F) | 0.43 |
| P(W2=F\|C=F) | 0.57 |
The first value is calculated from the joint distribution as follows:
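$$P(W2{=}T \mid C{=}T) = \frac{P(C{=}T, W1{=}T, W2{=}T) + P(C{=}T, W1{=}F, W2{=}T)}{P(C{=}T)} = \frac{0.024 + 0.162}{0.6} = 0.31$$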
Then the ddr values (the local dependence derivative ratios, in the terminology of Harry's article) can be calculated. Taking ddr(T) to be the ratio for W2=T with the parent value W1=T from the instance above:
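$$ddr(T) = \frac{P(W2{=}T \mid W1{=}T, C{=}T)\,/\,P(W2{=}T \mid C{=}T)}{P(W2{=}T \mid W1{=}T, C{=}F)\,/\,P(W2{=}T \mid C{=}F)} = \frac{0.4/0.31}{0.5/0.43} \approx 1.11$$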
Similarly for ddr(F):
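$$ddr(F) = \frac{P(W2{=}F \mid W1{=}T, C{=}T)\,/\,P(W2{=}F \mid C{=}T)}{P(W2{=}F \mid W1{=}T, C{=}F)\,/\,P(W2{=}F \mid C{=}F)} = \frac{0.6/0.69}{0.5/0.57} \approx 0.99$$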
Both ddr(T) and ddr(F) are close to 1. So, following Harry's article, we can conclude that the local dependency is distributed evenly between the two categories and does not change the classifier's decision. This also means that the NB classifier is an optimal solution for this problem.
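To double-check this conclusion, here is a minimal Python sketch (the helper names are mine, not from the article) that compares the NB decision with the exact Bayes decision computed from the joint distribution above; the two classifiers agree on all four possible documents:

```python
from itertools import product

# Probabilities from the tables above.
P_C  = {"T": 0.6, "F": 0.4}                              # P(C)
P_W1 = {("T", "T"): 0.1, ("F", "T"): 0.9,                # P(W1 | C), key = (W1, C)
        ("T", "F"): 0.3, ("F", "F"): 0.7}
P_W2 = {("T", "T"): 0.3, ("F", "T"): 0.7,                # P(W2 | C), key = (W2, C)
        ("T", "F"): 0.4, ("F", "F"): 0.6}
P_W2_DEP = {("T", "T", "T"): 0.4, ("F", "T", "T"): 0.6,  # P(W2 | W1, C),
            ("T", "F", "T"): 0.3, ("F", "F", "T"): 0.7,  # key = (W2, W1, C)
            ("T", "T", "F"): 0.5, ("F", "T", "F"): 0.5,
            ("T", "F", "F"): 0.4, ("F", "F", "F"): 0.6}

def nb_decision(w1, w2):
    """Naive Bayes: argmax over C of P(C) * P(W1|C) * P(W2|C)."""
    return max("TF", key=lambda c: P_C[c] * P_W1[(w1, c)] * P_W2[(w2, c)])

def exact_decision(w1, w2):
    """Exact Bayes: argmax over C of the joint P(C) * P(W1|C) * P(W2|W1,C)."""
    return max("TF", key=lambda c: P_C[c] * P_W1[(w1, c)] * P_W2_DEP[(w2, w1, c)])

for w1, w2 in product("TF", repeat=2):
    assert nb_decision(w1, w2) == exact_decision(w1, w2)
    print(f"W1={w1}, W2={w2}: NB chooses C={nb_decision(w1, w2)}, "
          f"exact Bayes chooses C={exact_decision(w1, w2)}")
```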
In the next part of this post we will verify the optimality of the NB classifier and consider a counter-example where the NB classifier is not optimal.