Today, we will learn about the competence of a very “naive” algorithm known as Naive Bayes. We will also discuss its implementation (from scratch) on a dataset containing both categorical and continuous values.

Naive Bayes is a supervised learning algorithm used for classification problems. It comes from the family of “probabilistic classifiers” and applies Bayes’ theorem to classify objects under the assumption that all the features are independent of each other. This assumption is what makes it “naive”, because in real life it is highly unlikely to find datasets with no multicollinearity or zero correlation amongst the predictors.

Despite this assumption, it does a pretty good job of classifying objects.

So before we implement our Naive Bayes algorithm, we should first understand what Bayes’ theorem is all about. And before we learn about Bayes’ theorem, we need to understand the concept of conditional probability.

## Conditional Probability:

Conditional probability is the probability of an event occurring given that some other event (the condition) has occurred. It is written as P(event|condition).

Let’s understand it using a “very sophisticated” example.

Suppose a person wants to punch a wall (I don’t know why, but moving on). What is the probability that it will hurt, given that he punches the wall? Here, the event whose probability we want is “hurt hands” and the condition is “punching the wall”.

Our example would be represented as P(“hurt hands”|“punching a wall”).

It should be noted that,

P(“hurt hands”|“punching a wall”) ≠ P(“punching a wall”|“hurt hands”)

The probability of a person hurting his hand given that he punched a wall is not the same as the probability of him having punched a wall given that his hand hurts. (Just read it a few times and it will make sense.)
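To make this concrete, here is a minimal Python sketch of the definition P(event|condition) = P(event and condition) / P(condition). The numbers are made up purely for illustration:

```python
def conditional_probability(p_joint, p_condition):
    """P(event | condition) = P(event and condition) / P(condition)."""
    return p_joint / p_condition

# Hypothetical numbers: out of 100 observations, 40 involve punching a
# wall, and 36 of those also end with hurt hands.
p_punch = 40 / 100
p_hurt_and_punch = 36 / 100

print(conditional_probability(p_hurt_and_punch, p_punch))  # 0.9
```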

## Coming back to Bayes Theorem…

This theorem was introduced by the Reverend Thomas Bayes, an 18th-century English statistician.

It gives us a way to calculate conditional probabilities.

Mathematically, this can be described as:

P(A|B) = P(B|A)*P(A)/P(B)

where A is the event whose probability we want and B is the observed condition.

Many times, we are not given the value of the denominator P(B) (this term is also known as the *evidence*). We can calculate it by using the following equation:

P(B) = P(B|A)*P(A) + P(B|not A)*P(not A)

**Note that:**

- P(not A) = 1-P(A)
- P(B|not A) = 1-P(not B|not A)

Using the above equation, we get a new formulation of Bayes’ theorem:

P(A|B) = P(B|A)*P(A)/(P(B|A)*P(A) + P(B|not A)*P(not A))
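The expanded formula can be sketched in a few lines of Python. The probability values below are hypothetical placeholders, not numbers from our example:

```python
def bayes(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) via Bayes' theorem, expanding the evidence P(B)
    with the law of total probability."""
    p_not_a = 1 - p_a
    evidence = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return (p_b_given_a * p_a) / evidence

# Hypothetical inputs: P(B|A) = 0.8, P(A) = 0.3, P(B|not A) = 0.2
print(bayes(0.8, 0.3, 0.2))  # 0.24 / 0.38 ≈ 0.632
```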

To build our classifier, we follow these steps:

- Find the different classes/groups in our response variable. In our example, the response variable is “flu?” and it is divided into 2 classes: “Yes” and “No”.
- For each predictor variable, we determine the different values it can take. For example, the “chills”, “fever” and “runny nose” columns take the values “Yes” and “No”, while the “headache” column takes “Mild”, “No” and “Strong”.
- For each value of a particular predictor, we find its “likelihood” of being classified into a certain class of the response variable. For example,

given that someone has the flu, what is the probability that they have a mild headache, i.e. P(headache = “mild”|flu = “yes”)?

The total number of rows with headache = “mild” is 3. Out of these three rows, we see that 2 of them are classified into the “yes” class in the “flu?” column.

Hence, P(headache = “mild”|flu = “yes”) = 2/3

Similarly, we calculate the others as follows:

**Given that flu = “yes” (likelihoods):**

- P(headache = “mild”|flu = “yes”) = 2/3
- P(headache = “no”|flu = “yes”) = 1/2
- P(headache = “strong”|flu = “yes”) = 2/3
- P(chills = “yes”|flu = “yes”) = 3/4 and P(chills = “no”|flu = “yes”) = 2/4
- P(runny nose = “yes”|flu = “yes”) = 4/5 and P(runny nose = “no”|flu = “yes”) = 1/3
- P(fever = “yes”|flu = “yes”) = 4/5 and P(fever = “no”|flu = “yes”) = 1/3

**Given that flu = “no” (likelihoods):**

- P(headache = “mild”|flu = “no”) = 1/3
- P(headache = “no”|flu = “no”) = 1/2
- P(headache = “strong”|flu = “no”) = 1/3
- P(chills = “yes”|flu = “no”) = 1/4 and P(chills = “no”|flu = “no”) = 2/4
- P(runny nose = “yes”|flu = “no”) = 1/5 and P(runny nose = “no”|flu = “no”) = 2/3
- P(fever = “yes”|flu = “no”) = 1/5 and P(fever = “no”|flu = “no”) = 2/3

## Other probabilities:

- P(flu = “yes”) = 5/8 and P(flu = “no”) = 3/8
- P(chills = “yes”) = 4/8 and P(chills = “no”) = 4/8
- P(headache = “mild”) = 3/8; P(headache = “strong”) = 3/8 and P(headache = “no”) = 2/8
- P(runny nose = “yes”) = 5/8 and P(runny nose = “no”) = 3/8
- P(fever = “yes”) = 5/8 and P(fever = “no”) = 3/8
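The counting procedure behind these probability tables can be sketched as follows. Since the article’s table is shown as an image, the rows below are a made-up stand-in dataset, so the resulting numbers differ from the ones above:

```python
# Hypothetical mini-dataset (NOT the article's table): each row is
# (chills, runny_nose, headache, fever, flu).
rows = [
    ("yes", "no",  "mild",   "yes", "no"),
    ("yes", "yes", "no",     "no",  "yes"),
    ("yes", "no",  "strong", "yes", "yes"),
    ("no",  "yes", "mild",   "yes", "yes"),
    ("no",  "no",  "no",     "no",  "no"),
    ("no",  "yes", "strong", "yes", "yes"),
]

def likelihood(rows, feature_idx, value, label):
    """Estimate P(feature = value | flu = label) by counting rows."""
    in_class = [r for r in rows if r[-1] == label]
    matches = sum(1 for r in in_class if r[feature_idx] == value)
    return matches / len(in_class)

# Index 2 is the headache column in this toy table.
print(likelihood(rows, 2, "mild", "yes"))  # 1/4 = 0.25
```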

## Example: To find the probability of having the flu given a runny nose (posterior) using Bayes’ theorem

**P(flu = “yes”|runny nose = “yes”)** = P(runny nose = “yes”|flu = “yes”) * P(flu = “yes”) / P(runny nose = “yes”)

= ((4/5)*(5/8))/(5/8) = 4/5

## Now, we will predict whether, for a given set of conditions, a person will have the flu or not

Suppose the conditions are chills = “yes”, runny nose = “no”, headache = “mild” and fever = “no”. Let this set of conditions be X. We need to calculate:

- P(flu = “yes”|X) = P(X|flu = “yes”)* P(flu= “yes”)/P(X)
- P(flu = “no”|X) = P(X|flu = “no”)* P(flu= “no”)/P(X)

Since P(X) is common to both expressions, we can ignore it when comparing them.

**Calculating P(flu = “yes”|X) = (5/8) * (3/4 * 1/3 * 2/3 * 1/3) = 0.0347 by multiplying the values given below:**

- P(flu = “yes”) = 5/8
- P(chills = “yes”|flu = “yes”) = 3/4
- P(runny nose = “no”|flu = “yes”) = 1/3
- P(headache = “mild”|flu = “yes”) = 2/3
- P(fever = “no”|flu = “yes”) = 1/3

The likelihood above was calculated using the naive independence assumption:

P(X₁, X₂, X₃,…Xₙ| “yes”) = P(X₁|“yes”)*P(X₂|“yes”)*P(X₃|“yes”)….*P(Xₙ|“yes”)

Similarly we calculate for flu = “no”,

Calculating P(flu = “no”|X) = (3/8)*(1/4)*(2/3)*(1/3)*(2/3) = 0.0139 by multiplying the values given below:

- P(flu = “no”) = 3/8
- P(chills = “yes”|flu = “no”) = 1/4
- P(runny nose = “no”|flu = “no”) = 2/3
- P(headache = “mild”|flu = “no”) = 1/3
- P(fever = “no”|flu = “no”) = 2/3

Since P(flu = “no”|X) < P(flu = “yes”|X), it is highly likely that a person will have the flu for the given set of conditions.
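Plugging the article’s numbers into a few lines of Python reproduces the comparison:

```python
# Unnormalized posteriors for X = (chills="yes", runny nose="no",
# headache="mild", fever="no"), using the likelihoods listed above.
p_yes = (5/8) * (3/4) * (1/3) * (2/3) * (1/3)  # P(flu="yes") * P(X|flu="yes")
p_no  = (3/8) * (1/4) * (2/3) * (1/3) * (2/3)  # P(flu="no")  * P(X|flu="no")

print(round(p_yes, 4), round(p_no, 4))  # 0.0347 0.0139
prediction = "yes" if p_yes > p_no else "no"
print(prediction)  # yes -- predict flu
```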

Oof that is a lot of calculations.

All the calculations we have done so far for finding conditional probabilities apply to categorical data. They won’t work for continuous values, for an obvious reason: a continuous feature rarely repeats an exact value, so counting occurrences breaks down. This is where we introduce Gaussian Naive Bayes.

If we want to calculate the likelihood of a point given a condition y = c:

- First, find the rows which satisfy y = c.
- Calculate the mean and standard deviation of each feature/column over this set of rows.
- For each data point in this set of rows, plug the feature value x, along with the column’s mean μ and standard deviation σ, into the Gaussian probability density function:

P(x) = (1/(σ√(2π))) * e^(−(x−μ)²/(2σ²))

- The resulting values for each column are then multiplied to find the likelihood of that data point, by applying P(X|Y=c) = P(X₁, X₂, X₃,…Xₙ|Y = c) = P(X₁|Y = c)*P(X₂|Y = c)*P(X₃|Y = c)…*P(Xₙ|Y = c)

where X = {X₁, X₂, X₃,…Xₙ} are the features/independent variables.
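A minimal sketch of this Gaussian likelihood computation, assuming the per-class means and standard deviations have already been estimated (the numbers here are hypothetical):

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian density N(x; mean, std^2), used as the per-feature likelihood."""
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

def class_likelihood(point, means, stds):
    """P(X|Y=c): product of per-feature Gaussian densities."""
    p = 1.0
    for x, mu, sigma in zip(point, means, stds):
        p *= gaussian_pdf(x, mu, sigma)
    return p

# Hypothetical class statistics for two continuous features,
# computed from the rows where y = c.
means = [5.0, 1.2]
stds = [0.8, 0.3]
print(class_likelihood([5.2, 1.0], means, stds))
```

Multiplying this likelihood by the class prior P(Y=c), exactly as in the categorical case, gives the unnormalized posterior used for prediction.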