Naïve Bayes Classifier
An introduction to the Naïve Bayes classifier, with methods to tune the algorithm and improve the results.
Naive + Bayes: the name of the algorithm itself has much to tell. The algorithm is called naive because it assumes that each feature is independent of the others, which is unrealistic for real data. Bayes refers to the 18th-century statistician Thomas Bayes, who gave the world the famous Bayes' theorem. It is a supervised classification algorithm.
What is Bayes' theorem?
Bayes’ theorem in simple terms tells us how two events are connected to each other. For example, your probability of getting a parking space is connected to the time of day you park, where you park, and what conventions are going on at any time.
Let's look at the equation for Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
- P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
- P(B|A) is the probability of data B given that the hypothesis A was true.
- P(A) is the probability of hypothesis A being true (regardless of the data). This is called the prior, i.e. what you believed was true before you saw the evidence.
- P(B) is the probability of the data (regardless of the hypothesis). This is basically a normalising constant.
If you are wondering what P(A|B) or P(B|A) means: these are conditional probabilities, with the formula
P(A|B) = P(A and B) / P(B), and likewise P(B|A) = P(A and B) / P(A).
Naïve Bayes is a statistical classification technique based on Bayes' theorem and one of the simplest supervised learning algorithms. The Naïve Bayes classifier is a fast, accurate and reliable algorithm, and it achieves high accuracy and speed on large datasets.
How does the Naïve Bayes algorithm work?
Naïve Bayes applies Bayes' theorem under the assumption that every feature is independent of the others given the class. For a label y (the output) and a list of features f1, f2, ..., fn, the posterior probability is:
P(y | f1, f2, ..., fn) = {P(f1 | y) * P(f2 | y) * ... * P(fn | y) * P(y)} / {P(f1) * P(f2) * ... * P(fn)}
The denominator is the same for every class, so we only need to compare the numerators and then normalise. For a two-class problem (yes/no), for example:
P(yes | f1, ..., fn) = P(f1, ..., fn | yes) * P(yes) / {P(f1, ..., fn | yes) * P(yes) + P(f1, ..., fn | no) * P(no)}
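To make the formula concrete, here is a minimal sketch, not from the original post, of how the naive product of per-feature likelihoods turns into normalised class probabilities. The priors and likelihood values below are invented purely for illustration.
# minimal sketch of the Naive Bayes posterior computation
def naive_bayes_posterior(priors, likelihoods):
    # priors: {class: P(class)}; likelihoods: {class: [P(f1|class), ..., P(fn|class)]}
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for p in likelihoods[cls]:      # naive assumption: multiply per-feature likelihoods
            score *= p
        scores[cls] = score
    total = sum(scores.values())        # the shared denominator, used to normalise
    return {cls: s / total for cls, s in scores.items()}

# hypothetical two-class example with two features
priors = {"yes": 0.64, "no": 0.36}
likelihoods = {"yes": [0.33, 0.50], "no": [0.40, 0.20]}
print(naive_bayes_posterior(priors, likelihoods))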
Still not clear how this works in practice? Don't worry, the following example should help:
Consider an example of weather conditions and playing sport. The example uses a training data set of 14 observations of weather with the corresponding target variable Play (indicating whether a game was played). We need to classify whether players will play or not, based on the weather condition. In this data set, the weather is Sunny on 5 of the 14 days, 3 of which have Play = Yes, and Play = Yes on 9 of the 14 days overall. Let's follow the steps below to perform the classification.
- Step 1: Calculate the prior probability for the given class labels.
- Step 2: Find the likelihood probability of each attribute for each class.
- Step 3: Put these values into the Bayes formula and calculate the posterior probability.
- Step 4: See which class has the higher posterior probability; the input is assigned to that class.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above.
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability, so we predict that the players will play when the weather is sunny.
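The same arithmetic can be checked in a few lines of Python. This is just a sketch of the hand calculation above, using the probabilities quoted in the example.
# checking the worked example by hand
p_sunny_given_yes = 3 / 9    # P(Sunny | Yes), the likelihood
p_yes = 9 / 14               # P(Yes), the prior
p_sunny = 5 / 14             # P(Sunny), the evidence
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6, so "play" is the more likely class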
Naïve Bayes uses a similar method to predict the probabilities of different classes based on various attributes. This algorithm is mostly used in text classification and in problems with multiple classes.
How to build a basic model using Naïve Bayes in Python?
Again, scikit-learn (a Python library) helps us directly build a Naïve Bayes model in Python. There are three types of Naïve Bayes models in scikit-learn:
- Gaussian: It is used in classification and it assumes that features follow a normal distribution.
- Multinomial: It is used for discrete counts. For example, in a text classification problem, instead of recording whether a word occurs in the document (a Bernoulli trial), we count how often each word occurs in the document; you can think of it as "the number of times outcome x_i is observed over n trials".
- Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with a "bag of words" model (Natural Language Processing algorithms will be covered in future blogs), where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.
# importing required libraries
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# you must select the model according to the needs of your dataset;
# GaussianNB is used for the rest of this example
model = GaussianNB()
# model = MultinomialNB()
# model = BernoulliNB()

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)

# accuracy score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# accuracy score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
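The snippet above assumes that train_x, train_y, test_x and test_y have already been prepared from your own dataset. For a self-contained run, here is a minimal sketch, not part of the original post, that builds those splits from scikit-learn's bundled Iris dataset and fits GaussianNB end to end.
# self-contained sketch: GaussianNB on the bundled Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()
model.fit(train_x, train_y)

predict_test = model.predict(test_x)
print('accuracy_score on test dataset : ', accuracy_score(test_y, predict_test))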
Naïve Bayes is simple, intuitive, and yet performs surprisingly well on a large range of complex problems. For example, the spam filters used by email apps are typically built on Naïve Bayes.
Its assumption of feature independence, and its effectiveness at solving multi-class problems, make it well suited to sentiment analysis. Sentiment analysis refers to identifying the positive or negative sentiment of a target group (customers, an audience, etc.), as sketched below.
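As a rough illustration of that use case, here is a small sketch of sentiment classification with MultinomialNB on a bag-of-words representation; the toy sentences and labels are invented for the example.
# toy sentiment classifier: bag-of-words counts + MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great product, loved it",
         "terrible quality, very disappointed",
         "works perfectly, highly recommend",
         "awful, waste of money"]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()                 # word counts feed the multinomial model
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["highly recommend this product"])))  # expected: positive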
Tips to improve the power of Naïve Bayes Model
Here are some tips for improving the power of a Naïve Bayes model:
- If continuous features do not follow a normal distribution, use a transformation or another method to bring them closer to a normal distribution (see the sketch after this list).
- If the test data set has a zero-frequency issue, apply a smoothing technique such as Laplace correction to predict the class of the test data (also demonstrated in the sketch below). Laplace correction adds one occurrence of each value per class; if the dataset is large enough, this extra count barely changes the estimated probabilities, but it prevents them from collapsing to zero.
- Remove correlated features, as highly correlated features are effectively voted twice in the model, which can over-inflate their importance.
- Naïve Bayes classifiers have limited options for parameter tuning, such as alpha=1 for smoothing and fit_prior=[True / False] to control whether class prior probabilities are learned (see the scikit-learn documentation for details). I would recommend focusing on data pre-processing and feature selection instead.
- You might think of applying a classifier-combination technique such as ensembling, bagging or boosting, but these methods would not help. Their purpose is to reduce variance, and Naïve Bayes has no variance to minimise.
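To make the first two tips concrete, here is a small sketch with synthetic data, showing a log transformation of a skewed feature before GaussianNB, and Laplace smoothing via the alpha parameter of MultinomialNB.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)                        # synthetic binary labels

# Tip 1: transform a skewed (non-normal) continuous feature before GaussianNB
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 1))   # heavily right-skewed feature
transformed = np.log1p(skewed)                               # much closer to normal after log
GaussianNB().fit(transformed, labels)

# Tip 2: Laplace correction for zero frequencies via the alpha parameter
counts = rng.integers(0, 5, size=(100, 6))                   # synthetic count features
MultinomialNB(alpha=1.0).fit(counts, labels)                 # alpha=1.0 is Laplace smoothing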
Conclusion
Congratulations, you have made it to the end of this tutorial!
In this tutorial, you learned about the Naïve Bayes algorithm, how it works, its independence assumption, its issues, its implementation, and its advantages and disadvantages. Along the way, you also learned about model building and evaluation in scikit-learn for binary and multi-class problems.
Naïve Bayes is one of the most straightforward yet most potent algorithms. In spite of the significant advances in Machine Learning over the last couple of years, it has proved its worth. It has been successfully deployed in many applications, from text analytics to recommendation engines.
That's all for this blog. Have a great day!