Association rule mining | Apriori Algorithm

This blog is about an unsupervised learning algorithm, the Apriori algorithm, which many supermarkets use to increase their sales.

Deep Patel
5 min read · Jul 8, 2023
Market Basket Analysis (Source)

We all go to the supermarket to buy our favourite and necessary products, and we feel better when we find things easily, i.e. when we don’t have to search around for them. You must have noticed that whenever you buy bread, you find butter or eggs placed nearby, and the same holds for many other products. Have you ever wondered how a supermarket does this? It is done with market basket analysis, and the algorithm used in this process is the Apriori algorithm, which is based on association rule mining. Supermarkets usually group items that are bought together so that people can buy them easily, and in the end, their sales increase.

What is Association Rule Mining?

In simple words, association rule mining is based on IF/THEN statements. Association, as the name suggests, finds relationships between different items. And the best thing about it is that it also works with non-numeric or categorical datasets.

Association rule mining finds frequently occurring patterns between the given data. An association rule has two parts:

  • Antecedent (IF)
  • Consequent (THEN)

For example: if a person in the supermarket buys bread, then it’s more likely that they will also buy butter, jam or eggs.

How Does Association Rule Mining Work?

Association rule mining basically calculates how frequently product Y is bought when product X is bought. Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data. It calculates the association not just between two products, but across many products at once. Let’s first understand the important terms:

Formulas for support, confidence and lift (Source: Author)
  • Support — Support indicates how frequent an item is. The support of an item X is the ratio of the number of transactions in which X appears to the total number of transactions. The greater the support, the more frequently that item is bought.
  • Confidence — Confidence measures how likely product Y is to be bought when product X is bought, i.e. for the rule X => Y. It is calculated as the ratio of Support(X U Y) (i.e. the union) to Support(X).
  • Lift — Lift is a measure of the strength of a rule, or of the performance of the targeting model. The lift of X => Y is calculated by dividing the confidence of X => Y by the support of Y.

Now, let’s understand all these terms by an example dataset.

Dataset (Source: Author)

We will find the values of support, confidence and lift using the formulas discussed above:

Support(wine) = Probability(X = wine) = 4/6 ≈ 0.667 (wine is bought in 4 of the 6 transactions)

Confidence(X = {wine, chips} => Y = {bread}) = Support(wine, chips, bread) / Support(wine, chips)

i.e. Confidence = (2/6) / (3/6) ≈ 0.667

Lift(X = {wine, chips} => Y = {bread}) = Support(wine, chips, bread) / (Support(wine, chips) * Support(bread))

i.e. Lift = (2/6) / ((3/6) * (4/6)) = 1
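To make these numbers concrete, the same calculations can be reproduced with a few lines of plain Python. The transaction list below is a hypothetical reconstruction that matches the counts used above (wine in 4 of 6 baskets, {wine, chips} in 3, {wine, chips, bread} in 2, bread in 4); the original dataset image is not reproduced here.

# Hypothetical transactions chosen to match the counts in the worked example.
transactions = [
    {"wine", "chips", "bread"},
    {"wine", "chips", "bread"},
    {"wine", "chips"},
    {"wine", "bread"},
    {"bread"},
    {"chips"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of the combined itemset divided by support of the antecedent.
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    # Confidence of the rule divided by the support of the consequent.
    return confidence(antecedent, consequent) / support(consequent)

print(support({"wine"}))                         # 4/6 ≈ 0.667
print(confidence({"wine", "chips"}, {"bread"}))  # (2/6)/(3/6) ≈ 0.667
print(lift({"wine", "chips"}, {"bread"}))        # 0.667 / (4/6) = 1.0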

Now, at this point, let’s define one more term: conviction. The value of conviction tells us how strong the dependency between X and Y is.

Conviction(X => Y) = (1 - Support(Y)) / (1 - Confidence(X => Y))

  • Conviction(X => Y) = 1 means that X has no relation with Y.
  • A high conviction value means that the consequent is highly dependent on the antecedent.
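Plugging the numbers from the running example into this formula gives a quick sanity check (values reused from the calculations above):

# Values taken from the worked example above.
support_bread = 4 / 6                  # Support(Y) for Y = {bread}
confidence_rule = (2 / 6) / (3 / 6)    # Confidence({wine, chips} => {bread})

# Conviction(X => Y) = (1 - Support(Y)) / (1 - Confidence(X => Y))
conviction = (1 - support_bread) / (1 - confidence_rule)
print(conviction)  # 1.0, i.e. {wine, chips} and {bread} appear independent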

Now, let’s go step by step, applying all the things we learnt. First, create a frequency table of all items.

Frequency table from the above dataset (frequency out of 6) (Source: Author)
  • Step 1: Choose a minimum support threshold. Then create the frequency (support) table and keep only the items whose value is greater than the threshold:
Step 1 table (Source: Author)
  • Step 2: Make doublets (pairs) of the remaining items and calculate their frequency. Remove all pairs whose value is less than the threshold.
Step 2 table (Source: Author)
  • Step 3: Make triplets of items and calculate their frequency. Again, remove the triplets whose value is less than the threshold. A small code sketch of this level-wise process follows below.
Step 3 table (Source: Author)
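The three steps above can be written as a small level-wise loop: start from frequent single items, join them into candidates one item larger, and drop every candidate whose support falls below the threshold. This is only an illustrative sketch, reusing the hypothetical transactions from earlier and an arbitrarily chosen threshold of 2/6, not the library implementation.

# Level-wise frequent-itemset search (illustrative sketch).
transactions = [
    {"wine", "chips", "bread"},
    {"wine", "chips", "bread"},
    {"wine", "chips"},
    {"wine", "bread"},
    {"bread"},
    {"chips"},
]
min_support = 2 / 6  # threshold chosen for illustration

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: keep the frequent single items.
items = sorted({item for t in transactions for item in t})
level = [frozenset({i}) for i in items if support({i}) >= min_support]

# Steps 2, 3, ...: join frequent itemsets into candidates one item larger,
# then prune candidates whose support is below the threshold.
while level:
    print([sorted(s) for s in level])
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]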

Implementation with Python

# First, check for or install apyori
# pip install apyori

# Import the dataset
import pandas as pd
data = pd.read_excel("Movie_reccommendation.xlsx")

# Convert the dataframe into a list of transactions, because the apyori
# algorithm takes a list of lists (one list of item names per transaction)
# rather than a dataframe.
# n is the number of columns, i.e. the number of different items (bread, beer, ...)
n = data.shape[1]
observations = []
for i in range(len(data)):
    observations.append([str(data.values[i, j]) for j in range(n)])

# Fitting the data to the algorithm.
# These arguments can be set by trying out different values and checking
# whether the resulting rules describe a valid association between items.
from apyori import apriori
associations = list(apriori(observations, min_length = 2, min_support = 0.2,
                            min_confidence = 0.2, min_lift = 3))

# Viewing the first result
print(associations[0])

# Viewing all results
for i in range(len(associations)):
    print(associations[i])

# NOTE: Viewing and understanding the results is the most important task.
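Each element returned by apyori is a RelationRecord that carries the itemset, its support, and the ordered statistics (antecedent, consequent, confidence and lift) of the rules derived from it. The snippet below is a small sketch for printing these in a more readable rule form; the field names used are the ones exposed by the apyori package.

# Print each rule as "antecedent -> consequent" with its metrics.
# Field names (items_base, items_add, confidence, lift) are those exposed
# by apyori's RelationRecord / OrderedStatistic objects.
for record in associations:
    for stat in record.ordered_statistics:
        antecedent = ", ".join(stat.items_base)
        consequent = ", ".join(stat.items_add)
        print(f"{antecedent} -> {consequent} "
              f"(support={record.support:.3f}, "
              f"confidence={stat.confidence:.3f}, lift={stat.lift:.3f})")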

Pros of the Apriori algorithm

  1. The algorithm is easy to understand and implement.
  2. It can be used on large data sets.

Cons of the Apriori Algorithm

  1. It may need to generate a large number of candidate itemsets and rules, which can be computationally expensive.
  2. Calculating support is costly, because it requires scanning the entire transaction database repeatedly.

This was all about market basket analysis using the Apriori algorithm.


Written by Deep Patel

Hi everyone, I am a software engineer at American Express. I am here to share my experience in tech.