The **Jaccard Similarity Index** is a measure of the similarity between two sets of data.

Developed by Paul Jaccard, the index ranges from 0 to 1. The closer the value is to 1, the more similar the two sets of data are.

The Jaccard similarity index is calculated as:

**Jaccard Similarity** = (number of observations in both sets) / (number in either set)

Or, written in notation form:

**J(A, B)** = |A∩B| / |A∪B|

If two datasets share the exact same members, their Jaccard Similarity Index will be 1. Conversely, if they have no members in common then their similarity will be 0.
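The formula maps directly onto Python's built-in set operations. The following is a minimal sketch rather than a canonical implementation: the function name `jaccard_similarity` is hypothetical, and raising an error when both inputs are empty is an assumption, since the ratio is undefined (0 / 0) in that case.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables of hashable items."""
    set_a, set_b = set(a), set(b)  # converting to sets drops duplicates
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs are treated as an error (0 / 0 is undefined)
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(set_a & set_b) / len(union)


print(jaccard_similarity([1, 2, 3], [3, 2, 1]))  # identical members -> 1.0
print(jaccard_similarity([1, 2, 3], [4, 5, 6]))  # no common members -> 0.0
```

Because the inputs are converted to sets, the order of the members and any repeated values do not affect the result.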

The following examples show how to calculate the Jaccard Similarity Index for a few different datasets.

**Example 1: Jaccard Similarity**

Suppose we have the following two sets of data:

A = [0, 1, 2, 5, 6, 8, 9]

B = [0, 2, 3, 4, 5, 7, 9]

To calculate the Jaccard Similarity between them, we first count the observations that appear in both sets, then divide by the number of distinct observations that appear in either set:

**Number of observations in both:** {0, 2, 5, 9} = 4

**Number of observations in either:** {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} = 10

**Jaccard Similarity:** 4 / 10 = 0.4

The Jaccard Similarity Index turns out to be **0.4**.
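The same steps can be reproduced in Python as a quick check. This is only a sketch; the variable names `in_both` and `in_either` are chosen here for illustration.

```python
A = [0, 1, 2, 5, 6, 8, 9]
B = [0, 2, 3, 4, 5, 7, 9]

in_both = set(A) & set(B)    # intersection: members found in A and in B
in_either = set(A) | set(B)  # union: members found in A or in B

print(sorted(in_both))                # [0, 2, 5, 9]
print(len(in_either))                 # 10
print(len(in_both) / len(in_either))  # 0.4
```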

**Example 2: Jaccard Similarity Continued**

Suppose we have the following two sets of data:

C = [0, 1, 2, 3, 4, 5]

D = [6, 7, 8, 9, 10]

To calculate the Jaccard Similarity between them, we first count the observations that appear in both sets, then divide by the number of distinct observations that appear in either set:

**Number of observations in both:** {} = 0

**Number of observations in either:** {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = 11

**Jaccard Similarity:** 0 / 11 = 0

The Jaccard Similarity Index turns out to be **0**. This indicates that the two datasets share no common members.
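In Python, sets with no common members can also be detected with the built-in `set.isdisjoint` method, which implies a Jaccard Similarity of 0 whenever at least one set is non-empty. A short sketch using the two sets from this example:

```python
C = {0, 1, 2, 3, 4, 5}
D = {6, 7, 8, 9, 10}

print(C.isdisjoint(D))          # True: no member of C appears in D
print(len(C & D) / len(C | D))  # 0.0
```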

**Example 3: Jaccard Similarity for Characters**

Note that we can also use the Jaccard Similarity Index for datasets that contain character strings rather than numbers.

For example, suppose we have the following two sets of data:

E = ['cat', 'dog', 'hippo', 'monkey']

F = ['monkey', 'rhino', 'ostrich', 'salmon']

To calculate the Jaccard Similarity between them, we first count the observations that appear in both sets, then divide by the number of distinct observations that appear in either set:

**Number of observations in both:** {'monkey'} = 1

**Number of observations in either:** {'cat', 'dog', 'hippo', 'monkey', 'rhino', 'ostrich', 'salmon'} = 7

**Jaccard Similarity:** 1 / 7 ≈ 0.142857

The Jaccard Similarity Index turns out to be **0.142857**. Since this number is fairly low, it indicates that the two sets are quite dissimilar.
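Set operations in Python work the same way for strings as for numbers. The sketch below repeats this example and uses the standard-library `fractions.Fraction` to show the exact ratio before rounding; the variable names `shared`, `combined`, and `exact` are illustrative only.

```python
from fractions import Fraction

E = ['cat', 'dog', 'hippo', 'monkey']
F = ['monkey', 'rhino', 'ostrich', 'salmon']

shared = set(E) & set(F)    # strings found in both lists
combined = set(E) | set(F)  # distinct strings found in either list

exact = Fraction(len(shared), len(combined))
print(shared)                  # {'monkey'}
print(exact)                   # 1/7
print(round(float(exact), 6))  # 0.142857
```

Note that the comparison is case-sensitive: 'Monkey' and 'monkey' would count as different members unless the strings are normalized first.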

**The Jaccard Distance**

The **Jaccard distance** measures the *dissimilarity* between two datasets and is calculated as:

Jaccard distance = 1 – Jaccard Similarity

This measure gives us an idea of how *different* two datasets are: the closer the distance is to 1, the fewer members they share.

For example, if two datasets have a Jaccard Similarity of 80% then they would have a Jaccard distance of 1 – 0.8 = 0.2 or 20%.
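As a rough sketch, the distance can be computed in Python either as 1 minus the similarity or, equivalently, as the number of members found in exactly one set divided by the number found in either set, since |A∪B| − |A∩B| is the size of the symmetric difference. The function name `jaccard_distance` and the sets `G` and `H` below are hypothetical, chosen so that the similarity is 80% as in the example above; raising an error for two empty inputs is also an assumption.

```python
import math


def jaccard_distance(a, b):
    """Return 1 - J(A, B), computed as |A △ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs are treated as an error (0 / 0 is undefined)
        raise ValueError("Jaccard distance is undefined for two empty sets")
    # Symmetric difference (^) holds members found in exactly one of the sets
    return len(set_a ^ set_b) / len(union)


G = [1, 2, 3, 4]     # hypothetical data
H = [1, 2, 3, 4, 5]  # hypothetical data

similarity = len(set(G) & set(H)) / len(set(G) | set(H))
distance = jaccard_distance(G, H)

print(similarity)  # 0.8
print(distance)    # 0.2
print(math.isclose(distance, 1 - similarity))  # True
```

The symmetric-difference form avoids the small floating-point error that subtracting from 1 can introduce, which is why the final comparison uses `math.isclose` rather than `==`.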

**Additional Resources**

The following tutorials explain how to calculate Jaccard Similarity using different statistical software:

How to Calculate Jaccard Similarity in R

How to Calculate Jaccard Similarity in Python