In statistics, we’re often interested in understanding how a dataset is distributed. In particular, there are four things that are helpful to know about a distribution, which the following questions capture (the first two both describe shape):
- Is the distribution symmetrical or skewed to one side?
- Is the distribution unimodal (one peak) or bimodal (two peaks)?
- Are there any outliers present in the distribution?
- What are the mean, median, and mode of the distribution?
- What are the range, interquartile range, standard deviation, and variance of the distribution?
SOCS is a useful acronym that we can use to remember these four things. It stands for “shape, outliers, center, spread.”
Let’s walk through a simple example of how to use SOCS to describe a distribution.
Example: How to Use SOCS to Describe a Distribution
Suppose we have the following dataset that shows the heights of a sample of 20 different plants:
8, 4, 6, 7, 7, 6, 7, 8, 6, 11, 8, 22, 10, 9, 9, 7, 5, 7, 6, 4
Here is how we can use SOCS to describe this distribution of data values.
First, we want to describe the shape of the distribution.
One helpful way to visualize the shape of the distribution is to create a histogram, which displays the frequency of each value in the dataset.
Is the distribution symmetrical or skewed to one side? From the histogram, we can see that the distribution is roughly symmetrical. That is, the values aren’t skewed to one side or the other.
Is the distribution unimodal (one peak) or bimodal (two peaks)? The distribution is unimodal. It has one peak at the value “7.”
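As a rough stand-in for the histogram, the frequencies can also be tallied in code. The following is a minimal Python sketch, assuming one bar per integer height (the original histogram may have used different bins); the variable names are illustrative only.

```python
from collections import Counter

# The 20 plant heights from the example.
heights = [8, 4, 6, 7, 7, 6, 7, 8, 6, 11, 8, 22, 10, 9, 9, 7, 5, 7, 6, 4]

# Count how often each height occurs.
counts = Counter(heights)

# Print a simple text histogram, one row per integer value in the data range.
# Counter returns 0 for values that never occur, so gaps show as empty rows.
for value in range(min(heights), max(heights) + 1):
    print(f"{value:>2} | {'#' * counts[value]}")

# The tallest bar marks the single peak of the distribution.
peak_value, peak_count = counts.most_common(1)[0]
print(f"Peak at {peak_value} (occurs {peak_count} times)")
```

The long run of empty rows between 11 and 22 also makes the isolated value at 22 easy to spot.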
Next, we want to determine whether there are any outliers in the dataset. By visually inspecting the histogram, we can see that 22 is potentially an outlier.
One common way to formally define an outlier is as any value that lies more than 1.5 times the interquartile range above the third quartile or below the first quartile.
Entering the 20 raw data values into the Interquartile Range Calculator, we find that the third quartile is 9 and the interquartile range is 3. Thus, any value above 9 + (1.5*3) = 13.5 is an outlier, by definition. Likewise, since the first quartile is 9 – 3 = 6, any value below 6 – (1.5*3) = 1.5 is an outlier.
Since 22 is greater than 13.5, we can declare 22 to be an outlier.
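The 1.5 × IQR rule can be sketched in a few lines of Python. This example takes the third quartile (9) and interquartile range (3) reported above as given inputs rather than recomputing them, because quartile conventions differ between tools and a library function may return slightly different quartiles than the calculator. The variable names are illustrative only.

```python
# The 20 plant heights from the example.
heights = [8, 4, 6, 7, 7, 6, 7, 8, 6, 11, 8, 22, 10, 9, 9, 7, 5, 7, 6, 4]

# Quartile values as reported in the text (assumed, not recomputed here).
q3 = 9.0
iqr = 3.0
q1 = q3 - iqr  # first quartile implied by Q3 and the IQR

# Fences: anything outside [lower_fence, upper_fence] counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [h for h in heights if h < lower_fence or h > upper_fence]

print(f"Lower fence: {lower_fence}, upper fence: {upper_fence}")
print(f"Outliers: {outliers}")
```

With these inputs, the only value that falls outside the fences is 22.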
Next, we want to describe where the center of the distribution is located. Three common measures of central tendency we can use are the mean, the median, and the mode.
Mean: This is the average value in the distribution. We find this by adding up all of the individual values, then dividing by the total number of values:
Mean = (8+4+6+7+7+6+7+8+6+11+8+22+10+9+9+7+5+7+6+4) / 20 = 7.85
Median: This is the “middle” value in the distribution. We find this by arranging all of the values from smallest to greatest, then identifying the middle value. Since there are 20 values, the median is the average of the 10th and 11th values, which are both 7, so the median is 7.
4, 4, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10, 11, 22
Mode: This is the value that occurs most frequently. This turns out to be 7.
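The three measures of center above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative only.

```python
import statistics

# The 20 plant heights from the example.
heights = [8, 4, 6, 7, 7, 6, 7, 8, 6, 11, 8, 22, 10, 9, 9, 7, 5, 7, 6, 4]

# Mean: sum of the values divided by the number of values (157 / 20).
mean = statistics.mean(heights)

# Median: with an even count, the average of the two middle sorted values.
median = statistics.median(heights)

# Mode: the most frequently occurring value.
mode = statistics.mode(heights)

print(f"Mean: {mean}, median: {median}, mode: {mode}")
```

Note that the mean (7.85) sits above the median (7) because the single large value of 22 pulls the average upward, while the median and mode are unaffected by it.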
Next, we want to describe how spread out the values are in the distribution. Four common measures of dispersion we can use are the range, the interquartile range, the standard deviation, and the variance.
Range: This is the difference between the largest and smallest value in the dataset. This turns out to be 22 – 4 = 18.
Interquartile Range: This measures the width of the middle 50% of the data values. Entering the 20 raw data values into the Interquartile Range Calculator shows that this is equal to 3.
Standard Deviation: This is a measure of how spread out the data values are, on average. Entering the 20 raw data values into the Variance and Standard Deviation Calculator shows that the standard deviation is equal to 3.69. (This is the population standard deviation, which divides by the number of values, 20.)
Variance: This is simply the standard deviation, squared. Using the unrounded standard deviation, this is equal to 3.6915² ≈ 13.63.
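The range, standard deviation, and variance can be reproduced with Python's standard `statistics` module. This sketch uses the population versions (`pstdev` and `pvariance`, which divide by n), since those match the 3.69 and 13.63 reported above; the sample versions (`stdev` and `variance`, which divide by n – 1) give slightly larger values. The interquartile range is left out because its value depends on the quartile convention used. The variable names are illustrative only.

```python
import statistics

# The 20 plant heights from the example.
heights = [8, 4, 6, 7, 7, 6, 7, 8, 6, 11, 8, 22, 10, 9, 9, 7, 5, 7, 6, 4]

# Range: largest value minus smallest value.
data_range = max(heights) - min(heights)

# Population standard deviation and variance (divide by n).
std_dev = statistics.pstdev(heights)
variance = statistics.pvariance(heights)

# Sample standard deviation (divides by n - 1), shown for comparison.
sample_std_dev = statistics.stdev(heights)

print(f"Range: {data_range}")
print(f"Population standard deviation: {std_dev:.2f}")
print(f"Population variance: {variance:.2f}")
print(f"Sample standard deviation: {sample_std_dev:.2f}")
```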
From using SOCS as a guide, we were able to describe the distribution of plant heights in the following manner:
- The distribution was unimodal and symmetrical, meaning it only had one peak and it was not skewed to one side or the other.
- The distribution had one outlier: 22.
- The distribution had a mean of 7.85, a median of 7, and a mode of 7.
- The distribution had a range of 18, an interquartile range of 3, a standard deviation of 3.69, and a variance of 13.63.
Note that we can use SOCS to describe any distribution. It is a helpful way to understand the shape of a distribution, whether it has any outliers, where its center is roughly located, and how spread out its values are.