
The **variance** is a way to measure how spread out data values are around the mean.

The formula to find the variance of a population is:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

where $\mu$ is the population mean, $x_i$ is the $i$th element from the population, $N$ is the population size, and $\Sigma$ is just a fancy symbol that means “sum.”

The formula to find the variance of a sample is:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

where $\bar{x}$ is the sample mean, $x_i$ is the $i$th element in the sample, and $n$ is the sample size.
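To make the two formulas concrete, here is a minimal sketch that computes each one by hand in R. The vector `x` and the names `pop_variance` and `samp_variance` are made up purely for illustration.

```r
#hypothetical values, chosen only for illustration
x <- c(2, 4, 6, 8)

#population variance: sum of squared deviations divided by N
pop_variance <- sum((x - mean(x))^2) / length(x)

#sample variance: same sum of squared deviations divided by n-1
samp_variance <- sum((x - mean(x))^2) / (length(x) - 1)

pop_variance
samp_variance

#the built-in var() uses the n-1 formula, so it matches samp_variance
var(x)
```

The only difference between the two calculations is the denominator, which is why the two results get closer together as the number of values grows.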

**Example: Calculate Sample & Population Variance in R**

Suppose we have the following dataset in R:

#define dataset data

We can calculate the **sample variance** by using the **var()** function in R:

```r
#calculate sample variance
var(data)

[1] 46.01111
```

And we can calculate the **population variance** by multiplying the sample variance by (n-1)/n as follows:

```r
#determine length of data
n <- length(data)

#calculate population variance
var(data) * (n-1)/n

[1] 41.41
```

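Note that the population variance will always be smaller than the sample variance, because it divides the same sum of squared deviations by n instead of n-1. The only exception is when every value is identical, in which case both variances are zero.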
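Since base R has no built-in function for population variance, one option is to wrap the (n-1)/n adjustment in a small helper. The sketch below is only one way to do it: `pop_var()` is a hypothetical name, and the vector `x` is an assumed ten-value example that yields the same two results as the outputs shown above, not necessarily the original dataset.

```r
#hypothetical helper (not part of base R): population variance of a vector
pop_var <- function(x) {
  n <- length(x)
  var(x) * (n - 1) / n
}

#assumed example values, for illustration only
x <- c(2, 4, 4, 7, 8, 12, 14, 15, 19, 22)

#sample variance
var(x)

#population variance
pop_var(x)

#confirm the population variance is the smaller of the two
pop_var(x) < var(x)
```

Note that the helper assumes `x` has no missing values. Handling `NA` would require adjusting both `var()` and the count `n`.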

In practice, we typically calculate sample variances for datasets since it’s unusual to collect data for an entire population.

**Example: Calculate Sample Variance of Multiple Columns**

Suppose we have the following data frame in R:

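```r
#create data frame
data <- data.frame(a=c(1, 3, 4, 4, 6, 7, 8, 12),
                   b=c(2, 4, 4, 5, 5, 6, 7, 16),
                   c=c(6, 6, 7, 8, 8, 9, 9, 12))

#view data frame
data

   a  b  c
1  1  2  6
2  3  4  6
3  4  4  7
4  4  5  8
5  6  5  8
6  7  6  9
7  8  7  9
8 12 16 12
```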

We can use the sapply() function to calculate the sample variance of each column in the data frame:

```r
#find sample variance of each column
sapply(data, var)

        a         b         c
11.696429 18.125000  3.839286
```

And we can use the following code to calculate the sample standard deviation of each column, which is simply the square root of the sample variance:

```r
#find sample standard deviation of each column
sapply(data, sd)

       a        b        c
3.420004 4.257347 1.959410
```
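If the population variance of each column is needed instead, the same (n-1)/n adjustment can be applied inside sapply(). The following is a rough sketch rather than a definitive approach. The names `df` and `pop_vars` are chosen here for illustration, and `df` holds the same values as the data frame shown above.

```r
#data frame with the same values as the example above
df <- data.frame(a=c(1, 3, 4, 4, 6, 7, 8, 12),
                 b=c(2, 4, 4, 5, 5, 6, 7, 16),
                 c=c(6, 6, 7, 8, 8, 9, 9, 12))

#find population variance of each column
pop_vars <- sapply(df, function(x) var(x) * (length(x) - 1) / length(x))
pop_vars

#check that sd() equals the square root of var() for every column
all.equal(sapply(df, sd), sqrt(sapply(df, var)))
```

Passing an anonymous function to sapply() keeps the calculation in one line, but a named helper would work just as well if it is needed in more than one place.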
