LibGuides: Data Literacy: Variation in Data

Variation in Data

Simple descriptive statistics such as the mean and median are helpful, but they cannot tell the whole story. Data values may be very close to the mean or median, but they might also be spread out such that most values are far from the mean and few are near it. In these instances, the mean and median alone may not accurately describe the data. This is why we care about "variation" in data.

The two charts above both have a red line representing the mean of 7. Values can be close to the mean, or very far from it. We see that the chart on the left has a much greater spread, while the chart on the right has more closely clustered values. The vertical distance between a data point or individual value and the mean value is called the "residual." In these charts the residual is represented by the green arrows.
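To make the idea of a residual concrete, here is a minimal Python sketch. The two lists are hypothetical values chosen only so that both have a mean of 7, like the charts; they are not the data behind the charts, and the function name is made up for this illustration.

```python
from statistics import mean

# Hypothetical values (not the chart data); both lists have a mean of 7.
wide_spread = [1, 13, 2, 12, 7]
tight_cluster = [6, 8, 7, 6, 8]

def residuals(values):
    """Return each value's signed distance from the mean of the list."""
    center = mean(values)
    return [value - center for value in values]

print(residuals(wide_spread))    # distances as large as 6 in either direction
print(residuals(tight_cluster))  # distances no larger than 1
```

Both lists share the same mean, yet their residuals differ greatly in size, which is the difference the two charts are showing.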

Standard Deviation

The standard deviation is a summary statistic that measures how dispersed the values of a variable are around the mean. It can be thought of as the "average" distance of the data points from the mean. Because the differences are squared, values that are farther from the mean contribute disproportionately more to the standard deviation. Calculating standard deviation involves multiple steps - subtracting the mean from each number, squaring the differences, averaging them, and taking the square root of this new average. Software like Excel makes this easy with a built-in STDEV function. Note that STDEV treats the data as a sample, so in the averaging step it divides by one less than the number of values rather than by the number of values itself. As you can see in the image, entering =STDEV with a data range automatically gives the standard deviation.
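The steps described above can be sketched in Python as follows. This is an illustrative sketch rather than how Excel is implemented: the function name, the `sample` parameter, and the data values are assumptions made up for the example. With `sample=True` it divides by n - 1, which is the convention Excel's STDEV uses; with `sample=False` it divides by n.

```python
from math import sqrt
from statistics import stdev

def standard_deviation(values, sample=True):
    """Compute the standard deviation step by step."""
    n = len(values)
    mean = sum(values) / n
    # Steps 1 and 2: subtract the mean from each value, then square the result.
    squared_diffs = [(value - mean) ** 2 for value in values]
    # Step 3: average the squared differences (n - 1 for a sample, n for a population).
    divisor = n - 1 if sample else n
    variance = sum(squared_diffs) / divisor
    # Step 4: take the square root.
    return sqrt(variance)

# Hypothetical data, chosen so the population result is a round number.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(standard_deviation(data, sample=False))  # population version: 2.0
print(standard_deviation(data))                # sample version, slightly larger
print(stdev(data))                             # Python's built-in sample version
```

The sample version is always a little larger than the population version because it divides by a smaller number; the gap shrinks as the dataset grows.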

Outliers

An outlier is a data point that falls well outside the normal range of values. Outliers have a major impact on the mean, but much less on statistics such as the median. Because the standard deviation is calculated by averaging squared distances from the mean, just one or two outliers can make the standard deviation much larger.

An example: five students take a test and their scores are {63, 74, 80, 85, 91}. The mean is 78.6 and the standard deviation is 10.74 (computed with the sample formula, as Excel's STDEV does). This makes sense - the standard deviation is not that large because the values are clustered fairly closely around the mean. Let's see what happens if we change the lowest score, 63, to an extreme outlier of 5. Even though only this one score changed, the mean falls to 67 and the standard deviation jumps to 35.22! The single outlier is so far from the mean that it makes the standard deviation over three times larger.
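The test-score example can be reproduced with Python's standard `statistics` module. This sketch assumes the outlier of 5 replaces the lowest score, 63, which is the substitution that yields the 35.22 quoted above; the variable names are made up for illustration.

```python
from statistics import mean, median, stdev

scores = [63, 74, 80, 85, 91]
# Assumption: the outlier of 5 replaces the lowest score, 63.
scores_with_outlier = [5, 74, 80, 85, 91]

for label, data in [("original", scores), ("with outlier", scores_with_outlier)]:
    # stdev() uses the sample formula (divides by n - 1), like Excel's STDEV.
    print(label, "mean:", mean(data), "median:", median(data),
          "standard deviation:", round(stdev(data), 2))
```

The standard deviation rises from 10.74 to 35.22 and the mean drops from 78.6 to 67, while the median stays at 80 in both cases.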

It's important to watch out for outliers when interpreting summary statistics. A few very high or low values can inflate the standard deviation, making it seem like the data as a whole are more spread out than most of the values really are.

Quartiles and the Interquartile Range

We know that the spread of the values is important, and that outliers can have hidden impacts on summary statistics. One way to measure the spread of data is to sort the values and divide them into four equal-sized "quartile" groups, separated by three cut points that are also called quartiles. The first quartile is the median of the lower half of the dataset. The second quartile is the median of the whole dataset. The third quartile is the median of the upper half of the dataset.

The interquartile range, or "IQR", is the distance between the third and first quartiles: subtract the first quartile from the third quartile to get it. The IQR spans the middle 50% of the data. Unlike the standard deviation, it depends only on the middle of the data, so it is much less sensitive to a few extreme values. This makes it useful for describing spread when you have outliers or skewed distributions.
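As a rough illustration, the quartiles and IQR can be computed in Python using the "median of each half" approach described above. The function name is hypothetical, and the sketch assumes at least two values. For an odd number of values it leaves the median out of both halves; that is one common convention, and spreadsheet and statistics software may use slightly different rules and return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-each-half method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper_half = ordered[half + n % 2:]
    return median(lower_half), median(ordered), median(upper_half)

# Test scores reused from the example in the Outliers section.
scores = [63, 74, 80, 85, 91]
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3)  # Q1: 68.5 Q2: 80 Q3: 88.0
print("IQR:", iqr)                      # IQR: 19.5
```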