LibGuides: Data Literacy: Analyzing Data

Analyzing Data

Measures of central tendency

A common technique for understanding a dataset is to describe it with measures of central tendency.
The following measures are used to help summarize and understand the 'center' of a dataset.

Mean
- "Average"
Median
- Middle value
Mode
- Most Frequent

The Mean

Definition: The sum of all available observations divided by the total number of observations. Also known as the average.

Example: Five students in a small class got test scores of 63, 74, 80, 85, and 91. Their mean or average score is: (63 + 74 + 80 + 85 + 91)/5 = 393/5 = 78.6
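The calculation above can be reproduced in a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the standard library's statistics.mean is shown alongside the manual calculation as a cross-check.

```python
from statistics import mean

# Test scores from the example above
scores = [63, 74, 80, 85, 91]

# Sum of all observations divided by the number of observations
manual_mean = sum(scores) / len(scores)

# The standard library performs the same calculation
library_mean = mean(scores)

print(manual_mean)   # 78.6
print(library_mean)  # 78.6
```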

Let’s look at an example drawn from social media. Suppose you look at five recent posts and see these numbers of likes: 20, 35, 25, 40, and 50. The mean would be calculated as: (20 + 35 + 25 + 40 + 50)/5 = 170/5 = 34

Means or averages reduce a large amount of data to a single number, which gives a convenient summary but can obscure the whole picture.

Remember: in nearly all data sets, there are values far above and below the average!

Outlier: A value that is very different from (far above or far below) most others in the series. The farther an outlier value is from the mean, the greater its impact on the mean.

Spread: The distribution of data around the mean. Also called dispersion.

The greater the average absolute distance of observations from the mean, the greater the dispersion or spread.
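One way to make "average absolute distance" concrete is to compute it directly. The helper below is a hypothetical illustration (the function name is not from this guide), applied to the test scores and the like counts used in the examples above.

```python
def mean_absolute_distance(values):
    """Average absolute distance of each value from the mean."""
    center = sum(values) / len(values)
    return sum(abs(v - center) for v in values) / len(values)

test_scores = [63, 74, 80, 85, 91]   # mean 78.6
post_likes = [20, 35, 25, 40, 50]    # mean 34

print(round(mean_absolute_distance(test_scores), 2))  # 8.08
print(round(mean_absolute_distance(post_likes), 2))   # 9.2
```

A larger result means the values sit farther from their mean on average, which is to say the data are more spread out.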

The Median

Definition: The value exactly halfway up a ranked data series, or the average of the two central values if there is an even number of observations. In the test scores data set from the mean example above (63, 74, 80, 85, 91), the median score is 80.

Very large or small outliers have no more impact on the median than any other values, because they do not affect the ranking of cases. Let’s say the test scores were instead: 5, 74, 80, 85, 100. The median would still be 80.

Again, we can also look to an example from social media. Using the same five posts (ordered as 20, 25, 35, 40, 50), the median is 35, the middle number. The median would not change if the most popular post had received 100 likes instead of 50, whereas the mean would rise substantially, from 34 to 44.

Here is a bar chart illustrating the likes for the five social media posts, with the mean and median lines included. The green dashed line represents the mean (34), and the blue dashed line represents the median (35). This highlights how the mean and median offer slightly different perspectives on "typical" values in the dataset.
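The difference in how the mean and median respond to an unusual value can be checked with a short sketch. It assumes, purely for illustration, that the 50-like post is replaced by one with 100 likes; the variable names are hypothetical.

```python
from statistics import mean, median

likes = [20, 35, 25, 40, 50]
# Assumption for illustration: the most-liked post gets 100 likes instead of 50
likes_with_outlier = [20, 35, 25, 40, 100]

# median() sorts the values internally, so the input order does not matter
print(median(likes), mean(likes))                            # median 35, mean 34
print(median(likes_with_outlier), mean(likes_with_outlier))  # median 35, mean 44
```

The median stays at 35 because the ranking of the middle value is unchanged, while the mean moves by 10 likes.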

The Mode (Most Frequent Value)

Definition: The mode is the value that appears most frequently in a dataset. For example, imagine these are the likes on five other posts: 10, 20, 20, 15, 25. Here, the mode is 20 since it appears more often than any other number.

This visualization provides a clear interpretation of the frequency of likes in each bin, with the mode (20 likes) highlighted by the orange dashed line. You can learn more about histograms further along in this guide.
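The mode of the five like counts above can also be found by counting how often each value occurs. The sketch below uses Python's standard library; the variable names are illustrative.

```python
from collections import Counter
from statistics import mode

likes = [10, 20, 20, 15, 25]

# Count how many times each value appears
counts = Counter(likes)
most_common_value, frequency = counts.most_common(1)[0]

print(most_common_value, frequency)  # 20 2
print(mode(likes))                   # 20
```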

Variable Types: NOIR

Nominal: No order or rank among values for the variable

Ordinal: The possible values have a clear sequence or hierarchy

Interval: Numerical data without a true zero (e.g., temperature in Celsius)

Ratio: Numerical data with a true zero (e.g., age, height from ground level)

Variation in Data

The simple descriptive statistics described above are helpful, but cannot tell the whole story. Data values may be very close to the mean or median, but they might also be spread out such that most values are far from the mean and few are near it. In these instances, the mean and median may not accurately communicate the data. This is why we care about "variation" in data.

The two charts above both have a red line representing the mean of 7. Values can be close to the mean, or very far from it. We see that the chart on the right has a much greater spread, while the chart on the left has more closely clustered values. The vertical distance between a data point or individual value and the mean value is called the "residual." In these charts the residual is represented by the green arrows.
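Residuals are simple to compute: subtract the mean from each value. The two datasets below are hypothetical (the values plotted in the charts are not given in the text) and were chosen only so that both have a mean of 7, one tightly clustered and one widely spread.

```python
# Hypothetical data, both with a mean of 7
clustered = [6, 7, 8, 6, 8]
spread_out = [1, 13, 4, 10, 7]

def residuals(values):
    """Distance of each value from the mean (negative means below the mean)."""
    center = sum(values) / len(values)
    return [v - center for v in values]

print(residuals(clustered))   # [-1.0, 0.0, 1.0, -1.0, 1.0]
print(residuals(spread_out))  # [-6.0, 6.0, -3.0, 3.0, 0.0]
```

Both lists of residuals sum to zero, but the second set is much larger in size, which is what a greater spread looks like numerically.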

Standard Deviation

The standard deviation is a summary statistic that measures how dispersed the values of a variable are around the mean. It can be thought of as the "average" distance of the data points from the mean. Values that are further from the mean contribute more to the standard deviation. Calculating standard deviation involves multiple steps: subtracting the mean from each number, squaring the differences, averaging the squared differences, and taking the square root of this new average. (Excel's STDEV treats the data as a sample, so in the averaging step it divides by one less than the number of values.) Software like Excel makes this easy with its built-in STDEV function. As you can see in the image, entering =STDEV with a data range automatically gives the standard deviation.
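The steps above can be sketched in Python using the test scores from the mean example. This is an illustration under one stated assumption: the sample version of the formula (dividing by n - 1) is used, since that is what Excel's STDEV and Python's statistics.stdev compute. The population version (dividing by n) is shown for comparison.

```python
from math import sqrt
from statistics import stdev

scores = [63, 74, 80, 85, 91]
n = len(scores)

# Step 1: subtract the mean from each number
mean_score = sum(scores) / n
differences = [x - mean_score for x in scores]

# Step 2: square the differences
squared = [d ** 2 for d in differences]

# Step 3: average the squared differences
# (sample version divides by n - 1, population version divides by n)
sample_variance = sum(squared) / (n - 1)
population_variance = sum(squared) / n

# Step 4: take the square root
sample_sd = sqrt(sample_variance)
population_sd = sqrt(population_variance)

print(round(sample_sd, 2))      # 10.74
print(round(population_sd, 2))  # 9.6
print(round(stdev(scores), 2))  # 10.74, the library's sample standard deviation
```

The sample result matches the 10.74 quoted for these scores in the outlier example.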

Outliers

An outlier is a data point that falls well outside the normal range of values. Outliers have a major impact on the mean, but much less on statistics such as the median. Because the standard deviation is calculated from each value's squared distance from the mean, just one or two outliers can make the standard deviation much larger.

An example: five students take a test and their scores are {63, 74, 80, 85, 91}, so the standard deviation is 10.74. This makes sense: the standard deviation is not that large because the values are clustered closely around the mean. Let's see what happens if we change the lowest score, 63, to an extreme outlier of 5. Even though only this one score changed, the standard deviation jumps to 35.22! The single outlier value is so far from the mean that it makes the standard deviation over three times larger.
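These figures can be checked with Python's statistics.stdev, which computes the sample standard deviation. The sketch assumes the score replaced by the outlier is the 63, which is the substitution that reproduces the 35.22 quoted above.

```python
from statistics import stdev

original = [63, 74, 80, 85, 91]
# Assumption: the score of 63 is the one replaced by the outlier 5
with_outlier = [5, 74, 80, 85, 91]

sd_original = stdev(original)
sd_outlier = stdev(with_outlier)

print(round(sd_original, 2))               # 10.74
print(round(sd_outlier, 2))                # 35.22
print(round(sd_outlier / sd_original, 1))  # 3.3
```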

It's important to watch out for outliers when interpreting summary statistics. A few very high or low values can inflate the standard deviation, making it seem like there is more variation among the typical values than is really present.

Quartiles and the Interquartile Range

We know that the spread of the values is important, and that outliers can have hidden impacts on summary statistics. One way to measure the spread of data is to sort the values and divide them into four equal groups at specific cut points called "quartiles". The first quartile is the middle number of the lower half of the dataset. The second quartile is the median. The third quartile is the middle number of the upper half of the dataset.

The interquartile range, or "IQR", measures the distance between the third and first quartiles - it spans the middle 50% of the data. By subtracting the lower quartile from the upper quartile, you get the IQR. Compared to standard deviation, the IQR focuses on the middle data and is much less affected by outliers. It's useful for describing the spread when you have outliers or skewed distributions.
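The quartile method described above (the median of each half) can be sketched as follows. The function names and the nine scores are hypothetical, chosen only for illustration. Spreadsheets and statistics packages may use slightly different quartile conventions, so their results can differ a little from this approach.

```python
from statistics import median, stdev

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of values, the overall median is left out of both halves.
    Requires at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(ordered), median(upper_half)

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

# Hypothetical scores, and the same scores with the lowest replaced by an outlier
scores = [55, 63, 70, 74, 80, 85, 88, 91, 95]
scores_with_outlier = [5, 63, 70, 74, 80, 85, 88, 91, 95]

print(quartiles(scores))         # (66.5, 80, 89.5)
print(iqr(scores))               # 23.0
print(iqr(scores_with_outlier))  # 23.0, unchanged by the outlier

# The standard deviation, by contrast, grows when the outlier is present
print(stdev(scores) < stdev(scores_with_outlier))  # True
```

In very small datasets an outlier can still shift a quartile, because each half contains only a few values.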