SERP • Strategic Education Research Partnership

How should students describe the center and spread of a data set?

The center of a data set is a single number that we can use to stand in for the whole data set. You can usually think of it as a "typical" value. We usually use mean or median for the center. The median is often a better "typical" value if the distribution is skewed (that is, grossly asymmetrical).

We use the center to stand in for the whole data set—but the values do not, in fact, all lie at the center. How far away is a typical value from the center? That's the spread. We typically use interquartile range (usually with median) or mean absolute deviation (MAD, usually with mean) as the spread. Students will learn about standard deviation (SD) in high school. You can read more about center and spread below.

A box plot (or box and whisker plot) is a good way to visualize a distribution of data, and it shows the median and interquartile range. The video below describes how to make a box plot.

More Box-Plot Details: Even Numbers of Points

If there is an even number of data points, there is no single "middle" value. In that case, the median is the mean of the two middle values. So the median of {1, 2, 3, 4} is 2.5, and the median of {2, 5, 5, 17} is 5.
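The rule can be sketched in a few lines of Python. This is only an illustration; the function name median_of is made up for this example, and it assumes the data set is not empty.

```python
def median_of(values):
    """Return the median; for an even count, average the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value.
        return ordered[mid]
    # Even count: take the mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


first = median_of([1, 2, 3, 4])     # the two middle values are 2 and 3
second = median_of([2, 5, 5, 17])   # the two middle values are 5 and 5
print(first, second)
```

Python's built-in statistics.median follows the same rule, so it can be used to check the result.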

People do not universally agree on the procedure for finding the first and third quartiles, Q1 and Q3. Q1 is the median of the lower half—but does that lower half include the "big" median or not? For this article, we say "no." This way, the points "in the box," where the values are in the range from Q1 to Q3 inclusive, will always include at least half of the data set.
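Here is a minimal Python sketch of the convention used in this article, where the "big" median is left out of both halves when the number of points is odd. The function name quartiles_excluding_median and the sample data are assumptions made for illustration, and the sketch assumes at least two data points.

```python
from statistics import median


def quartiles_excluding_median(values):
    """Return (Q1, median, Q3), leaving the overall median out of both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value; with an even count,
    # the upper half simply starts at the midpoint.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [1, 3, 4, 7, 8, 10, 15]  # hypothetical data set
q1, med, q3 = quartiles_excluding_median(data)

# Count the points "in the box": values from Q1 to Q3 inclusive.
in_box = [v for v in data if q1 <= v <= q3]
print(q1, med, q3, len(in_box), "of", len(data))
```

For this data set, Q1 is 3 and Q3 is 10, and five of the seven points fall in the box, which is more than half. Other software may use a different quartile convention and give slightly different values.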

More about Center and Spread

If you had to choose a single number to represent an entire distribution, you’d pick a number near the center. If the distribution is symmetric (or close to symmetric), the mean and median will be close to one another, and either one will work as a representative of the whole group. If the distribution is skewed, however, the mean and median will be different; in that case, the median is often closer to what we think of as a “typical” value.

You can think about it this way: suppose we have ten people. Nine of them have one kumquat each, and the tenth has 91 kumquats. The total is 100, so the mean number of kumquats is 10. We would probably not say that a "typical" person has 10 kumquats. Here, a typical person has the median: one.

Now is a good time to mention the mode: it's the most numerous value (one kumquat, in this case). Old-school stats curricula always mention "mean, median, and mode." But for a number of reasons, the mode turns out not to be so useful, so it's not in the Common Core. The term lingers, however, when we describe distributions. If a distribution has two humps, we call it bimodal.
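The kumquat example is easy to check with Python's standard statistics module. The variable names below are just for illustration.

```python
from statistics import mean, median, mode

# Nine people with one kumquat each, and one person with 91.
kumquats = [1] * 9 + [91]

kumquat_mean = mean(kumquats)      # total of 100 shared among 10 people
kumquat_median = median(kumquats)  # mean of the two middle values, both 1
kumquat_mode = mode(kumquats)      # the most numerous value

print("mean:", kumquat_mean)
print("median:", kumquat_median)
print("mode:", kumquat_mode)
```

The mean comes out to 10, while the median and the mode are both 1, which matches our sense of what a "typical" person in this group has.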

To be quantitative about spread, we need a single number for it, just as we have one for the center. The Standards choose the Mean Absolute Deviation for symmetric distributions (which leads to the Standard Deviation in a later grade) and the Interquartile Range for skewed distributions.

Interquartile Range (IQR): The interquartile range arises from the five-number summary, which is also the set of numbers you need to make a box plot. These numbers are the minimum, the first quartile (a.k.a. Q1), the median, the third quartile (Q3) and the maximum. The IQR is the difference between the third and first quartiles, or Q3 – Q1. The IQR is also, therefore, the width of the box in the box plot—the range that encompasses the middle half of the data. The upshot is that, if the data are spread out, the IQR is big; if the data are tightly clustered, the IQR is small.
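As a rough sketch, the five-number summary and the IQR could be computed like this in Python. The function names and both data sets are hypothetical, and the quartiles follow this article's convention of leaving the median out of the two halves; the sketch assumes at least two data points.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a data set."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Leave the overall median out of the upper half when the count is odd.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def interquartile_range(values):
    """Return Q3 - Q1, the width of the box in a box plot."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


spread_out = [1, 4, 9, 12, 18, 25, 30]    # hypothetical, widely spread
clustered = [14, 15, 15, 16, 16, 17, 17]  # hypothetical, tightly clustered

print(five_number_summary(spread_out), interquartile_range(spread_out))
print(five_number_summary(clustered), interquartile_range(clustered))
```

The spread-out data set has an IQR of 21, while the clustered one has an IQR of 2.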

Mean Absolute Deviation (MAD): The basic question behind this topic is, “How far is a typical point from the center?” You measure how far each point is from the mean (each point’s “deviation”), take the absolute value to make sure that distance is positive, then average those numbers. The result—the Mean Absolute Deviation—is a measure of spread. Just like the IQR, if the distribution is spread out, the MAD is big; if it’s tightly clustered, MAD is small.
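The steps above translate almost word for word into Python. This is a sketch only; the function name mean_absolute_deviation and the two small data sets are made up for illustration.

```python
from statistics import mean


def mean_absolute_deviation(values):
    """Average distance of the data points from their mean."""
    center = mean(values)
    # Each point's deviation, made positive with the absolute value.
    distances = [abs(v - center) for v in values]
    return mean(distances)


spread_out = [2, 4, 6, 8]  # hypothetical data, mean 5
clustered = [4, 5, 5, 6]   # hypothetical data, mean 5

print(mean_absolute_deviation(spread_out))
print(mean_absolute_deviation(clustered))
```

Both data sets have a mean of 5, but the MAD is 2 for the first and 0.5 for the second, because the second set sits closer to its center.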

Out in the world, the Standard Deviation (SD) is much more common as a measure of spread for symmetric distributions than the Mean Absolute Deviation. The principle is the same as with the MAD, but the details are tricky. To get positive numbers for distances you square the deviations; then, after you average them, you take the square root. Another issue is that when you average, you divide by n if you have all the data, but by n – 1 if you have only a sample. So we delay using the SD until a later grade.
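For comparison with the MAD, here is a minimal Python sketch of the standard deviation, including the choice between dividing by n and by n – 1. The function name, the sample flag, and the data are assumptions made for this example.

```python
from math import sqrt


def standard_deviation(values, sample=False):
    """Square the deviations, average them, then take the square root.

    Divides by n when the values are all the data (sample=False),
    and by n - 1 when they are only a sample (sample=True).
    """
    n = len(values)
    center = sum(values) / n
    squared_deviations = [(v - center) ** 2 for v in values]
    divisor = n - 1 if sample else n
    return sqrt(sum(squared_deviations) / divisor)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean 5

print(standard_deviation(data))               # dividing by n
print(standard_deviation(data, sample=True))  # dividing by n - 1
```

For this data set the squared deviations add up to 32, so dividing by n = 8 gives an SD of exactly 2, and dividing by n – 1 = 7 gives a slightly larger value. Python's statistics.pstdev and statistics.stdev make the same two calculations.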

What about outliers?

The standards include “describing any overall pattern and any striking deviations from the overall pattern with reference to the context in which the data were gathered.”

Data points that are “striking deviations” are called “outliers.” When you’re analyzing data, you will need to use your judgment to determine whether a point is in fact an outlier, and whether to include it in the data set or put it aside.

Suppose you were studying heights and gathered a group of 10 random adults, and just by chance the group included Yao Ming (7’6”) or Peter Dinklage (4’5”). In a dot plot, as in real life, either man would stand out.

If you were calculating the mean height or the MAD, either man would have a large effect. They would, in fact, mess up your calculation, giving you the wrong idea about the center and spread of the whole population. So to estimate the mean height of adults, you should omit the outlier.

Notice, though, that neither Yao nor Dinklage would affect the median or IQR much, if at all. So we say that the median and the IQR are “resistant” to outliers, while mean, MAD, and SD are “sensitive” to outliers.
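A small Python experiment makes the difference visible. The nine "ordinary" heights below are invented for illustration; only the 90 inches (7’6”) comes from the example above. The helper names are hypothetical, and the quartiles leave the median out of the two halves, as elsewhere in this article.

```python
from statistics import mean, median


def iqr(values):
    """Q3 - Q1, with the overall median left out of both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(upper) - median(lower)


def mad(values):
    """Mean absolute deviation from the mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


# Heights in inches: nine made-up adults, then the same group plus 7'6".
typical = [62, 64, 65, 66, 67, 68, 69, 70, 72]
with_outlier = typical + [90]

for label, heights in [("without outlier", typical), ("with outlier", with_outlier)]:
    print(label)
    print("  mean:", mean(heights), " MAD:", round(mad(heights), 2))
    print("  median:", median(heights), " IQR:", iqr(heights))
```

With these numbers, adding the one very tall person raises the mean from 67 to 69.3 inches and roughly doubles the MAD, while the median moves only from 67 to 67.5 and the IQR stays at 5.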