Connect The Dots — Standard Deviation

In most introductory probability and statistics courses, students are taught three measures of central tendency (arithmetic mean, median, mode) and one measure of “spread”, namely standard deviation. The lowercase Greek letter sigma, \(\sigma\), is typically used to represent population (theoretical) standard deviation and the Latin letter \(s\) is used to represent sample (empirical) standard deviation. The formula, as it were, for standard deviation is given as the square root of variance, and variance is represented as \(\sigma^{2}\) or \(s^{2}\) depending on the data (theoretical vs. estimated).

One of the things that I see with students who have taken a basic probability and statistics course (typically, a non-Calculus based one) is that they are required to memorize the formula for computing standard deviation of a sample of data. And the formula they are given for variance looks something like this:
$$s^{2} = \frac{n\sum_{i=1}^{n}x_{i}^{2} - \Big(\sum_{i=1}^{n}x_{i}\Big)^{2}}{n(n-1)}$$

Holy hell. For most students, this is panic inducing. They have to memorize that?? For all intents and purposes that formula is completely arbitrary.

From here, the math misery continues because they are then given a set of steps to follow in order to compute standard deviation. These steps go something like this.

  1. Add all the \(x_{i}\) and then square that sum.
  2. Square each \(x_{i}\) and sum the squares.
  3. Multiply the sum of squares by \(n\).
  4. Subtract the square of the sum from \(n\) times the sum of the squares.
  5. Divide by \(n(n-1)\).
  6. Finally take the square root to get \(s\).
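For what it's worth, the six steps translate almost line for line into code. Here is a minimal Python sketch; the function name `sample_std_shortcut` and the sample data are made up for illustration and are not taken from any particular textbook.

```python
import math

def sample_std_shortcut(xs):
    """Sample standard deviation computed by following the six steps."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    square_of_sum = sum(xs) ** 2             # step 1: add, then square the sum
    sum_of_squares = sum(x * x for x in xs)  # step 2: square, then add
    scaled = n * sum_of_squares              # step 3: multiply by n
    numerator = scaled - square_of_sum       # step 4: subtract
    variance = numerator / (n * (n - 1))     # step 5: divide by n(n-1)
    return math.sqrt(variance)               # step 6: square root gives s

# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_std_shortcut(data))
```

The code runs fine, and that is sort of the problem: nothing in it hints at distance or spread.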

Then there are probably a few worked out examples showing the steps. But this makes no sense!! Why is that ugly disaster above a measure of spread?? You certainly can’t tell by looking at the formula.

Damn It, It’s Pythagoras

This is the formula that students should learn when introduced to variance and standard deviation.

$$s^{2} = \frac{(x_{1}-\bar{x})^{2} + (x_{2}-\bar{x})^{2} + \cdots + (x_{n}-\bar{x})^{2}}{n-1}$$

Most students, by the time they’ve taken an introductory probability and statistics course, have already had some experience with coordinate geometry and the Pythagorean theorem. Namely, the oft recited formula (regardless of the presence of a picture of a right triangle) is $$a^{2} + b^{2} = c^{2}$$

And this is exactly how we ought to first introduce variance — by appealing to the Pythagorean theorem.

We have to talk about distance and we ought to give our students a glimpse of how the Pythagorean theorem extends to more than two dimensions. The easiest way to convince them is probably to either assert or show that if we had a box and wanted to find the length of the long diagonal, then that length would be given by $$x^{2} + y^{2} + z^{2} = c^{2}$$ with \(x,y,z,c\) having the ‘conventional’ meaning.

From here, we can move to a three-dimensional Cartesian coordinate system and show that the square distance from a point \((x,y,z)\) to another (reference) point \((x_{0},y_{0},z_{0})\) is just
$$d^{2} = (x-x_{0})^{2} + (y-y_{0})^{2} + (z-z_{0})^{2}$$

and huzzah! That is starting to look like the numerator of \(s^{2}\) (the second version). All we need to do is get students to buy in to the \(n\)-dimensional extension of the Pythagorean theorem as
$$d^{2} = (x_{1}-\hat{a_{1}})^{2} + (x_{2}-\hat{a_{2}})^{2} + \cdots + (x_{n}-\hat{a_{n}})^{2}$$

Convincing here is often not too hard. For the introductory course, we can just assert that it is so. In more advanced courses, we may want to talk about hyperboxes or simplexes. And we can give this a more general name — Euclidean distance. But anyway that’s not the point.
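If it helps the convincing, the square distance can be sketched in a few lines of Python that work the same way in any number of dimensions. The function name `squared_distance` and the 3-by-4-by-12 box are assumptions picked for illustration.

```python
import math

def squared_distance(point, reference):
    """Sum of squared coordinate differences (the n-dimensional d^2)."""
    if len(point) != len(reference):
        raise ValueError("points must have the same number of coordinates")
    return sum((p - r) ** 2 for p, r in zip(point, reference))

# Long diagonal of a box with sides 3, 4 and 12, measured from the origin.
d2 = squared_distance((3, 4, 12), (0, 0, 0))
print(d2, math.sqrt(d2))  # 169 13.0
```

Nothing in the function cares whether there are two, three, or fifty coordinates, which is the whole appeal.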

The point is once we have \(d^{2}\) as above, we can start the discussion about spread.

Spread

Now, too often, I’ve heard that teachers simply skip the discussion about spread altogether and march towards computation. But this misses the point completely. We have a great chance to talk about how to look at our data.

Here we can connect some dots because typically earlier in the course there was likely a discussion about interquartile range (IQR). So, a good lead off question is simply, “How should we measure spread in our data?”. Odds are, students will come up with “the difference between the largest and smallest value”. This is a good measure of spread and we can toss up a sample data set to show what some of the limitations are. Consider for example, the following:
$$[1,2,2,2,2,2,2,2,2,100]$$

Here the spread is 99. Is that a good measure? The average happens to be 11.7 and from here we can have a debate. Some will say that the 99 is not meaningful, while others will say the opposite. There’s not a wrong answer here. A data set with \(N=10\), a range of 99, and an average of 11.7 does tell me something. It could mean that my data points are either clumped with an outlier or there’s bimodality. For example, this data set also has an average of 11.7 with a range of 99.

$$[1,1,1,1,1,1,1,1,9,100]$$

But so does this one
$$[-49,1,1,1,1,26,26,26,34,50]$$

So in some sense, this “maximum distance” metric isn’t telling a large enough story. Maybe there’s another measure? And here the instructor can begin to nudge students toward thinking about total distance. Or even just point back to the earlier work with Pythagoras and coax out or just state, “what about distance from the ‘center’ of the data?”.
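To make the comparison concrete, here is a rough Python sketch that takes the three data sets above and reports the average, the range, and the total square distance from the average. Treating the average as the center is an assumption at this point in the discussion, and the helper names are invented for illustration.

```python
# The three data sets from the discussion above.
datasets = {
    "first": [1, 2, 2, 2, 2, 2, 2, 2, 2, 100],
    "second": [1, 1, 1, 1, 1, 1, 1, 1, 9, 100],
    "third": [-49, 1, 1, 1, 1, 26, 26, 26, 34, 50],
}

def mean(xs):
    return sum(xs) / len(xs)

def value_range(xs):
    return max(xs) - min(xs)

def total_square_distance(xs):
    """Square distance from the data to the point (mean, mean, ..., mean)."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs)

for name, xs in datasets.items():
    print(f"{name}: mean={mean(xs):.1f}, range={value_range(xs)}, "
          f"total square distance={total_square_distance(xs):.1f}")
```

All three share the same average and range, but the total square distance comes out to roughly 8664, 8720 and 6720 respectively. So the third data set, the one without a single far-flung outlier, is told apart from the other two.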

And again, we can meaningfully discuss what the center is. If we want to hand wave a bit and just say, “for a lot of technical reasons the central measure we want to consider in this class is the average.”, that’s not a problem.

And now, the second to last step, we revisit Pythagoras and replace our “reference point” by \((\bar{x},\bar{x},\ldots,\bar{x})\) to give

$$d^{2} = (x_{1}-\bar{x})^{2} + (x_{2}-\bar{x})^{2} + \cdots + (x_{n}-\bar{x})^{2}$$

There should be no real leap that students have to take to understand this equation as just a symbolic variation of

$$d^{2} = (x_{1}-\hat{a_{1}})^{2} + (x_{2}-\hat{a_{2}})^{2} + \cdots + (x_{n}-\hat{a_{n}})^{2}$$

And now for the coup de grace. Our equation that we’ve teased out is
$$d^{2} = (x_{1}-\bar{x})^{2} + (x_{2}-\bar{x})^{2} + \cdots + (x_{n}-\bar{x})^{2}$$

But it’s total square distance. We can again appeal to reason and state that, well, if we had more points, odds are we’d have greater total distance because we’re adding more non-negative terms. So really, what we’d like to do is smooth things out a bit and get an average square distance (if you’re thinking something, hold yer horses). We could then go with

$$c^{2} = \frac{(x_{1}-\bar{x})^{2} + (x_{2}-\bar{x})^{2} + \cdots + (x_{n}-\bar{x})^{2}}{n}$$

And once again, we can jump a little and argue that dividing by \(n\) may not be the best thing to do. Instead we should divide by \(n-1\). We can reason like so: in order to compute \(d^{2}\) we needed to use our original data to compute \(\bar{x}\) and then we needed our original data again to compute square differences from \(\bar{x}\). Thus, we should pay a penalty of “one data point”. Or we could argue that since we have \(n\) data points and we have computed \(\bar{x}\), then we have, in some sense, “too much information” because we could lose a data point and be able to recover it from knowledge of \(\bar{x}\). Put another way, the differences \(x_{i}-\bar{x}\) always sum to zero, so only \(n-1\) of them are free to vary. Similarly, if we also knew \(d^{2}\), then we could lose a second data point and be able to recover it. Thus, rather than dividing \(d^{2}\) by \(n\), we ought to divide by \(n-1\). Doing so gives

$$s^{2} = \frac{(x_{1}-\bar{x})^{2} + (x_{2}-\bar{x})^{2} + \cdots + (x_{n}-\bar{x})^{2}}{n-1}$$
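As a quick sanity check, the two divisors can be compared in Python against the standard library's `statistics.pvariance` (divide by \(n\)) and `statistics.variance` (divide by \(n-1\)). This is only a sketch; the data is the first sample data set from the range discussion and the variable names mirror the symbols in the text.

```python
import statistics

data = [1, 2, 2, 2, 2, 2, 2, 2, 2, 100]
n = len(data)
x_bar = sum(data) / n

# Total square distance from (x_bar, x_bar, ..., x_bar).
d2 = sum((x - x_bar) ** 2 for x in data)

c2 = d2 / n        # average square distance
s2 = d2 / (n - 1)  # sample variance, paying the "one data point" penalty

print(c2, statistics.pvariance(data))
print(s2, statistics.variance(data))
```

Each printed pair should agree up to floating-point rounding, and `s2` is always a bit larger than `c2` because we divide the same total by a smaller number.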

Now, there’s at least a reason why the formula is the way it is, why it is a measure of spread, and what it is a spread against. One of the hopes, then, is that we’re not forcing our students to memorize an unwieldy and unintuitive formula. We can then present the ugly formula and explain that it is an algebraic reworking of our \(s^{2}\).
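For the curious, the reworking is short. Here is a sketch that uses only the fact that \(\sum_{i=1}^{n}x_{i} = n\bar{x}\). Expanding each square in the numerator gives

$$\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} = \sum_{i=1}^{n}x_{i}^{2} - 2\bar{x}\sum_{i=1}^{n}x_{i} + n\bar{x}^{2} = \sum_{i=1}^{n}x_{i}^{2} - n\bar{x}^{2} = \sum_{i=1}^{n}x_{i}^{2} - \frac{1}{n}\Big(\sum_{i=1}^{n}x_{i}\Big)^{2}$$

Dividing by \(n-1\) and multiplying top and bottom by \(n\) then gives

$$s^{2} = \frac{n\sum_{i=1}^{n}x_{i}^{2} - \Big(\sum_{i=1}^{n}x_{i}\Big)^{2}}{n(n-1)}$$

which is the formula students are asked to memorize.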

If you were holding your horses, it may have been because the observant student may ask “why not take the square root first and then divide by \(n\), isn’t that average distance?”. And it’s a perfectly valid question to which, if we were so inclined and prepared, we could give a mini-hand-waving type explanation about the \(L^{2}\) metric. Or we can do what most textbooks do: “the discussion is beyond the scope of this class, so just trust.”. 🙂

If the class is sufficiently advanced, we can even delve into why the distance ought to be from the average.
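One possible way to make that argument, sketched under the assumption that the reference point has the same value \(a\) in every coordinate: write \(x_{i}-a = (x_{i}-\bar{x}) + (\bar{x}-a)\) and expand. The cross term vanishes because \(\sum_{i=1}^{n}(x_{i}-\bar{x}) = 0\), which leaves

$$\sum_{i=1}^{n}(x_{i}-a)^{2} = \sum_{i=1}^{n}(x_{i}-\bar{x})^{2} + n(\bar{x}-a)^{2}$$

The second term is never negative and is zero only when \(a = \bar{x}\). So among all reference points of the form \((a,a,\ldots,a)\), the average is the one closest to the data in total square distance.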

In any case, what we’ve done is connected a few dots together or at least shown that the material is not completely disjointed from other things they’ve learned.
