The vast majority of us have been taught to compute the average of a data set by “adding up all the numbers in the data set and dividing that sum by the total number of numbers in the data set.” For example, to compute the average of \(\{1,2,3,6\}\), we would first find that \(1 + 2 + 3 + 6 = 12\); then, recognizing that there are four numbers in the data set, we would compute \(12 \div 4 = 3\). Thus, the average is \(3\).
In a condensed mathematical notation, if we let \(x_{i}\) represent the \(i^{th}\) data point in a data set of \(N\) points, then the average (also called the arithmetic mean) is written as $$\bar{x} = \frac{\sum_{i = 1}^{N}x_{i}}{N}$$
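If you like to follow along in code, here is the same computation as a quick sketch (I’ll use Python for these asides throughout; the language choice is purely illustrative):

```python
# The "grade school" average: add up the numbers,
# then divide by how many numbers there are.
data = [1, 2, 3, 6]
average = sum(data) / len(data)
print(average)  # 3.0
```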
So the question is, “Why do we calculate the average the way we do?” The answer can be a bit complicated, but in short, we don’t have to calculate it this way. In fact, depending on the type of data we have, its average can be computed differently. The grade school formula we were taught works in a certain set of cases, and those cases happen to be what the majority of people deal with on a regular basis.
This formula
$$\bar{x} = \frac{\sum_{i = 1}^{N}x_{i}}{N}$$
does something special, geometrically. The trick to understanding what’s going on is to ask the following question about our numerical example:
“Is there one number such that the total ‘distance’ from that number to all the numbers in the data set (where my data set is \(\{1,2,3,6\}\)) is minimal?”
Now, answering this question requires us to first agree on how we want to measure distance. It sounds like a simple question, but resolving how we want to compute distance is at the heart of understanding why “average” is the way it is.
Enter Pythagoras
Pythagoras?! Yes, Pythagoras, as in the formula we have all been told to know by heart: \(a^{2} + b^{2} = c^{2}\). We know that this formula relates the lengths of the two legs of a right triangle (\(a\) and \(b\)) to the hypotenuse, \(c\) (the longest side of the triangle). Thus, once we know \(c^{2}\), we have \(c\) via a square root.
We’ve also had burned into our memories things like “3, 4, 5” triangles because \(3^{2} + 4^{2} = 5^{2}\). But now I digress. What do triangles have to do with averages? Actually, nothing. It’s the Pythagorean Formula (\(a^{2} + b^{2} = c^{2}\)) that is the key to all this.
This formula is how we want to measure distance. Specifically, \(c = \sqrt{a^{2} + b^{2}}\).
A Numerical Example
Let’s consider our data set, \(\{1,2,3,6\}\). We computed the average to be \(3\). Now, and humor me here, let’s say we wanted to know the “Pythagorean distance” of our data points to our average. To find this, we simply have to extend the Pythagorean Formula, sum the square distances, and take a square root. In other words, we want to do this:
$$\begin{eqnarray*}
(1-3)^{2} + (2-3)^{2} + (3-3)^{2} + (6-3)^{2} & = & 4 + 1 + 0 + 9\\
& = & 14\\
& = & c_{\bar{x}}^{2}
\end{eqnarray*}$$
Thus, our distance is \(c_{\bar{x}} = \sqrt{14} \approx 3.741657\). To keep the numbers “friendly”, we’ll stick with \(c_{\bar{x}}^{2} = 14\).
Here is the interesting part. What would \(c^{2}\) be if we chose a different estimate for our average? For example, what would \(c^{2}\) be if we thought that our average should be \(4\) (we’ll call this \(c_{4}^{2}\))? Well, here’s what we’d get.
$$\begin{eqnarray*}
(1-4)^{2} + (2-4)^{2} + (3-4)^{2} + (6-4)^{2} & = & 9 + 4 + 1 + 4\\
& = & 18\\
& = & c_{4}^{2}
\end{eqnarray*}$$
Notice that \(c_{4}^{2} > c_{\bar{x}}^{2}\) since \(18 > 14\).
How about \(c_{2}\)?
$$\begin{eqnarray*}
(1-2)^{2} + (2-2)^{2} + (3-2)^{2} + (6-2)^{2} & = & 1 + 0 + 1 + 16\\
& = & 18\\
& = & c_{2}^{2}
\end{eqnarray*}$$
And \(c_{2}^{2} > c_{\bar{x}}^{2}\) as well, since \(18 > 14\).
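If you would rather not grind through this arithmetic by hand, here is a small Python sketch that reproduces the three sums above (the helper name `sum_of_squares` is just mine, not anything standard):

```python
def sum_of_squares(v, data):
    # Total squared "Pythagorean" distance from v to every data point.
    return sum((x - v) ** 2 for x in data)

data = [1, 2, 3, 6]
print(sum_of_squares(3, data))  # 14 -- the average
print(sum_of_squares(4, data))  # 18
print(sum_of_squares(2, data))  # 18
```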
So here is a challenge to the reader: is it possible to find a value \(v\) such that \((1-v)^{2} + (2-v)^{2} + (3-v)^{2} + (6-v)^{2} < 14\)? The answer is no.
The average, computed as
$$\bar{x} = \frac{\sum_{i = 1}^{N}x_{i}}{N}$$
minimizes the sum of the square differences.
In fact, the reader should pick as many values of \(v\) as they want and, for each value of \(v\), compute \((1-v)^{2} + (2-v)^{2} + (3-v)^{2} + (6-v)^{2}\). The average will always give the smallest sum.
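One way to take up the challenge without pencil and paper is a brute-force scan. This sketch (the grid of candidates and its step size are arbitrary choices of mine) checks ten thousand values of \(v\) and finds that none beats the average:

```python
data = [1, 2, 3, 6]
mean = sum(data) / len(data)

def sum_of_squares(v):
    return sum((x - v) ** 2 for x in data)

# Scan candidate values of v from 0 to 10 in steps of 0.001.
candidates = [k / 1000 for k in range(10001)]
best = min(candidates, key=sum_of_squares)

print(best, sum_of_squares(best))  # 3.0 14.0 -- the winner is the mean
print(mean, sum_of_squares(mean))  # 3.0 14.0
```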
A Proof
We would like to find a value \(v\) such that $$\sum_{i = 1}^{N}(x_{i}-v)^{2}$$ is minimized.
First, write $$\sum_{i = 1}^{N}(x_{i}-v)^{2}$$ as
$$\sum_{i=1}^{N}(x_{i}^{2} - 2vx_{i} + v^{2})$$
which we will call \(f(v)\) and write it as
$$f(v) = Nv^{2} - 2v\sum_{i=1}^{N}x_{i} + \sum_{i=1}^{N}x_{i}^{2}$$
Now, we recognize that \(f(v)\) is a parabola in \(v\) whose global minimum occurs at its vertex (since \(N > 0\)). The vertex of a parabola written as \(ax^{2} + bx + c\) is given as \(x = \frac{-b}{2a}\). In our case, \(a = N, b = -2\sum_{i=1}^{N}x_{i}, c = \sum_{i=1}^{N}x_{i}^{2}\). Thus, \(f(v)\) is minimized at \(v = \frac{-b}{2a} = \frac{\sum_{i=1}^{N}x_{i}}{N}\). Notice that this is just the formula for average (arithmetic mean).
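For readers who like to double-check algebra with software, here is an optional sketch that redoes this minimization symbolically; it assumes the SymPy library is available. With four symbolic data points, solving \(f'(v) = 0\) returns exactly the arithmetic mean:

```python
import sympy as sp

v = sp.symbols('v')
xs = sp.symbols('x1 x2 x3 x4')

# f(v) is the sum of squared differences from v.
f = sum((x - v) ** 2 for x in xs)

# Minimize the parabola: solve f'(v) = 0 for v.
print(sp.solve(sp.diff(f, v), v))  # [x1/4 + x2/4 + x3/4 + x4/4]
```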
Summary
Here is basically what you should take away from this post.
- The arithmetic mean is not an arbitrary formula.
- The arithmetic mean minimizes a specific distance measure.
- If you are using the arithmetic mean as a statistic to report, ask yourself if this makes sense. You may be more interested in a different type of average if you are minimizing a different distance measure. For example, search for “geometric mean” to get an idea of another type of “average” and where it is used.
- For the reader who remembers or has had exposure to statistics, our Pythagorean distance measure is very closely tied to “standard deviation” and “variance”. Recall that the population variance is just
$$\mbox{var} = \sigma^{2} = \frac{\sum_{i=1}^{N}(x_{i} - \mu)^{2}}{N}$$ where \(\mu\) is the population mean. Notice the similarity? The division by \(N\) is just “averaging” the total square distance (the numerator). Standard deviation is just \(\sqrt{\mbox{var}} = \sigma\). (A short numerical check of this appears after the list.)
- When we talk about a “line of best fit”, we are in effect doing the same thing that was done here. The arithmetic mean is the “constant of best fit”.
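Here is the numerical check promised above, tying the \(c_{\bar{x}}^{2} = 14\) we computed earlier to the population variance and standard deviation (standard-library Python only):

```python
import math

data = [1, 2, 3, 6]
mu = sum(data) / len(data)

# The numerator of the variance is exactly the minimized
# "Pythagorean" square distance c^2 = 14 from earlier.
total_square_distance = sum((x - mu) ** 2 for x in data)
variance = total_square_distance / len(data)  # population variance
std_dev = math.sqrt(variance)

print(total_square_distance)  # 14.0
print(variance)               # 3.5
print(std_dev)                # 1.8708...
```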
One question you might ask is, “What if, instead of squaring the ‘distances’, we just took their absolute value, in order to ensure they are all positive? What statistic might minimize the sum of the absolute distances?” In equation form, what value (or values) of \(z\) would minimize \(\sum_{k=1}^{N}|x_{k}-z|\)?
I hope that TeX worked.
Yes, the sum of absolute differences often seems a natural choice when we want a “total error”. Here’s what happens:
Let \(\{x_{k}\}_{k=1}^{N}\) be your \(N\) data points. Since real numbers can be ordered, let \(\{y_{k}\}_{k=1}^{N}\) be the \(x_{k}\) sorted. That is, \(y_{1}\) is the smallest value in \(\{x_{1}, x_{2}, \ldots, x_{N}\}\) and \(y_{N}\) is the largest value.
Now, if \(z \leq y_{1}\) then \(E_{1} = \sum_{k=1}^{N}|y_{k} - z| = \left(\sum_{k=1}^{N}y_{k}\right) - Nz\), and we can see that this is positive and minimized when \(z = y_{1}\). Similarly, if \(z \geq y_{N}\) then the corresponding error (call it \(E_{N}\)) is minimized when \(z = y_{N}\). So can we do better? Is there a \(z\) with \(y_{1} < z < y_{N}\) that does better than \(z = y_{1}\) or \(z = y_{N}\)?

Suppose \(z\) is such that \(y_{1} < z < y_{N}\). Then \(\exists i\) with \(1 \leq i < N\) such that \(y_{k} \leq z\) for \(k \leq i\) and \(y_{k} > z\) for \(k > i\). So we have $$E = \sum_{k=1}^{i}(z - y_{k}) + \sum_{k = i+1}^{N}(y_{k} - z).$$ Expanding \(E\) and rearranging a bit, we have $$E = iz - (N-i)z - \sum_{k=1}^{i}y_{k} + \sum_{k = i+1}^{N}y_{k} = (2i - N)z - \sum_{k=1}^{i}y_{k} + \sum_{k=i+1}^{N}y_{k}.$$

So what should \(i\) be, then? Look at the coefficient on \(z\): when \(i < \frac{N}{2}\) it is negative, so increasing \(z\) decreases \(E\); when \(i > \frac{N}{2}\) it is positive, so increasing \(z\) increases \(E\). Taking \(i = \frac{N}{2}\) removes \(z\) altogether, and we’re left with \(E = -\sum_{k=1}^{i}y_{k} + \sum_{k=i+1}^{N}y_{k}\). So what does that make \(z\)? It makes \(z\) the median! For example, if \(N\) is even then \(i = \frac{N}{2}\) and any \(z\) with \(y_{\frac{N}{2}} \leq z \leq y_{\frac{N}{2}+1}\) minimizes \(E\); conventionally, \(z\) is taken to be the average of \(y_{\frac{N}{2}}\) and \(y_{\frac{N}{2}+1}\). For \(N\) odd, the minimum occurs at the middle value, \(z = y_{\frac{N-1}{2}+1}\).
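A quick numerical illustration of this on our running data set (a Python sketch; \(N = 4\) is even, so we expect a tie along the whole stretch between the two middle values):

```python
data = [1, 2, 3, 6]

def total_abs_deviation(z):
    return sum(abs(x - z) for x in data)

# Any z between the middle values y_2 = 2 and y_3 = 3 ties for the minimum.
for z in [1.0, 2.0, 2.5, 3.0, 4.0, 6.0]:
    print(z, total_abs_deviation(z))
# 1.0 8.0
# 2.0 6.0
# 2.5 6.0  <- the conventional median, (2 + 3) / 2
# 3.0 6.0
# 4.0 8.0
# 6.0 12.0
```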
So are you suggesting that we run sums of squares of differences in our head? My mental version of what you’re suggesting is to ballpark a guess and then move values from high numbers to low numbers. I visualize this as columns of blocks that I need to level. When the blocks are leveled, then I have my average:
******
***
**
*
The values 2 and 1 need to steal 1 and 2 spare blocks, respectively, from the 6 block row to hit height 3.
*** (- ***)
*** (+ 0)
*** (+ *)
*** (+ **)
Ha! No, I’m not suggesting that we do that computation mentally. I thought it would be an instructive exercise for the reader to try values of \(v\) other than the average and then compute the sums of squares of differences.
By the way, the mental version you give is the minimization of \(\Big|\sum_{k=1}^{N}(x_{k} - v)\Big|\). The \(v\) that minimizes this is \(\frac{\sum_{k=1}^{N}x_{k}}{N}\).
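(A one-line check of that claim, again as a Python sketch: at the mean, the positive and negative differences cancel exactly, which is the block-leveling picture above.)

```python
data = [6, 3, 2, 1]
mean = sum(data) / len(data)  # 3.0

# The signed differences cancel at the mean, so the absolute
# value of their sum is zero -- its smallest possible value.
diffs = [x - mean for x in data]
print(diffs)            # [3.0, 0.0, -1.0, -2.0]
print(abs(sum(diffs)))  # 0.0
```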