Choose Your Sample Size Carefully

In my elementary statistics class we started the usual topics of confidence intervals and hypothesis testing. I decided to give a variation on a standard problem that wasn’t in the course text. In preparation for the usual calculations of standard deviation, standard error, constructing confidence intervals and doing some basic hypothesis testing, I asked my students to compute the sample mean of this data set — thirty simulated rolls of a standard ten-sided die.

$$[6, 4, 10, 3, 5, 4, 9, 5, 6, 5, 1, 3, 10, 4, 10, 1, 2, 2, 4, 1, 10, 9, 5, 5, 8, 10, 7, 3, 10, 3]$$

Now, the theoretical mean (\(\mu\)) is 5.5. As luck would have it, the sample data also has mean 5.5! This was very annoying, as it torpedoed some nice conversations I wanted to have about measurement error, rejection regions, etc. Fortunately, in the previous lecture we had run live computer simulations of dice rolls, and the students recognized that the observed mean could happen to match the theoretical mean. I was able to repair the hull and continue navigating the statistical waters as I had originally charted. (I’m done with the seafaring metaphors.)
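For anyone who wants to check the coincidence, here is a minimal sketch that recomputes both means from the data above. The variable names are my own, and `Fraction` is used only so that the comparison is exact rather than floating-point.

```python
from fractions import Fraction

# The thirty simulated rolls of a ten-sided die quoted in the post.
rolls = [6, 4, 10, 3, 5, 4, 9, 5, 6, 5, 1, 3, 10, 4, 10,
         1, 2, 2, 4, 1, 10, 9, 5, 5, 8, 10, 7, 3, 10, 3]

# Exact sample mean: total of the rolls divided by how many there are.
sample_mean = Fraction(sum(rolls), len(rolls))

# Exact theoretical mean of a fair die with faces 1..10.
theoretical_mean = Fraction(sum(range(1, 11)), 10)

print(len(rolls), sum(rolls))            # 30 165
print(float(sample_mean))                # 5.5
print(sample_mean == theoretical_mean)   # True
```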

I wasn’t counting on this happening. I could have made the probability of it happening exactly zero had I just chosen an odd sample size \(N\): the sample mean equals 5.5 only when the rolls sum to \(5.5N\), which is not an integer when \(N\) is odd. But I chose thirty because that was the textbook’s recommended minimum for using the normal distribution to do all the confidence interval construction gobbledygook (though in practice, I tend to want a larger \(N\) before abandoning the \(t\)-distribution).

What then was the probability of me observing \(\mu = \bar{x}\) for \(N = 30\) rolls of a ten-sided die?

We can figure this out exactly through some slick combinatorics and algebra gymnastics.

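First we recognize that there are \(10^{30}\) equally likely outcomes when rolling thirty ten-sided dice. The sample mean is 5.5 exactly when the rolls sum to \(30 \times 5.5 = 165\), so next we want to ask, “How many ways can we roll thirty ten-sided dice such that their sum is \(165\)?”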

Generating Functions

Notice that $$g(x) = (x^{1} + x^{2} + \cdots + x^{10})^{30}$$ encodes the number of ways of obtaining every possible sum when rolling thirty ten-sided dice: expanding the product means picking one term \(x^{k}\) from each of the thirty factors (the face \(k\) shown by that die), so the coefficient of \(x^{s}\) counts the ways to roll a sum of \(s\). For example, the coefficient of \(x^{30}\) is the number of ways to roll a sum of \(30\), which happens in only one way: all thirty dice have to show one. Since we want to know the number of ways to roll a sum of \(165\), we’re after the coefficient of \(x^{165}\). We say that \(g\) is the generating function for the sum of thirty ten-sided dice.
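To make the encoding concrete, here is a rough sketch that expands \(g(x)\) by brute-force polynomial multiplication, one die at a time. The function name `dice_sum_counts` is hypothetical, and this is not the route the post takes (the algebra that follows avoids expanding anything), but it shows what “the coefficient of \(x^{s}\)” means.

```python
def dice_sum_counts(n_dice, sides):
    """Return a list where entry s is the number of ways that n_dice dice,
    each with faces 1..sides, can show a total of s."""
    coeffs = [1]  # the constant polynomial 1: zero dice, sum 0, one way
    for _ in range(n_dice):
        # Multiply the running polynomial by (x^1 + x^2 + ... + x^sides).
        new = [0] * (len(coeffs) + sides)
        for s, ways in enumerate(coeffs):
            for face in range(1, sides + 1):
                new[s + face] += ways
        coeffs = new
    return coeffs


coeffs = dice_sum_counts(30, 10)
print(coeffs[30])               # 1: every die shows a one
print(sum(coeffs) == 10**30)    # True: the coefficients account for every outcome
print(coeffs[165])              # the count the post is after
```

Python integers are arbitrary precision, so the large coefficients are exact.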

Funambulism (look it up)

First, recognize that
\(\begin{eqnarray*}
(x^{1} + x^{2} + \cdots + x^{10})^{30} & = & x^{30}(1 + x^{1} + \cdots + x^{9})^{30}\\
& = & x^{30}\frac{(1-x^{10})^{30}}{(1-x)^{30}}
\end{eqnarray*}
\)

The last equality follows, with \(n = 10\), from the identity $$(1-x)(1 + x + \cdots + x^{n-1}) = 1-x^{n}$$

Now, let $$f(x) = (1-x^{10})^{30}$$ and let $$h(x) = (1-x)^{-30}$$

We can expand \(f(x)\) as
$$(1-x^{10})^{30} = \sum_{r=0}^{30}(-1)^{r}{30 \choose r}x^{10r}$$

Similarly, we can expand \(h(x)\) as
$$(1 - x)^{-30} = \sum_{r=0}^{\infty}{r + 30 - 1 \choose r}x^{r}$$

Let $$a_{10r} = (-1)^{r}{30 \choose r}$$ and let $$b_{r} = {r + 30 - 1 \choose r}$$
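The expansion of \(h(x)\) is the one that is easiest to get wrong, so here is a small sanity check, offered as a sketch rather than a proof. It relies on the fact that multiplying a power series by \(1/(1-x) = 1 + x + x^{2} + \cdots\) replaces its coefficients with their running sums; doing that thirty times to the series \(1\) should reproduce the \(b_{r}\). The helper name `inverse_power_coeffs` is my own.

```python
from itertools import accumulate
from math import comb


def inverse_power_coeffs(k, n_terms):
    """First n_terms coefficients of the power series 1 / (1 - x)**k."""
    coeffs = [1] + [0] * (n_terms - 1)      # the series 1
    for _ in range(k):
        # Multiplying by 1/(1 - x) turns coefficients into running sums.
        coeffs = list(accumulate(coeffs))
    return coeffs


series = inverse_power_coeffs(30, 136)       # coefficients of x^0 .. x^135

# Compare against b_r = C(r + 30 - 1, r) for every power needed later.
print(all(series[r] == comb(r + 30 - 1, r) for r in range(136)))   # True
```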

Since \(g(x) = x^{30}f(x)h(x)\), the coefficient of \(x^{165}\) in \(g(x)\) is the coefficient of \(x^{135}\) in \(f(x)h(x)\).

How do we get this coefficient? We recognize that $$cx^{135} = a_{0}b_{135}x^{0}x^{135} + a_{10}b_{125}x^{10}x^{125} + \cdots + a_{130}b_{5}x^{130}x^{5}$$

or in more condensed notation and only caring about the coefficients

$$c = \sum_{r = 0}^{13}a_{10r}b_{135-10r}$$

Churning this through a few lines of Python code, we find that $$c = 25228791861003487965956082992$$ and thus, the probability that thirty ten-sided dice will produce a sum of 165 is (truncating for aesthetics) $$\frac{25228791861003487965956082992}{10^{30}} \approx 0.02522879$$
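The post does not show its Python, so the following is only a guess at what such a computation might look like. The function name `ways_to_roll` and its parameters are illustrative; the body is the sum \(\sum_{r} a_{10r}b_{135-10r}\) written out, generalized slightly so that it can be checked on small cases such as two six-sided dice.

```python
from math import comb


def ways_to_roll(total, n_dice=30, sides=10):
    """Number of ways n_dice dice with faces 1..sides can sum to total,
    read off as a coefficient of x^n_dice * (1 - x^sides)^n_dice / (1 - x)^n_dice."""
    shifted = total - n_dice        # account for the x^n_dice factor
    if shifted < 0:
        return 0
    count = 0
    for r in range(min(n_dice, shifted // sides) + 1):
        a = (-1) ** r * comb(n_dice, r)     # coefficient of x^(sides*r) in f
        m = shifted - sides * r             # power that h must supply
        b = comb(m + n_dice - 1, m)         # coefficient of x^m in h
        count += a * b
    return count


c = ways_to_roll(165)
print(c)             # compare with the value of c quoted above
print(c / 10**30)    # probability that the sample mean is exactly 5.5
```

As a quick check on a case that can be counted by hand, `ways_to_roll(7, n_dice=2, sides=6)` returns 6, the familiar number of ways to roll a seven with two ordinary dice.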

What a conveniently coincidental number for my stats class to chew on, even though it has nothing to do with confidence intervals.

2 thoughts on “Choose Your Sample Size Carefully”

  1. Andy Novocin

    Life is like that isn’t it. Speaking as something that exists, I perceive the probability of my existence to be 100%. But a century ago the probability of my existence would have been tiny. It’s hard to imagine running the experiment again. If life could be reran then the odds of Andy Novocin might be tiny while the odds of a Novocin child of my generation, might be more reasonable.

    1. Manan Shah Post author

      The life we experience is one of an uncountable number of simulations of “theory”. I wonder how stable the simulations are or do we just use randomness to explain that which we can’t accurately model?
