I Learned Something!
Have you ever noticed that if you calculate the standard deviation of a set of numbers in Excel, you get an answer that's different from the square root of the mean of the squares minus the square of the mean? It's that (n-1) in the denominator instead of (n), and I never understood why it was there. I had heard all the hand-wavy explanations about how we "don't know the complete sample" and so "lose a degree of freedom", whatever that means, but never really got it.
Until now. And it's really simple, and doesn't require vague technical ideas about "degrees of freedom". So let's all explain it to non-statisticians like this from now on:
To estimate the variance (let's call it s), what we want is the expectation of (x-m)^2, where m is the mean. If the mean is a known constant, linearity of expectation gives s = E(x^2) - m^2. However, if you don't know what the mean actually is, then m isn't a constant. You're estimating it from the sample data, that is, the same data you're using to try to find the variance. Of course, you set m = Sum(x)/n, but then for each data point x_i you are looking at E((x_i - Sum(x)/n)^2) = E(x_i^2 - (2/n) x_i Sum(x) + (1/n^2) Sum(x) Sum(x)). As you can see, you can no longer take the m out as a constant. Instead, you get all these cross-product terms x_i x_j. If the x's are independent draws from the same distribution, then E(x_i x_j) = E(x_i) E(x_j) = mu^2, where mu is the true mean, as long as i is not j. But if i = j, then you get terms that are... E(x^2)! Add up the n squared deviations: the first term of the binomial contributes n E(x^2), the middle term contributes -2 E(x^2) - 2(n-1) mu^2, and the last term contributes E(x^2) + (n-1) mu^2. The i = j terms cancel exactly one of the n E(x^2) terms, and the leftover cross terms come to -(n-1) mu^2, so the total is (n-1)(E(x^2) - mu^2) = (n-1) s. Thus, (n-1).
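If you would rather see the (n-1) than derive it, here is a quick simulation sketch. It is my own illustration, not part of the class proof, and the function name, the sample size of 5, and the choice of a standard normal (true variance 1) are all assumptions made just for the demo. It averages the divide-by-n and divide-by-(n-1) estimates over many small samples.

```python
import random


def average_variance_estimates(n=5, trials=100_000, seed=0):
    """Average two variance estimates over many samples of size n.

    Hypothetical helper for illustration: samples come from a standard
    normal, so the true variance is 1.
    """
    rng = random.Random(seed)
    total_by_n = 0.0
    total_by_n_minus_1 = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(xs) / n  # sample mean, built from the same data
        ss = sum((x - m) ** 2 for x in xs)  # sum of squared deviations
        total_by_n += ss / n
        total_by_n_minus_1 += ss / (n - 1)
    return total_by_n / trials, total_by_n_minus_1 / trials


by_n, by_n_minus_1 = average_variance_estimates()
print("divide by n:    ", by_n)
print("divide by (n-1):", by_n_minus_1)
```

With n = 5, the divide-by-n average should land near (n-1)/n = 0.8 of the true variance, while the divide-by-(n-1) average should land near 1, which is what the argument above predicts.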
My classmates all seem to have a better grasp of these things than I do. My statistics background, it turns out, is not so hot. I actually asked about the (n-1) during a recent class, and my classmate gave me the same vague "degrees of freedom" answer. I only learned the trick above by very persistently asking a stupid question. The proposed proof that we were discussing didn't work the way my classmates presented it because it involved scaling expectations of cross-product terms like those found in the above argument. I thought that the expression simplified easily because I wasn't thinking of the mean as a sum of the same variables we were manipulating. Despite being politely told I was wrong, I pressed the point until the TA finally pointed out this (obvious to everyone else) fact.
There are two important lessons: an estimate computed from a sample affects any other statistic you build on it from that same sample, and sometimes the only way around your ignorance is showcasing it.
