Saturday, March 21, 2009

Statistics (again)

I have commented on statistics on this blog before, and for good reason: statistics are everywhere.

Statistics are vital to science, politics, economics, ...

Mathematically, statistics range from very simple to very complex. The simpler the analysis, the less wiggle room there is for outright lying about results (a standard deviation is what it is, and if it amounts to 30% of the value, the result is not very good). Which brings us to political polling. We've heard lots of it and will hear lots more. The math behind the various modeling parameters and the resulting presented data is all good and valid. The choice of model, and any extrapolation to the motivation behind behavior, on the other hand...

And now the trouble. It is impossible to perfectly predict the behavior of a single hydrogen molecule (H2). Billions and trillions of molecules, however, tend to behave. People are somewhat more complicated than hydrogen (and arguably less explosive, but I'll reserve judgment on that), and they are not available in sufficiently large and uniform numbers for large-number treatment to apply.

In general, 1/sqrt(N) gives an estimate of the deviation from the mean, and it is often the stated error when no error can be propagated for the system. So 10,000 sample points corresponds to a 1% error. In science, 10k is a small number and a 1% deviation from the mean is huge. In polling people, it is a huge number; often ~1000 respondents (about 3% error) are used.

The problem is that that 3% error assumes the sampling is random across all major differences. The 1000 must be completely random, but then some race, or religion, or region, or...may be undersampled, invalidating the stated error. This means that the "random" sample must have rules attached to it that preclude true randomness. Statisticians often apply post-sampling corrections instead: a random sample is taken, and then various models are used to extrapolate different subgroups up or down to their representation in the general population. Those models, of course, are quite variable.

In the end this means that the stated % error is pretty much crap. The only validity it holds in practice comes from a particular statistician's past track record, and even that is meaningless if something fundamental changes from past samples.
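The two ideas above can be sketched in a few lines of Python. This is a minimal illustration, not a real polling model: the 1/sqrt(N) rule of thumb reproduces the 1% and 3% figures, and the subgroup counts, means, and population shares in the reweighting example are made-up numbers chosen only to show how a post-sampling correction shifts the estimate.

```python
import math

# Rule-of-thumb sampling error: roughly 1/sqrt(N) of the mean.
def margin_of_error(n):
    return 1.0 / math.sqrt(n)

print(round(margin_of_error(10_000), 3))  # 0.01  -> the ~1% figure
print(round(margin_of_error(1_000), 3))   # 0.032 -> the ~3% figure

# Post-sampling correction sketch (hypothetical data): each subgroup's
# result is reweighted to that group's assumed share of the population.
sample = {"group_a": {"count": 700, "mean": 0.55},
          "group_b": {"count": 300, "mean": 0.40}}
population_share = {"group_a": 0.5, "group_b": 0.5}  # assumed true shares

total = sum(g["count"] for g in sample.values())
raw_mean = sum(g["count"] * g["mean"] for g in sample.values()) / total
weighted_mean = sum(population_share[k] * g["mean"]
                    for k, g in sample.items())

print(round(raw_mean, 3))       # 0.505 -- unweighted, group_b undersampled
print(round(weighted_mean, 3))  # 0.475 -- after reweighting
```

Note that the corrected number depends entirely on the assumed population shares, which is exactly the point: the stated 3% error says nothing about how good those assumptions are.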

I'm fairly certain I lost sense in there somewhere so I'm posting this for now. May correct later.
