I think because it's called "standard deviation" that it sounds like the thing to use or look for. It sounds more correct because of the word standard.
I feel like it is the same kind of failing, due to human perception of language, that programmers have with the idea of exceptions and errors, especially the phrase "exceptions should only be used for exceptional behaviors". That's a cool phrase, but people latch on to it because the word "exception" sounds like something extremely rare and out of the ordinary, whereas we see errors as common. They are in fact the same thing: broke is broke, it doesn't matter what you call it, but thousands of programmers think differently because of the name we gave it.
We are human and language absolutely plays a role in our perception of things.
> I think because it's called "standard deviation" that it sounds like the thing to use or look for.
Yes! Because it's an awesome trick and lets you do good estimates on napkins.
The other day I was buying lunch at a food cart and thought about how much change the food carts had to carry, as a function of how many customers they have, under the assumption that they want to be able to provide correct change to 99% of their customers.
Let's say that the average amount of change a customer needs is $5, and a 99th-percentile customer needs $15 in change. If we pretend the distribution is approximately Gaussian, we can calculate that 1,000 food carts with 1 customer each would need $15,000 in change (each cart carries $15 to cover its 99th-percentile customer), but 1 food cart with 1,000 customers would need $5 x 1,000 + ($15 - $5) * sqrt(1,000) ≈ $5,320. That's math you can do in your head without a calculator (being a programmer, 1,000 ≈ 2^10 so sqrt(1,000) ≈ 2^5 = 32).
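Here's the same arithmetic as a quick Python sketch, plus a small simulation to sanity-check it. Nothing above pins down the per-customer distribution, so the lognormal below is my own illustrative choice, tuned to give roughly a $5 mean and a $15 99th percentile:

    import math
    import random

    # Napkin estimate: mean change of $5/customer, a 99th-percentile customer
    # needs $15, one cart serving n customers.
    n = 1000
    mean, p99 = 5.0, 15.0
    napkin = mean * n + (p99 - mean) * math.sqrt(n)
    print(f"napkin estimate for {n} customers: ${napkin:,.0f}")  # ~$5,316

    # Sanity check with a toy per-customer distribution.  These lognormal
    # parameters are an assumption on my part, tuned so the per-customer
    # mean is ~$5 and the 99th percentile is ~$15.
    random.seed(0)
    mu, sigma = 1.467, 0.5335

    def day_total():
        return sum(random.lognormvariate(mu, sigma) for _ in range(n))

    days = sorted(day_total() for _ in range(2000))
    print(f"simulated 99th-percentile day:     ${days[int(0.99 * len(days))]:,.0f}")

Both numbers should come out in the low $5,000s, which is the whole trick: the buffer above the mean only grows like sqrt(n), while the mean itself grows like n.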
The standard deviation and assumptions of normality are so useful because of the central limit theorem: if you sum many iid variables with finite standard deviation, the distribution of the sum approaches a Gaussian as the number of variables increases.
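For the curious, here's a minimal sketch of that convergence, with exponential variables standing in for "any iid variable with finite standard deviation" (my choice, purely for illustration):

    import math
    import random

    # Sum n heavily right-skewed variables (exponential, mean = sd = 1) and
    # compare the empirical 99th percentile of the sum against the Gaussian
    # formula mean*n + z99*sd*sqrt(n).
    random.seed(1)
    n, trials = 1000, 5000
    mean, sd, z99 = 1.0, 1.0, 2.326

    sums = sorted(sum(random.expovariate(1.0) for _ in range(n)) for _ in range(trials))
    empirical_p99 = sums[int(0.99 * trials)]
    gaussian_p99 = mean * n + z99 * sd * math.sqrt(n)
    print(f"empirical 99th percentile: {empirical_p99:.1f}")
    print(f"Gaussian approximation:    {gaussian_p99:.1f}")  # agrees to well under 1%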
Then you say "Well, the standard deviation weighs the tail too heavily" and the response is "Well, use higher-order moments then, that's what they're for".
It's a neat math trick, but it seems more accurate to say this lets you calculate bad estimates on the back of a napkin. Unless you really think food carts carry $5000 in change.
The quantitative work I do has to do with measuring latency, where the minimum, median, 90th percentile, and 99th percentile are more meaningful than the mean or standard deviation. Programs typically have a best-case scenario (everything cached) and a long one-sided tail.
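For example, here's a toy latency distribution (the numbers are entirely made up) where the mean and standard deviation say almost nothing useful while the percentiles tell the story:

    import random
    import statistics

    # Toy latency model in milliseconds: most requests hit a cache and return
    # fast, a few percent miss and pay a long one-sided tail.
    random.seed(2)

    def request_ms():
        if random.random() < 0.95:
            return random.uniform(1.0, 3.0)              # cached: 1-3 ms
        return 20.0 + random.expovariate(1 / 40.0)       # miss: 20 ms + long tail

    samples = sorted(request_ms() for _ in range(100_000))

    def pct(p):
        return samples[int(p / 100 * len(samples))]

    print(f"min    {samples[0]:7.1f} ms")
    print(f"median {pct(50):7.1f} ms")
    print(f"p90    {pct(90):7.1f} ms")
    print(f"p99    {pct(99):7.1f} ms")
    print(f"mean   {statistics.mean(samples):7.1f} ms")
    print(f"stdev  {statistics.stdev(samples):7.1f} ms")

The mean ± stdev summary comes out to roughly 5 ± 15 ms, which describes no request that ever actually happened; the min/median/p90/p99 line up with the cached fast path and the long tail.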
Saying that it's silly to think food carts carry $5,000 in change is unproductive: I was illustrating how the calculation works, not how the economics of food carts works, and the numbers were chosen for illustration, not to reflect reality. (1,000 customers in a day? Not likely; my guess is 200 for the busiest food carts.)
But it's good to have bad estimates; at least, it's better to have bad estimates than to have no estimates at all. I'm not saying that standard deviation is a substitute for more thorough analysis, just that it's an improvement over talking about the mean alone.
Another example: "We'd like to hire you; the mean number of hours per week you'd work is 40."
Versus:
"We'd like to hire you; the mean number of hours per week you'd work is 40, and the standard deviation is 15." So your bad estimate is that you'd have two 70-hour weeks each year. But it's better than no estimate.
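For what it's worth, the Gaussian arithmetic behind that rough figure fits in a few lines of Python (it comes out closer to one 70-hour week a year than two, but either way it's exactly the kind of bad-but-useful estimate I mean):

    import statistics

    # Hours per week modeled as N(40, 15); 70 hours is two standard
    # deviations above the mean.
    week = statistics.NormalDist(mu=40, sigma=15)
    p_over_70 = 1 - week.cdf(70)
    print(f"P(week >= 70h)           = {p_over_70:.3f}")      # ~0.023
    print(f"expected 70h+ weeks/year = {p_over_70 * 52:.1f}")  # ~1.2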
Sure, two points are better than one, but what's special about two? I'd rather have a graph. We have computers, so there's rarely a reason to compress the data so much.
We often have to compress the data down to a single decision or statistic: should I accept the job offer (yes/no), how much money should I save before buying a house, or what's the probability that I'll die in the next 10 years?
I hate to quote XKCD, but it's like saying your favorite map projection is a globe (http://xkcd.com/977/). Yes, you've preserved all the data, but even with computers, your beloved graph will not make it all the way to the end.
Preserving all the data is the logical endpoint but that's not what I was suggesting. I'm just saying there's nothing special about keeping two points.
I'd rather not feed two points to my decision algorithm, whether it's machine learning or a human looking at the data. It makes more sense to make some attempt to preserve the shape of the graph unless you have strong reason to believe it's Gaussian, and even then the assumption should be checked.
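To make "the assumption should be checked" concrete: about the cheapest check I know of is comparing a few empirical percentiles against a Gaussian fitted to the same mean and standard deviation. A minimal sketch (the function name and the choice of percentiles are my own):

    import random
    import statistics

    def gaussian_sanity_check(data, percentiles=(50, 90, 99)):
        """Compare empirical percentiles with those implied by a Gaussian fit."""
        data = sorted(data)
        fit = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))
        for p in percentiles:
            empirical = data[int(p / 100 * len(data))]
            predicted = fit.inv_cdf(p / 100)
            print(f"p{p}: empirical {empirical:8.2f}   Gaussian fit {predicted:8.2f}")

    # Genuinely Gaussian data lines up; the one-sided latency data further up
    # the thread would miss badly at p99.
    random.seed(3)
    gaussian_sanity_check([random.gauss(40, 15) for _ in range(10_000)])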