Gigerenzer’s simple rules of thumb often outperform complex models (foundingfuel.com)
183 points by haltingproblem on March 3, 2021 | 46 comments


This result is still not widely known, but a simple equal-weighted rule outperforms Markowitz mean-variance optimization for most alpha trading strategies, i.e. strategies that predict outperformance.

Peter Muller showed this in a paper as early as 1993 [1]. Muller went on to start PDT and make Morgan Stanley billions of dollars in profits.

The same effect can be shown, to a limited extent, for index portfolios [2], though it is less scalable there because higher-cap stocks have more float.

Related is the Kelly betting criterion - about 50 years old now [3] - which gives a simple rule for picking bets and allocating capital to them.

These two ideas - equal-weighted portfolios and fractional (Kelly) betting - would put a lot of quants out of business.
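For concreteness, a rough sketch of both rules (illustrative only, not from the papers below; the numbers are made up):

    # Kelly fraction for a binary bet: f* = p - (1 - p) / b,
    # where p is the win probability and b the net odds (payout per unit staked).
    def kelly_fraction(p: float, b: float) -> float:
        return p - (1.0 - p) / b

    # 1/N portfolio: the same weight for every asset, no estimation required.
    def equal_weights(n_assets: int) -> list[float]:
        return [1.0 / n_assets] * n_assets

    # Hypothetical numbers: a 55% win probability at even odds -> stake about 10% of bankroll.
    print(kelly_fraction(p=0.55, b=1.0))   # ~0.1
    print(equal_weights(5))                # [0.2, 0.2, 0.2, 0.2, 0.2]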

[1] https://www.semanticscholar.org/paper/Financial-optimization...

[2] https://www.semanticscholar.org/paper/Why-Does-an-Equal-Weig...

[3] https://en.wikipedia.org/wiki/Kelly_criterion


Decent quants know about the shortcomings of Markowitz allocation. The mean is more or less impossible to measure, and covariances shift over time and are awkward to measure concurrently. So the art is in specifying the parameters. I've used zero mean and block-diagonal covariance in asset allocation before, which isn't far off 1/N allocation.
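To make that concrete, here is a toy version of the collapse (illustrative only): with expected returns set to zero, the mean-variance problem reduces to the minimum-variance portfolio w ∝ Σ⁻¹·1, and with a (block-)diagonal covariance of broadly similar variances that lands in the neighbourhood of 1/N.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5

    # Hypothetical diagonal covariance: volatilities drawn between 15% and 25%.
    vols = rng.uniform(0.15, 0.25, size=n)
    sigma = np.diag(vols ** 2)

    # Minimum-variance weights: solve sigma @ w = 1, then normalize to fully invested.
    w = np.linalg.solve(sigma, np.ones(n))
    w /= w.sum()

    print(np.round(w, 3))   # clustered around 0.2, tilted toward the lower-variance names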

The issue with Kelly betting is the same as with Markowitz: knowing the parameters of the problem. If you overstate your probability of win by mis-estimating, you'll get ruined. For what it's worth, the version of Kelly betting I'm familiar with is for Bernoulli trials. I think the Gaussian version of the Kelly criterion essentially reduces to Markowitz.

Side note: Kelly betting isn't new! Daniel Bernoulli knew about it back in the 1700s. Kelly just connected the formula to ideas in information theory.


>covariances shift over time

This is what I thought when I first ran across the theory in a book. I feel like it's a widespread pattern: there is some big intractable problem, and some genius says "aha, I can reduce this to...", but what they have reduced it to is also an intractable problem; only because of the novelty does it feel like progress.

Everybody goes around saying "ha, you can't predict future returns from prior returns, aren't we smart", but how can you predict future covariances from prior covariances any better?

I'm not sure how useful Kelly betting is, though, for similar reasons. Isn't it premised on knowing your edge? Knowing how much to bet is not such a valuable secret; you need to figure out how to identify your edge reliably - particularly that it's not negative.

I eventually decided the way to go is (1) start with every idea or security equal weighted, and (2) basically never rebalance, because things that outperform do tend to continue to do so and the "imbalance" reflects new knowledge disseminated by the market.


I've often wondered about the 1/n strategy. It's simple enough: invest in n assets equally, and rebalance them regularly.
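The mechanical part, for reference, is just something like this (made-up holdings and prices); the hard part is the questions below.

    # Target share counts so that each asset ends up holding 1/N of portfolio value.
    def rebalance_to_equal_weight(holdings: dict[str, float],
                                  prices: dict[str, float]) -> dict[str, float]:
        total_value = sum(holdings[a] * prices[a] for a in holdings)
        target_value_per_asset = total_value / len(holdings)
        return {a: target_value_per_asset / prices[a] for a in holdings}

    holdings = {"AAA": 10.0, "BBB": 40.0, "CCC": 25.0}   # share counts (hypothetical)
    prices   = {"AAA": 50.0, "BBB": 20.0, "CCC": 30.0}   # prices (hypothetical)
    print(rebalance_to_equal_weight(holdings, prices))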

However, there are some questions: how does one know which n assets to pick? Should they be mostly decorrelated? How decorrelated? And shouldn't all n assets have > 50% probability of growth? And what should the number n be? These aren't always easy questions to answer. After all, you could pick n dud assets that are highly correlated, and the strategy would fail.

I'm curious: how do folks apply the 1/n strategy in practice?


Generalizing, there are 3 problems:

1- Alpha prediction (forecasting a return for each asset)

2- Portfolio Construction

3- Portfolio realization - trading

I was referring to problem #2, which assumes you have a priori alpha forecasts and the relative confidence of those forecasts. In the general portfolio construction problem, the past mean returns of the stocks are taken to be the future returns.


This study [1] does a good job rigorously comparing 1/N to various alternative trading strategies. It also talks about situations where 1/N is more or less likely to dominate.

"Our simulations show that optimal diversification policies will dominate the 1/N rule only for very high levels of idiosyncratic volatility."

[1] http://faculty.london.edu/avmiguel/DeMiguel-Garlappi-Uppal-R...


Effectively, every "knob" your model has is a major point against the model, but because people don't really get that (it's another "Linda is a bank teller and is active in the feminist movement" instance, i.e. the conjunction fallacy), they tend to overcomplicate their models relative to the amount of validation they've performed on the model.

An alternative to putting nearly all the weight on a single complicated model created ex nihilo is to have some reasonable "zero information" prior distribution over models (weighted by complexity), in which, because rules of thumb are very simple models, those simple models dominate. In fact, the alternative offered is itself a simple rule-of-thumb approach to this: it basically says that the more complicated models can be almost entirely neglected, so you can just do an even split over rules of thumb and still outperform any model that naïvely tries to “know best” by accounting for everything up front.

In short, it's really fucking hard to properly validate that a model applies to reality, and until you appreciate just how hard it is, you're much better off being radically biased towards simplicity.


This applies to trading perfectly as well. It's rare that the true math studs are the rainmakers (though it does happen); instead, the street-smart players who just try to be reasonable end up dominating. By "street smart" I don't mean people who don't do math, I'm just talking about people who aren't math olympiad types and don't have PhDs in theoretical mathematical fields.

80% models take 20% of the work and are usually more robust during market dislocations and highly resistant to noise and uncertainty compared to very mathematical approaches.

Really good point about verifying a model as well. Once your model is complex enough, it becomes highly probable that much of the model is actually hurting you, but you will have an incredibly hard time determining which part. With a simple model, things become a lot easier, and as such the chance of success becomes much higher.

Many people have this misconception that sophisticated players like Citadel or Jane Street are using these complex non-linear models to generate profit. This couldn't be further from the truth: linear models dominate the space at the moment and are the most valuable tool you have.

Of course, linear regression isn't going to do a good job at identifying cat pictures or translating text, but that's because those tasks have relatively high signal to noise ratios. The cleaner the data, the less random the data, the better complex models perform. On the other hand with any kind of financial data, the signal is incredibly weak and a complex non-linear model will most likely get confused by the random patterns in the data.
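A toy way to see it (synthetic data, nothing like a real trading signal): fit plain OLS and a high-degree polynomial to a weak linear signal buried in noise, then compare out-of-sample error.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x = rng.normal(size=n)
    y = 0.05 * x + rng.normal(scale=1.0, size=n)   # tiny signal, lots of noise

    x_train, x_test = x[:100], x[100:]
    y_train, y_test = y[:100], y[100:]

    def out_of_sample_mse(degree: int) -> float:
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        pred = np.polyval(coeffs, x_test)
        return float(np.mean((y_test - pred) ** 2))

    print("linear   :", out_of_sample_mse(1))
    print("degree 12:", out_of_sample_mse(12))   # typically noticeably worse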


Optimized systems are fragile. Probably true of models.

The most profound class I had in college was thermodynamics, where all the complexity of physical systems is lumped into a few parameters. It seems crude but those parameters include all the energy and can be used to make very useful calculations.


Is this a fair restatement: “Models (and systems) that accurately reflect reality with a high fidelity due in part to their accurate assignment of values to a larger number of parameters are fragile”?

I agree that that's true, but my point (and I think part of what the article is saying) is that most “optimized” systems/models are not in fact doing that. In which case, the point I'd take (if I read you correctly) is that even once you _have_ managed to accurately set the parameters (which is harder to do and validate than it looks), you still have a fragile model/system.

Alternatively [edited to add]:

Define a model as a system that is attempting to match another system in behaviour. And so now you have two problems: ① a complicated model is itself fragile (by analogy to the system it's modelling), and ② a complicated model has more ways to diverge from the system it's trying to model.

Hmm, in which case it's kinda the same thing after all, except one is about making the model in the first place, and the other is maintaining it, maybe?


Interesting point re: thermodynamics though.

It occurs to me that there's a sort of “learned helplessness” trap you can fall into from the thermodynamic base case: you can fail to separate _anything_ from placebo if your model includes too few of the real levers in the modelled system and too many levers that don't actually exist in it. You can turn any signal into noise if you slice it up too much; that's basically half of the replication crisis. Studies on anti-depressants are an interesting example.


It’s helpful to add a regularization term to your model so that, as you’re building it, you penalize high-parameter models relative to low-parameter ones. It’s commonly done in ML contexts, but the idea is sound more generally.

A similar idea in genetic algorithms is to penalize the length of the “genome” being optimized.

All of this is related to the bias-variance trade-off, overfitting, and the fact that with a polynomial of sufficiently high order you can “perfectly” fit any number of points and still end up with a terrible approximator.
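A minimal hand-rolled version of the idea, using a ridge-style (L2) penalty on made-up data:

    import numpy as np

    # Closed-form ridge regression: solve (X'X + lam*I) w = X'y. The penalty shrinks
    # coefficients, so extra parameters have to pay for themselves in reduced error.
    def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

    # Hypothetical data: 3 informative features plus 7 pure-noise ones.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

    print(np.round(ridge_fit(X, y, lam=0.0), 2))    # plain least squares
    print(np.round(ridge_fit(X, y, lam=10.0), 2))   # noise coefficients shrink toward 0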


In case anyone is curious, the most aggressive form of regularization is sometimes called L0 regularization or sparse regression. Here, there is a penalty term for every additional parameter used in the model. This allows you to define a cost associated with model complexity (i.e. if an improvement to the fit is not greater than some value, the number of parameters is not increased). Solving for the best set of parameters can be tricky and thus LASSO (or L1 regularization) is often a good substitute.
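For example, with scikit-learn's Lasso on made-up data (a sketch, not a recipe):

    import numpy as np
    from sklearn.linear_model import Lasso

    # L1 regularization zeroes out coefficients of uninformative features,
    # approximating the pay-per-parameter behaviour of L0 at tractable cost.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

    model = Lasso(alpha=0.1).fit(X, y)
    print(np.round(model.coef_, 2))   # most entries end up exactly 0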


“In the ballpark”

It’s not really new, to say that you should use simple heuristics to bound fancy math. Physics does it all the time — eg, unless you’re standing next to a black hole, your Newtonian and relativistic calculations should nearly agree.

The problems with simple metrics arise when you have complex distributions that are poorly represented by the mean, e.g. the shell of a sphere (whose mean sits at the centre, where none of the shell actually is).

If you design a product to appeal to the “center of the sphere”, then it'll be mediocre for everyone and you'll end up being outcompeted on every front by people who design slightly more “biased” products.

I spend a lot of my time explaining that to executives.


A documentary I strongly recommend to everybody (it's on YouTube): Tao Ruspoli's "Being in the World".

It's ostensibly about the life and work of Hubert Dreyfus, the famous Berkeley philosopher who was an AI skeptic. But it interviews carpenters and jazz and flamenco musicians and cooks etc., and (way oversimplifying here) the takeaway is that the skill is not in you, it's in this moment of skillful interaction with the world.

Gigerenzer seems to come awfully close to this point, but keeps missing it. At the critical moment the genius baseball batter is one with his bat, with the wind, with the noises, with each signal coming from his musculature -- this is not a Cartesian subject metaphorically manipulating levers and keys to move a robot, this is a situation. It's a whole thing.

This is why even in sports where pitch sizes are regulated (like soccer) there's a substantial home-game advantage even in the absence of crowds. That's where they train. This is why pure mathematicians obsess over that particular chalk. This is why Kenny G sounds so shitty (he's recording against a backing track in an aquarium in a studio), and why improvised jazz sounds so great.


"Standard pitch measurements. Not all pitches are the same size, though the preferred size for many professional teams' stadiums is 105 by 68 metres (115 yd × 74 yd) with an area of 7,140 square metres (76,900 sq ft; 1.76 acres; 0.714 ha)."

The field dimensions are within the range found optimal by FIFA: 110–120 yards (100–110 m) long by 70–80 yards (64–75 m) wide.

Managers can instruct groundskeepers to alter the pitch size at home.

I heard that, without home crowds, referees no longer feel pressured into giving (or not giving) penalties the way they may have been pressured by home crowds before.

Some sports have fixed sizes.

I’ll watch your recommendation. Thanks


This feels like Gigerenzer has read my mind. Two weeks back I blogged about a simple heuristic to model uncertainty in software estimates [1], and I received confirmation from several people that my proposed technique might be spot-on: estimate your happy path, figure out how many dimensions your uncertainty has, and multiply the happy-path estimates by sqrt(pi)^dimensions.

[1] https://tojans.me/posts/software-estimates-done-right/


I think the only reason this works is that it's a power law. A power law gives you a semi-principled way of talking about different scales: atomic, table, house, world, galaxy... The rate of the power increase corresponds to our "magnification factor"

Probably many sequences of powers work, and many bases: base^0.5, base^1, base^1.5, base^2, ...

eg., 4^0.5, 4^1, 4^1.5, 4^2, 4^2.5 = [2, 4, 8, 16, 32]

If the estimate is 3 days, then the scope is: [3, 6, 12, 24, 48, 96]

Now that we have a "semi-principled" set of numbers to choose from, the estimator applies their domain knowledge to select the relevant scales: 3 to 24, say, based on "how big the project could be".

i.e., is it both possibly a small change in one area (to a table), and possibly "reworking the foundations of the house"? If so, you have the small-scale starter number, and you can dial up using this sequence to whatever magnification suits.

Probably with base pi and the sequence (0.5, 1, 2) we capture the three most relevant scales for most properly discrete software tasks. That's essentially coincidental, in that I'm sure a base anywhere from 2 to 4 with several sequences would work.
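In code the whole scheme is just something like this (my paraphrase of the idea):

    # Geometric ladder of multipliers: base^0, base^0.5, base^1, ...
    def scope_ladder(estimate_days: float, base: float = 4.0, steps: int = 5) -> list[float]:
        return [estimate_days * base ** (k * 0.5) for k in range(steps + 1)]

    print(scope_ladder(3.0))   # [3.0, 6.0, 12.0, 24.0, 48.0, 96.0]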


Update: looks like my assumptions might be completely wrong... Please ignore my initial post, and I will check if/when I can provide an update to the blog post, potentially discarding the sqrt PI idea.


Okay, I'm stumped. Isn't the Gaussian function a probability density function, which means it should have an area of 1 by definition? Are you taking it as f(x) = exp(-x^2), to keep f(0) equal to 1?


You're stumped because most of the "statistics primer" section in that post doesn't make any sense. The connection between the Gaussian density and the sqrt(pi) heuristic is mostly imaginary. The original heuristic (pi, sqrt(pi), pi^2) works pretty much the same with 3 instead of pi, so you can view the pi versions as numerology or charitably a nice mnemonic.


Could you elaborate? Assume we convert all our happy-path estimates to minutes. What I'm saying is that each "estimated minute" is more likely to have a Gaussian distribution than a standard uniform distribution, because a normal distribution is more likely to occur in nature.

While I can understand that this is a controversial assumption, I'm not the first one to make it. Referring to numerology seems a bit odd?

I'm really looking for a proper way here, I'm quite a rational person, so numerology is not really my cup of tea TBH.


Your blog post has so many errors that I don't even know where to start. As another poster mentioned, areas under non-degenerate probability density functions are 1 by definition, whether they're uniform, Gaussian, standardized or not. What you described as a "standard uniform distribution" is really a degenerate distribution[1], meaning that you assume no uncertainty at all (stdev=0). There's nothing "uniform" about that, you might just as well start with a Gaussian with stdev=0.

"converting from a standard uniform distribution to a Gaussian distribution" as you described does not make any sense at all. If you replace an initial assumption of a degenerate distribution with a Gaussian, as you seem to be doing, you replace a no-uncertainty (stdev=0) assumption with some uncertainty (so the uncertainty blow-up is infinite), but it doesn't affect point estimates such as the mean or median, unless you make separate assumptions about that. There is nothing in your story that leads to multiplying some initial time estimate by sqrt(pi). The only tenuous connection with sqrt(pi) in the whole story is that the Gaussian integral happens to be sqrt(pi). There are some deep mathematical reasons for that, which has to do with polar coordinates transformations. But it has nothing to do with adjusting uncertainties or best estimates.

[1] https://en.wikipedia.org/wiki/Degenerate_distribution


Thank you for your valuable feedback; it will take some time to process, and I will adjust the blog post as my insights grow (potentially discarding the whole idea, but for me it's a learning process.)


> What I'm saying is that each "estimated minute" is more likely to have a gaussian distribution

If that were your actual assumption, you should measure the variance of the difference between your estimate and the actual time taken, use that to determine a confidence interval (e.g. 95% of the time, the additional delay is less than x) and then add it to your estimate.
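Concretely, something like this (a distribution-free variant using an empirical quantile of past overruns instead of a Gaussian assumption; the numbers are invented):

    import numpy as np

    # Differences between actual and estimated time on past tasks, in days.
    past_overruns = np.array([-0.5, 0.2, 1.0, 2.5, 0.8, 4.0, 0.0, 1.5, 3.2, 0.7])

    # 95% of past overruns were below this margin; add it to the new estimate.
    margin_95 = np.quantile(past_overruns, 0.95)

    happy_path_estimate_days = 3.0
    print(happy_path_estimate_days + margin_95)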


Addendum: this is what I was referring to: https://en.wikipedia.org/wiki/Gaussian_integral


My apologies, I'm by no means well versed in math, so I might use the wrong terminology.

Here is a good example video of what I am alluding to: https://www.youtube.com/watch?v=9CgOthUUdw4

Maybe I should refine my blog post or include the video in the explanation?


You are the second person I have come across that uses PI to get software estimates. "I am not sure why but it works, except when it doesn't; then I know exactly why."

What I found was that most people over- or under-estimate things, and they tend to do it consistently at the same rate. Typically they have a scaling factor you can use. Around 3 seems to be the sweet spot for most people. You could just as easily use 3.1 or 3.2 and get a similar answer. I usually go for 3 because I can do that in my head without a calculator.


**** I stand corrected and marked that section from the blog post as BS; please accept my apologies. ****


I can recommend the book "Simple Heuristics That Make Us Smart" by Gigerenzer; it's an insightful and entertaining read. For this book he and his co-authors put their money where their mouth is and invested in the stock market based on the simplest heuristics imaginable.


Gigerenzer has published several books, but I think that is the best one, because the later books largely repeat the same ideas. Although I think his ideas are good, they have drawn a lot of criticism. There are several discussion articles in issue 23(5) of Behavioral and Brain Sciences.



TEDx you mean. Which matters quite a bit, given that there are effectively no restrictions on who can give a TEDx talk :p

Which isn't to say that it's useless, just that the "TED" gatekeeping role isn't actually happening here, instead the gatekeeper is "people who have heard of TEDx", which is very different and imo quite a bit less useful.



This is less well-known, but worth noting: Gerd Gigerenzer is a long-time critic of Kahneman and Tversky's work (quite popular here, understandably). So it is instructive to see the other side of the coin.

I'm a little embarrassed that I only learnt of Gigerenzer's work a few months ago, when I accidentally discovered an EconTalk podcast[1] with him.

He talks about the "fast and frugal trees"[2] heuristic, criticism of the "Linda Problem" (mentioned in the article and, incidentally, also discussed on HN last week in that 'cognitive bias' thread[3]), and other interesting bits. It's definitely worth a careful listen—especially if you've previously read Kahneman's book.

[1] https://www.econtalk.org/gerd-gigerenzer-on-gut-feelings/

[2] https://en.wikipedia.org/wiki/Fast-and-frugal_trees

[3] https://news.ycombinator.com/item?id=26263458


Nice to see someone being critical of Kahneman's ideas. I broadly like the rationalist community, but the focus on trying to system-2 all the things is ludicrous.


Yeah; trying to "system-2 all the things" is indeed ludicrous, and is simply unsustainable in terms of energy, as we can well attest from our own experience. Kahneman himself doesn't advocate that extreme, but he does nudge you in that direction.

Also see: the failed collaboration with his intuition-focused colleague, Gary Klein. They tried to map "the boundary conditions that separate true intuitive skill from overconfident and biased impressions." They couldn't arrive at an agreement.

They documented their failure to reach an agreement in their paper: "Conditions for intuitive expertise: A failure to disagree"[1].

[1] https://pubmed.ncbi.nlm.nih.gov/19739881/


I never got the System I and System II distinction. It sounds like a modern version of the bicameral mind. Is the System I/II notion falsifiable? The duality also reminds me of the mind/matter boundary.

But what do I know? Kahneman got a Nobel for it and along with Tversky is considered one of the most brilliant psychologists of the 20th century.


So much has been built upon this framework in psychology and decision making that it's probably worth reading Thinking, Fast and Slow. Not all of it has survived the replication crisis (in particular some of the experiments on priming), but much is still relevant.


my reading of it is that system 1 is instinctual and system 2 is intellectual, but of course the boundaries are blurry and arbitrarily defined.

> The duality also reminds of the mind/matter boundary

Nice, I'm a nondualist too. Yeah it's a strange delineation into "good" system 2 and "bad" system 1.

> Kahneman got a Nobel for it and along with Tversky is considered one of the most brilliant psychologists of the 20th century.

I don't think the Nobel committee have a monopoly on picking good or bad ideas ;)


I find this entire field of study fascinating. The reason is simple: ever since the start of the Scientific Revolution, problems have become quite complex and multivariate, and we have gotten into the habit of trying to solve them all using mathematical models. This has the advantage that, once proven, a model can be applied "mindlessly" towards a desired result. With heuristics, so many factors have to come together in an individual for any given event that repeatability is not certain. Therein lies the difficulty.


It really depends on what you mean by "outperform". If your goal is to catch the baseball, a simple gaze heuristic works well. If you want to understand how angle of impact, air resistance, rotation etc. affect where the ball will land, of course you need a "complex" model that includes these factors. The most complete model, the one that provides our best explanation of the system, isn't always the model that yields the best predictions.


Simple models have fewer sources of error, and the small number of parameters can be better validated against reality.


That certainly sounds plausible; however, I feel that it is the applicability (or non-applicability) of the parameters themselves that is key. You need to choose the right parameters before even bothering about the model function, and this is where prior knowledge and experience are important.

The works of Gary Klein and Nassim Taleb are also relevant here.


Nassim Taleb actually cites Prof. Gigerenzer on the topic of simple heuristics.


(2017)



