
Sounds like you already know this, but that's not great and will give a lot of false positives. In science this is called p-hacking. The rigorous way to use hypothesis testing is to calculate the required sample size for the expected effect size and run only one test, once that sample size is reached. But this requires knowing the effect size.
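
For example (a sketch using statsmodels; the baseline and lift numbers are made up):

    # Power analysis: users needed per variant before running the one test.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Assumed numbers: 5% baseline conversion, hoping to detect a lift to 6%.
    effect = proportion_effectsize(0.05, 0.06)

    n = NormalIndPower().solve_power(
        effect_size=effect,
        alpha=0.05,  # significance level
        power=0.8,   # chance of detecting the effect if it is real
        ratio=1.0,   # equally sized groups
    )
    print(f"~{n:,.0f} users per variant")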

If you are doing a lot of significance tests, you need to divide the significance threshold by the number of implicit comparisons (a Bonferroni correction), so e.g. only accept p < 0.001 rather than p < 0.05 if you run one test per day for 50 days.
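
The adjustment itself is mechanical (statsmodels again; the p-values here are invented):

    from statsmodels.stats.multitest import multipletests

    p_values = [0.04, 0.01, 0.20, 0.003]  # one value per daily peek
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05,
                                        method="bonferroni")
    print(p_adj)   # each raw p multiplied by the number of looks (capped at 1)
    print(reject)  # significant only if the adjusted p clears 0.05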

Alternatively, just do Thompson sampling until one variant dominates.
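
Something like this, for two variants with binary conversions (a Beta-Bernoulli sketch; the uniform priors are an assumption):

    import random

    alpha = [1, 1]  # 1 + observed successes per variant
    beta = [1, 1]   # 1 + observed failures per variant

    def choose_variant():
        # Draw a plausible conversion rate from each posterior and
        # serve the variant whose draw is highest.
        draws = [random.betavariate(alpha[i], beta[i]) for i in (0, 1)]
        return draws.index(max(draws))

    def record(variant, converted):
        if converted:
            alpha[variant] += 1
        else:
            beta[variant] += 1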



To expand: the p-value tells you significance (more precisely, the likelihood of the effect if there were no underlying difference). But if you observe it over and over again and stop the moment one value looks good, you've subverted the measure.
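
You can watch the subversion happen in a quick simulation (numpy/scipy; the peek schedule and rates are arbitrary, and both variants are identical by construction):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    false_positives, runs = 0, 1000

    for _ in range(runs):
        a = rng.binomial(1, 0.05, size=10_000)  # same true rate
        b = rng.binomial(1, 0.05, size=10_000)  # for both variants
        # Peek after every 1,000 users; stop at the first p < 0.05.
        for n in range(1_000, 10_001, 1_000):
            _, p = stats.ttest_ind(a[:n], b[:n])
            if p < 0.05:
                false_positives += 1
                break

    print(false_positives / runs)  # lands well above the nominal 0.05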

Thompson/multi-armed bandit optimizes for outcome over the duration of the test, by progressively altering the treatment %. The test runs longer, but yields better outcomes while doing it.
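
A toy simulation makes the shift concrete (the true rates are invented):

    import random

    alpha, beta = [1, 1], [1, 1]
    true_rate = [0.05, 0.06]  # variant 1 is genuinely better
    served = [0, 0]

    for _ in range(20_000):
        draws = [random.betavariate(alpha[i], beta[i]) for i in (0, 1)]
        v = draws.index(max(draws))
        served[v] += 1
        if random.random() < true_rate[v]:
            alpha[v] += 1
        else:
            beta[v] += 1

    print(served)  # traffic progressively drifts toward the better variant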

It's objectively a better way to optimize, unless there is time-based overhead to the existence of the A/B test itself. (E.g. maintaining two code paths.)


I just wanted to affirm what you are doing here.

A key point here is that p-values optimize for detection of effects if you do everything right, which, as you point out, is not common.

> Thompson/multi-armed bandit optimizes for outcome over the duration of the test.

Exactly.


The p-value is the risk of observing an effect due purely to sampling error, under the assumption of perfectly random sampling with no real effect. It says very little.

In particular, if you aren't doing perfectly random sampling, it is meaningless. If you are concerned about types of error other than sampling error, it is equally meaningless.

A significant p-value is nowhere near proof of effect. All it does is suggestively wiggle its eyebrows in the direction of further research.


> likelihood of the effect if there were no underlying difference

By "effect" I mean "observed effect"; i.e. how likely are those results, assuming the null hypothesis.



