This is pure speculation, but what are the chances this change is simply an attempt to provide legal cover for what they might have started doing 50 versions ago?[1]
Ah, but what you are interpreting as plain English is actually a term of art in marketing that means "this will change as soon as it becomes more profitable to do so".
One funny thing about Mersenne primes is that, as a result of what you describe, they are exactly those primes whose binary representation consists entirely of ones (and the number of ones is necessarily prime)!
The smallest Mersenne prime, three, is binary 11; the next is seven (111), then 31 (11111), then 127 (1111111). The next candidate, 2047 (11111111111), is not prime.
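Not from the parent comment, but a quick sanity check in Python (plain trial division, so only small exponents) makes the pattern easy to see:

    def is_prime(n):
        # trial division; fine for the small numbers used here
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    # A Mersenne prime 2**p - 1 is written in binary as p ones.
    for p in range(2, 20):
        candidate = 2**p - 1
        if is_prime(candidate):
            print(p, candidate, bin(candidate))

    print(is_prime(2047), bin(2047))  # 2047 = 23 * 89, not prime despite 11 ones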
> the SSH certificates issued by the Cloudflare CA include a field called ValidPrinciples
Having implemented similar systems before, I was interested to read this post. Then I saw this. Now I have to find out whether that really is the field name, whether this was a ChatGPT spellcheck, or something else entirely.
It depends... ssh-keygen -L displays the fields as Principals (which are set using the -n parameter) and internally a lot of the OpenSSH code talks about AuthorizedPrincipals...
> I am a simple sole, ... go back to the halcyon early days of the web before Netscape dropped the JS-bomb. You know HTML for the layout and CSS for the style.
I am not sure if this is intended as humor, but JavaScript came before CSS.
I remember when CSS Zen Garden was showcasing what you can do with CSS, and browsers (well, "browser", singular, as there was basically only IE 6 back then) supported JavaScript and VBScript.
It seems JavaScript was first released, just internally, in May 1995 in a pre-alpha version of Netscape 2.0. It would not be publicly announced until December 1995. Netscape 2.0 didn't even come out until March 1996, and even then it carried language version 1.0, which was extremely defective. The first version of the language that actually worked was JavaScript 1.1, which came out with Netscape 3.0 in August 1996. CSS, on the other hand, first premiered with IE3, which came out in August 1996.
The distinction either way is trivial, because at that time nobody was using either CSS or JavaScript as they required proprietary APIs. There was no DOM specification at that time.
JavaScript was created by Brendan Eich in just 10 days in May 1995 while he was working at Netscape Communications Corporation.
CSS (Cascading Style Sheets) was introduced later than JavaScript. The first CSS specification was published in December 1996 by Håkon Wium Lie and Bert Bos.
Early in the A-B craze (optimal shade of blue nonsense), I was talking to someone high up at an online hotel reservation company who was telling me how great A-B testing had been for them. I asked him how they chose the stopping point/sample size. He told me experiments continued until they observed a statistically significant difference between the two conditions.
The arithmetic is simple and cheap. Understanding basic intro stats principles, priceless.
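For anyone who hasn't seen the numbers: here is a quick simulation (made-up traffic, two identical variants) of how often "stop as soon as p < 0.05" declares a winner when there is nothing to find:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def stops_on_significance(n_max=20_000, peek_every=500, p=0.10, alpha=0.05):
        # two *identical* Bernoulli(p) variants; run a two-proportion z-test
        # at every peek and stop at the first p-value below alpha
        a = rng.binomial(1, p, n_max)
        b = rng.binomial(1, p, n_max)
        for n in range(peek_every, n_max + 1, peek_every):
            pa, pb = a[:n].mean(), b[:n].mean()
            pooled = (pa + pb) / 2
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and 2 * (1 - norm.cdf(abs(pa - pb) / se)) < alpha:
                return True          # declared a "winner" between identical variants
        return False

    runs = 1000
    false_wins = sum(stops_on_significance() for _ in range(runs))
    print(f"declared a winner in {false_wins / runs:.0%} of runs with no real difference")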
This is correct. There's been a lot of interest in e-values and non-parametric confidence sequences in recent literature. It's usually referred to as anytime-valid inference [1]. Evan Miller explored a similar idea in [2]. For some practical examples, see my Python library [3] implementing multinomial and time-inhomogeneous Bernoulli / Poisson process tests based on [4]. See [5] for linear models / t-tests.
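To give a flavor of how this differs from a fixed-horizon test, here is a minimal sketch (not the library in [3]; the uniform mixture and the 1/alpha threshold are just the textbook choices): a likelihood-ratio martingale for H0: p = 0.5 with a Beta(1,1) mixture over alternatives. By Ville's inequality, rejecting whenever it exceeds 1/alpha keeps the type-I error at alpha no matter how often you peek or when you stop.

    import numpy as np
    from scipy.special import betaln

    def mixture_e_process(xs, p0=0.5):
        # E-process for H0: X_i ~ Bernoulli(p0), with a uniform Beta(1,1)
        # mixture over alternatives. Under H0 it is a nonnegative martingale
        # with mean 1, so P(sup_n M_n >= 1/alpha) <= alpha at any stopping time.
        xs = np.asarray(xs)
        n = np.arange(1, len(xs) + 1)
        s = np.cumsum(xs)                                  # running successes
        log_num = betaln(s + 1, n - s + 1)                 # log of Int theta^s (1-theta)^(n-s) dtheta
        log_den = s * np.log(p0) + (n - s) * np.log(1 - p0)
        return np.exp(log_num - log_den)

    alpha = 0.05
    rng = np.random.default_rng(0)
    xs = rng.binomial(1, 0.6, size=2000)                   # true rate 0.6, H0 says 0.5
    m = mixture_e_process(xs)
    crossed = np.nonzero(m >= 1 / alpha)[0]
    print("rejected H0 at n =", crossed[0] + 1 if crossed.size else "never")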
Sounds like you already know this, but that's not great and will give a lot of false positives. In science this is called p-hacking. The rigorous way to use hypothesis testing is to calculate the sample size for the expected effect size and only run the test once that sample size is reached. But this requires knowing the effect size.
If you are doing a lot of significance tests you need to adjust the p-level, dividing by the number of implicit comparisons (a Bonferroni correction), so e.g. only accept p < 0.001 if you are running one test per day.
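For concreteness, a minimal sketch of that fixed-horizon calculation using the standard normal-approximation formula for two proportions (the 10% to 12% lift and the 50 tests are made-up numbers):

    from scipy.stats import norm

    def n_per_arm(p1, p2, alpha=0.05, power=0.80):
        # per-arm sample size for a two-sided two-proportion z-test
        # (standard normal-approximation formula)
        z_a = norm.ppf(1 - alpha / 2)
        z_b = norm.ppf(power)
        p_bar = (p1 + p2) / 2
        num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
               + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
        return num / (p1 - p2) ** 2

    # detect a lift from a 10% to a 12% conversion rate, single comparison
    print(round(n_per_arm(0.10, 0.12)))

    # Bonferroni: if you will run m such tests, spend alpha/m on each one
    m = 50
    print(round(n_per_arm(0.10, 0.12, alpha=0.05 / m)))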
Alternatively, just do Thompson sampling until one variant dominates.
To expand, the p value tells you significance (more precisely, the probability of seeing an effect at least that large if there were no underlying difference). But if you check it over and over again and act on the one reading you like, you've subverted the measure.
Thompson/multi-armed bandit optimizes for outcome over the duration of the test, by progressively altering the treatment %. The test runs longer, but yields better outcomes while doing it.
It's objectively a better way to optimize, unless there is time-based overhead to the existence of the A/B test itself. (E.g. maintaining two code paths.)
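A minimal sketch of what that looks like with Beta-Bernoulli Thompson sampling (toy conversion rates; the 99% posterior-probability stopping rule is one common choice, not the only one):

    import numpy as np

    rng = np.random.default_rng(1)
    true_rates = [0.10, 0.12]            # hypothetical conversion rates for A and B
    wins = np.zeros(2)
    losses = np.zeros(2)

    for step in range(50_000):
        # sample a plausible rate for each arm from its Beta posterior,
        # send this visitor to whichever arm drew the highest rate
        draws = rng.beta(wins + 1, losses + 1)
        arm = int(np.argmax(draws))
        converted = rng.random() < true_rates[arm]
        wins[arm] += converted
        losses[arm] += not converted

        # crude stopping check every 1000 visitors:
        # posterior probability that B beats A
        if step % 1000 == 999:
            sims = rng.beta(wins + 1, losses + 1, size=(10_000, 2))
            p_b_best = (sims[:, 1] > sims[:, 0]).mean()
            if p_b_best > 0.99 or p_b_best < 0.01:
                break

    print("traffic per arm:", wins + losses)
    print("P(B is the better variant) ~", round(p_b_best, 3))

Note how the worse arm ends up with far less traffic than a 50/50 split would have given it, which is the "better outcomes while the test runs" point above.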
The p value is the risk of getting an effect at least that large purely from sampling error, under the assumption of perfectly random sampling and no real effect. It says very little.
In particular, if you aren't doing perfectly random sampling it is meaningless. If you are concerned about other types of error than sampling error it is meaningless.
A significant p-value is nowhere near proof of effect. All it does is suggestively wiggle its eyebrows in the direction of further research.
Many years ago I was working for a large gaming company, and I was the one who developed a very efficient and cheap way to split any cluster of users into A/B groups. The company was extremely happy with how well that worked. However, I did some investigating on my own a year later to see how the business development people were using it and... yeah, pretty much what you said. They were literally brute-forcing different configurations until they (more or less) got the desired results.
Microsoft has a seed finder specifically aimed at avoiding a priori bias in experiment groups, but IMO the main effect is pushing whales (which are possibly bots) into different groups until the bias evens out.
I find it hard to imagine obtaining much bias from a random hash seed in a large group of small-scale users, but I haven't looked at the problem closely.
We definitely saw bias, and it made experiments hard to launch until the system started pre-identifying unbiased population samples ahead of time, so the experiment could just pull pre-vetted users.
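I have no idea how Microsoft's implementation works, but the idea as described can be sketched like this: hash user IDs under a candidate seed to assign arms, then keep trying seeds until a pre-experiment metric (a hypothetical "spend" column here) looks balanced:

    import hashlib
    import numpy as np

    def assign(user_id: str, seed: int) -> int:
        # deterministic 50/50 split: hash(seed, user_id) -> arm 0 or 1
        digest = hashlib.sha256(f"{seed}:{user_id}".encode()).digest()
        return digest[0] & 1

    def find_balanced_seed(user_ids, spend, n_seeds=1000, tol=0.01):
        # scan seeds until pre-experiment spend differs by < tol (relative)
        # between the two arms
        spend = np.asarray(spend)
        for seed in range(n_seeds):
            arms = np.array([assign(u, seed) for u in user_ids])
            if arms.all() or not arms.any():
                continue                     # degenerate split, try the next seed
            a = spend[arms == 0].mean()
            b = spend[arms == 1].mean()
            if abs(a - b) / max(a, b) < tol:
                return seed
        return None

    # toy data: a handful of heavy spenders ("whales") dominate the imbalance
    rng = np.random.default_rng(2)
    users = [f"user{i}" for i in range(5000)]
    spend = rng.lognormal(mean=1.0, sigma=2.0, size=5000)
    print("balanced seed:", find_balanced_seed(users, spend))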
And yet this is the default. As commonly implemented, A/B testing is an excellent way to look busy, and people will actively resist changing processes to make them more reliable.
I think this is not unrelated to the fact that if you wait long enough you can get a positive signal from a neutral intervention, so you can literally shuffle chairs on the Titanic and claim success. The incentives are against accuracy because nobody wants to be told that the feature they've just had the team building for 3 months had no effect whatsoever.
Surely this is more efficient if you do the statistics right? I mean, I'm sure they didn't, but the intuition that you can stop once there's sufficient evidence is correct.
Bear in mind many people aren’t doing the statistics right.
I’m not an expert but my understanding is that it’s doable if you’re calculating the correct MDE (minimum detectable effect) based on the observed sample size, though not ideal (because sometimes the observed sample is too small and there’s no way round that).
I suspect the problem comes when people don’t adjust the MDE properly for the smaller sample. Tools help but you’ve gotta know about them and use them ;)
Personally I’d prefer to avoid this and be a bit more strict due to something a PM once said: “If you torture the data long enough, it’ll show you what you want to see.”
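For what it's worth, here's the back-of-the-envelope version of that MDE adjustment for a two-proportion test (usual normal approximation, baseline variance for both arms; the 10% baseline and the sample sizes are invented):

    from scipy.stats import norm

    def mde(baseline, n_per_arm, alpha=0.05, power=0.80):
        # minimum detectable absolute lift for a two-sided two-proportion
        # z-test, using the normal approximation with the baseline variance
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        return z * (2 * baseline * (1 - baseline) / n_per_arm) ** 0.5

    # with a 10% baseline conversion rate: the smaller the sample you ended
    # up with, the larger the lift you can honestly claim to detect
    for n in (1_000, 10_000, 100_000):
        print(n, round(mde(0.10, n), 4))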
I thought you were joking. ... After a while, I started expecting a comma after each and every word.