Couldn't you only be measuring in aggregate, though? The overall signal could be positive while one segment is actually negative and another is positive enough to mask it.
Well, we adjusted the nudges in aggregate, but diced in various ways. Measurement was very much not in aggregate: we’d see positive and negative outcomes roll in over multiple years and wanted them per identifier (an individual). I’ve heard of companies generating a model per person, but we didn’t.
A silly amount of work, but honestly lots of value. Experimentation optimising for short-term goals (e.g. an upgrade) is such a bad version of this; it’s just all that is possible with most datasets.
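
To make the aggregate versus per-identifier point concrete, here is a toy sketch. Everything in it is made up for illustration: the identifiers, the segment names, and the +1/-1 outcome coding are assumptions, not taken from any real system.

```python
from collections import defaultdict

# Made-up outcomes for illustration: (identifier, segment, outcome),
# where outcome is +1 for a good result and -1 for a bad one.
outcomes = [
    ("a", "new", +1), ("a", "new", +1), ("a", "new", +1),
    ("b", "new", +1), ("b", "new", +1),
    ("c", "long_term", -1), ("c", "long_term", -1),
    ("d", "long_term", +1),
]

def mean_by(rows, key_index):
    """Average the outcome column, grouped by the column at key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# One number for everything: looks healthy.
overall = sum(outcome for _, _, outcome in outcomes) / len(outcomes)

# The same data diced by segment and by individual identifier.
per_segment = mean_by(outcomes, 1)
per_identifier = mean_by(outcomes, 0)

print("overall:", overall)
print("per segment:", per_segment)
print("per identifier:", per_identifier)
```

The overall average comes out positive, but the hypothetical `long_term` segment is net negative and identifier `c` is negative on every outcome, which is exactly what the aggregate number hides.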