As a personal anecdote, I had a fairly involved application that built up a context with a lot of custom prompting and produced a ~1000-word output. I could run the application over and over to inspect the results, and they were fairly reproducible.

I was getting really nice results from the o4-mini model with high thinking. A little while after GPT-5 came out, I revisited the application and tried to pick up where I had left off. The o4-mini results were unusable, while the GPT-5 results were similar to what I had before. I'm not sure what happened to the model in the ~4-5 months I had set the project aside, but there was real degradation.