To my point though, you mention that you noticed a "consistent" max of 2hr. Meaning the consistency was the interesting part. You probably consider this issue to be more important than some other hypothetical issue that caused, say, literally a single request to have a 2hr latency on Tuesday.
In other words, if you monitored over a period that was long enough to establish a statistically significant p99.9 or even p99.99 and then measured those percentiles, you still would have not only caught this issue, but also distinguished it from the other hypothetical issue. Monitoring your maximum and worrying about consistently high maximums might be a more effective mechanism, but in principle, you still care more about top percentiles.
To my point though, you mention that you noticed a "consistent" max of 2hr. Meaning the consistency was the interesting part. You probably consider this issue to be more important than some other hypothetical issue that caused, say, literally a single request to have a 2hr latency on Tuesday.
In other words, if you monitored over a period that was long enough to establish a statistically significant p99.9 or even p99.99 and then measured those percentiles, you still would have not only caught this issue, but also distinguished it from the other hypothetical issue. Monitoring your maximum and worrying about consistently high maximums might be a more effective mechanism, but in principle, you still care more about top percentiles.