Learn why strength better fits adaptive, machine-learning–based optimizations.
Unlike test optimizations, which rely on statistical significance (stat. sig.) to measure confidence in results, AI Optimize uses a metric called strength. This shift is more than a surface-level difference — it reflects how adaptive optimizations work and how we can evaluate results more accurately in a dynamic environment.
Why AI Optimize doesn’t use statistical significance
Statistical significance is a reliable metric in traditional A/B testing, where traffic is split evenly and experiences are tightly controlled. It’s designed to answer the question: “How confident are we that this result isn’t just random?”
But AI Optimize doesn’t run in a fixed environment — it adapts continuously to visitor behavior and new variations. Because of this, the conditions used to calculate stat. sig. are constantly shifting. The result? Stat. sig. can appear unstable or misleading in adaptive experiences — for example, showing 99% one week, then dropping to 65% the next, not because something broke, but because the visitor data shifted.
To give you clearer, more consistent insights, AI Optimize uses strength: the probability that a variation will outperform No Change based on historical performance. While strength can still fluctuate as new data comes in, it’s more intuitive, more interpretable, and more robust in real-world adaptive testing.
How test optimizations measure confidence
In traditional test optimizations, statistical significance is calculated using a method called sequential hypothesis testing. This approach compares the average performance of a treatment group to a control group, factoring in variability and how long the test has been running.
The goal is to determine whether the difference between groups is large enough, and consistent enough, to be confident that the result isn't just noise. When a result reaches statistical significance at the 95% confidence level, the chance of seeing a difference that large purely by random variation (if there were no real difference) is below 5%.
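As a loose illustration (not the engine's actual implementation), the core comparison behind this kind of test is a two-proportion z-test on conversion rates. The Python sketch below uses made-up counts and a hypothetical helper, `p_value_two_proportions`; real sequential testing adds machinery on top of this to keep the false-positive rate controlled while results are checked repeatedly.

```python
import math

def p_value_two_proportions(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(rate_a - rate_b) / std_err
    return math.erfc(z / math.sqrt(2))  # probability of a gap this large by chance alone

# Illustrative counts: control converts 110 of 1,000 views, treatment 150 of 1,000.
p = p_value_two_proportions(110, 1_000, 150, 1_000)
print(f"p = {p:.4f}  ->  significant at 95%? {p < 0.05}")
```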
How AI Optimize calculates strength
AI Optimize uses a different approach: it doesn’t assume your variation and control are equal at the start. Instead, it begins with no assumptions at all — this is called an uninformed (or uninformative) prior. Every variation starts with a 50% probability of beating No Change.
As data is collected, this prior is updated into what’s called a posterior, representing our best estimate of how likely the variation is to perform better than No Change. This is calculated continuously using Bayesian inference — a method that updates its confidence as new information comes in.
The result is the strength score: the probability that a variation will outperform No Change, based on how it’s been performing so far.
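AI Optimize's exact model isn't spelled out here, but a common way to implement this kind of Bayesian update is a Beta-Bernoulli model: start each conversion rate at a flat Beta(1, 1) prior, update it with observed views and conversions, and estimate the probability that the variation's true rate beats No Change by sampling from the posteriors. The sketch below, with a hypothetical `strength` helper and illustrative counts, shows the idea.

```python
import random

def strength(conv_var: int, views_var: int, conv_base: int, views_base: int,
             draws: int = 100_000) -> float:
    """Estimate P(variation's true CVR > No Change's true CVR).

    Beta(1, 1) is a flat, uninformative prior: with no data at all, the
    estimate comes out at roughly 50%. Each view and conversion updates
    the prior into a posterior over the true conversion rate.
    """
    wins = 0
    for _ in range(draws):
        # Draw one plausible "true" conversion rate from each posterior.
        theta_var = random.betavariate(1 + conv_var, 1 + views_var - conv_var)
        theta_base = random.betavariate(1 + conv_base, 1 + views_base - conv_base)
        wins += theta_var > theta_base
    return wins / draws

print(strength(0, 0, 0, 0))              # no data yet: roughly 0.50
print(strength(130, 1_000, 110, 1_000))  # variation ahead so far: roughly 0.9
```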
Why dynamic traffic makes this harder
AI Optimize dynamically allocates traffic to variations that perform better. This improves conversions, but it also complicates analysis. Unlike tests with fixed traffic splits, not all variations get equal exposure. This can produce what's known as Simpson's paradox, where the overall result seems to say one thing but a breakdown by time tells a different story.
For example, here’s what can happen with uneven traffic distribution:
| Month | Variation 1 | Variation 2 |
|---|---|---|
| Month 1 (1,000 views) | 10% CVR with 10% of traffic | 10% CVR with 90% of traffic |
| Month 2 (1,000 views) | 50% CVR with 90% of traffic | 50% CVR with 10% of traffic |
Overall, Variation 1 looks like it's performing far better: pooled across both months, its CVR works out to 46% versus 14% for Variation 2. But in reality, both variations performed identically within each month; they just received most of their traffic at different times. That's Simpson's paradox in action.
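To make those numbers concrete, here's a short Python sketch (illustrative only) that works through the views and conversions implied by the table above:

```python
# (views, conversions) per month, as implied by the table above.
data = {
    "Variation 1": {"Month 1": (100, 10), "Month 2": (900, 450)},
    "Variation 2": {"Month 1": (900, 90), "Month 2": (100, 50)},
}

for variation, months in data.items():
    per_month = {m: f"{convs / views:.0%}" for m, (views, convs) in months.items()}
    pooled = sum(c for _, c in months.values()) / sum(v for v, _ in months.values())
    print(f"{variation}: per month {per_month}, pooled CVR {pooled:.0%}")

# Variation 1: per month {'Month 1': '10%', 'Month 2': '50%'}, pooled CVR 46%
# Variation 2: per month {'Month 1': '10%', 'Month 2': '50%'}, pooled CVR 14%
```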
How AI Optimize solves for this
To avoid misleading conclusions like the example above, AI Optimize uses a technique called stratified statistics. Instead of comparing variations as a single average, it breaks performance into smaller groups — or strata — based on time or other attributes.
Each stratum is analyzed individually, and the results are weighted based on how much traffic that group received. This method ensures that high-traffic days have more influence than low-traffic days, avoiding the over-smoothing and distortion that come from averaging everything together.
This is what allows strength to provide a clearer, fairer comparison between variations, even in complex, adaptive experiments.
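The exact weighting scheme isn't detailed here, but continuing the example above, a simple stratified comparison treats each month as a stratum, computes each variation's CVR within that stratum, and weights the strata by their share of total traffic:

```python
# Same (views, conversions) per month as in the Simpson's paradox example.
data = {
    "Variation 1": {"Month 1": (100, 10), "Month 2": (900, 450)},
    "Variation 2": {"Month 1": (900, 90), "Month 2": (100, 50)},
}

# Each month saw 1,000 of the 2,000 total views, so both strata get weight 0.5.
stratum_weights = {"Month 1": 1_000 / 2_000, "Month 2": 1_000 / 2_000}

for variation, months in data.items():
    stratified_cvr = sum(
        stratum_weights[month] * (convs / views)
        for month, (views, convs) in months.items()
    )
    print(f"{variation}: stratified CVR {stratified_cvr:.0%}")

# Both variations come out at 30%: within each stratum they performed identically,
# so the stratified comparison correctly reports no difference between them.
```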