Methodology

Wilson 90% confidence intervals, on every measurement

Aoraforge · Updated 2026-05-08

The problem with one-shot audits

Ask ChatGPT or Perplexity the same question twice — odds are you'll get a different answer. Independent research (SparkToro and Gumshoe, 2,961 prompt-tests, 2025) found AI search returns the same brand list less than 1% of the time. Rank order matched on roughly 1 in 1,000 runs. So if a vendor checks once and your brand isn't there, that's not a measurement — it's a coin flip.

A single audit observation cannot distinguish "you're invisible" from "the model rolled poorly that millisecond." Single-number citation rates without intervals are decoration, not measurement.

What Aoraforge does instead

For each (query, AI-platform) pair, we repeat the same poll n times, count the k polls in which the brand is cited, and report the citation rate k/n together with a Wilson 90% confidence interval (Wilson 1927; Brown, Cai & DasGupta 2001).
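The replication step can be sketched as a simple counting loop. This is a minimal illustration, not Aoraforge's pipeline: `count_citations` and `poll_once` are hypothetical names, and the scripted outcomes below stand in for real platform responses.

```python
from typing import Callable


def count_citations(poll_once: Callable[[], bool], n_polls: int) -> tuple[int, int]:
    """Run the same poll n_polls times and return (k cited, n polls)."""
    if n_polls <= 0:
        raise ValueError("n_polls must be positive")
    k = sum(1 for _ in range(n_polls) if poll_once())
    return k, n_polls


# Hypothetical stand-in for a real platform call: replays a fixed script
# of outcomes (True = brand cited in that response).
scripted = iter([True, True, False, True, True, True, True, True, True, True])
k, n = count_citations(lambda: next(scripted), n_polls=10)
print(f"{k}/{n} cited")  # 9/10 cited
```

The pair (k, n) is all the interval calculation needs; the individual responses do not have to be kept for the statistics.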

The Wilson score interval is a widely recommended choice for binomial proportions, particularly at small sample sizes (Brown, Cai & DasGupta 2001). It is used in fields such as:

Clinical epidemiology: disease-prevalence reporting
A/B testing: confidence intervals on conversion rates
Election forecasting: confidence on poll-of-polls

It's chosen over the naive Wald interval (p̂ ± z·√(p̂(1−p̂)/n)) because Wald is badly miscalibrated near p=0 and p=1 — exactly the regions an AI search citation report cares about (you're either reliably cited or reliably not).
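To make the boundary problem concrete, here is a small sketch that computes both intervals for a brand cited in 10 of 10 polls. The function names are illustrative, and z = 1.645 is the 90% value used on this page.

```python
import math

Z_90 = 1.645  # two-sided 90% critical value, as used on this page


def wald_interval(k: int, n: int, z: float = Z_90) -> tuple[float, float]:
    """Naive Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = k / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def wilson_interval(k: int, n: int, z: float = Z_90) -> tuple[float, float]:
    """Wilson score interval, clamped to [0, 1]."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 10 cited polls out of 10: the observed rate sits exactly on the boundary p = 1.
print(wald_interval(10, 10))      # (1.0, 1.0): zero width, claims certainty
low, high = wilson_interval(10, 10)
print(f"{low:.2f}, {high:.2f}")   # 0.79, 1.00
```

With only ten polls, Wald collapses to a single point and reports no uncertainty at all, while Wilson still allows a true rate as low as roughly 79%.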

The formula

Given k successes (cited polls) in n trials, with p̂ = k/n, the Wilson 90% CI (z = 1.645, two-sided) is:

```
center     = (p̂ + z²/(2n)) / (1 + z²/n)
half_width = z · √(p̂(1−p̂)/n + z²/(4n²)) / (1 + z²/n)

interval   = [max(0, center − half_width), min(1, center + half_width)]
```
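The formula translates almost line for line into code. The following is a minimal sketch, not a reference implementation: the function name `wilson_interval` and the input checks are assumptions added for illustration.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.645) -> tuple[float, float]:
    """Wilson score interval for k successes in n trials (z = 1.645 gives 90%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1], matching the max/min in the formula above.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(9, 10)
print(f"{low:.3f}, {high:.3f}")  # 0.652, 0.977
```

Note that the interval is centered on a value pulled toward 0.5 (about 0.815 here), not on the raw rate of 0.90, which is why it is asymmetric around the point estimate.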

How to read it

Every citation rate Aoraforge reports looks like this:

`` 9/10 = 90.0% [65%, 98%] ``

That's k = 9 cited polls out of n = 10, a point estimate of p̂ = 0.90, and a 90% CI of [0.65, 0.98]. We're 90% confident the true citation rate is between 65% and 98%.

A wide range is itself a useful finding — it tells you AI hasn't decided yet. That's your opening:

| Hits | Range | What it means for you |
|---|---|---|
| 9 of 10 | 90% (narrow, [65%–98%]) | Reliably cited. This slot is yours. Defend it. |
| 1 of 10 | 10% (narrow, [2%–35%]) | Reliably invisible. Someone else owns the slot. |
| 5 of 10 | 50% (wide, [27%–73%]) | Contested. AI hasn't picked a winner yet, so this is an open seat for the first business to publish a real answer. |

A narrow range says "this is locked in." A wide range says "this is still up for grabs." Both tell you what to do next.
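The three example ranges can be reproduced with a short script. This is a sketch under the page's stated z = 1.645; the helper names `wilson_90` and `report_line` are hypothetical and chosen only to mirror the report format shown above.

```python
import math


def wilson_90(k: int, n: int, z: float = 1.645) -> tuple[float, float]:
    """Wilson 90% interval for k cited polls out of n."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def report_line(k: int, n: int) -> str:
    """Format a citation rate as 'k/n = rate [low, high]'."""
    low, high = wilson_90(k, n)
    return f"{k}/{n} = {k / n:.1%} [{low:.0%}, {high:.0%}]"


for k in (9, 1, 5):
    print(report_line(k, 10))
# 9/10 = 90.0% [65%, 98%]
# 1/10 = 10.0% [2%, 35%]
# 5/10 = 50.0% [27%, 73%]
```

At the same n, the interval is widest when the observed rate is near 50%, which is why the contested row spans more percentage points than the other two.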

How to compare audits

When you're comparing AI search audits, ask one question: did they report a range, or a single number? If there's no range, the number is decoration — AI returns the same brand list less than 1% of the time when asked twice (SparkToro + Gumshoe, 2025), so any single-number report is a guess presented with decimal places.

Frequently asked

How many polls per query?

Each query is polled enough times to compute a Wilson 90% confidence interval. A wide CI means AI search itself is unstable on that question, which is useful information about which queries are still open territory.
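To see how the number of polls affects the width of the range, the sketch below holds the observed rate at 50% and varies n. The poll counts are illustrative values chosen for this example, not Aoraforge's actual settings, and `wilson_width` is a hypothetical helper.

```python
import math


def wilson_width(k: int, n: int, z: float = 1.645) -> float:
    """Width (high minus low) of the Wilson 90% interval for k of n."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    low = max(0.0, center - half_width)
    high = min(1.0, center + half_width)
    return high - low


# Illustrative poll counts; the observed citation rate is held at 50%.
widths = {n: wilson_width(n // 2, n) for n in (10, 40, 160, 640)}
for n, width in widths.items():
    print(f"n = {n:3d}: width = {width:.2f}")
```

Each fourfold increase in n roughly halves the width, so narrowing a contested range takes disproportionately more polls.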

---

Primary references:

Wilson, E.B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association 22(158): 209–212.
Brown, L.D., Cai, T.T., DasGupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science 16(2): 101–117.