
The Problem With Prompt Tracking

Keller Maloney

Unusual - Founder

Apr 20, 2026

Most AI brand tracking metrics (share of voice, mention rate, citation rate, brand rankings) rest on the same foundation: pick a set of prompts, run them through a model, and count what comes back. Brands are building strategies on top of these metrics, so it's worth examining how durable that foundation actually is.

So I ran an experiment on share of voice, the best-known of these metrics and the one much of the industry has organized around.

I hand-wrote 100 buyer-style prompts about CRMs, covering a spread of company sizes, use cases, and industries. "What's the best CRM for an early-stage startup?" "Which CRM is easiest to use for a 50-person sales team?" "Recommend a CRM with strong email automation." Then I rewrote each one with a single meaning-preserving synonym swap: "best" for "top," "affordable" for "inexpensive," "tracking" for "observability." The idea was to hold the meaning of each prompt constant while changing only the exact wording.

I ran both versions through GPT-5.4 with default settings, three times per wording, for 600 total API calls (100 prompts × 2 wordings × 3 replicates). The only variable between the two runs was the single swapped word.
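For readers who want to picture the harness, here is a minimal sketch of the collection loop. It assumes the OpenAI Python SDK and the model name as given above; the prompt pairing, variable names, and structure are illustrative, not a copy of the actual sov.py.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative pairs: the variant differs from the original by one swapped word.
PROMPT_PAIRS = [
    ("What's the best CRM for an early-stage startup?",
     "What's the top CRM for an early-stage startup?"),
    # ... 99 more hand-written pairs
]

REPLICATES = 3  # three samples per wording -> 100 x 2 x 3 = 600 calls


def collect(model: str = "gpt-5.4") -> dict:
    """Sample every wording REPLICATES times and return the raw answer text."""
    answers = {"original": [], "variant": []}
    for original, variant in PROMPT_PAIRS:
        for label, prompt in (("original", original), ("variant", variant)):
            for _ in range(REPLICATES):
                reply = client.chat.completions.create(
                    model=model,  # default settings, per the experiment
                    messages=[{"role": "user", "content": prompt}],
                )
                answers[label].append((prompt, reply.choices[0].message.content))
    return answers
```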

A brand's share of voice moved by as much as 17% from a single word swap

The headline metric every AEO/GEO vendor sells is share of voice, the percent of AI answers that mention your brand across a fixed prompt set. A few percentage points of movement in that number is the kind of thing that triggers a PR push or a content investment.

In this experiment, the 13 CRM vendors with an established baseline (mentioned in at least 5% of original-version responses) shifted their share of voice by an average of 6.1% between the two versions. Four of them shifted by 10% or more. Copper moved 16.7%, from a 10.0% share of voice to 11.7%, purely from the synonym swap.

| Vendor | SOV (original) | SOV (variant) | Δ % |
| --- | --- | --- | --- |
| Copper | 10.0% | 11.7% | +16.7% |
| Microsoft Dynamics | 25.3% | 28.3% | +11.8% |
| Apollo | 5.7% | 5.0% | −11.8% |
| ActiveCampaign | 6.0% | 5.3% | −11.1% |
| Insightly | 16.3% | 17.3% | +6.1% |
| Freshsales | 39.3% | 37.0% | −5.9% |
| Freshworks | 32.7% | 31.0% | −5.1% |

The only brands that barely moved were the entrenched top three: HubSpot, Salesforce, and Zoho, each appearing in more than 78% of responses under both wordings. Everything below that tier was noisy.
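For concreteness, here is roughly how figures like the ones in the table fall out of the raw answers. Brand detection is the simple substring-over-aliases approach described in the caveats below; the alias list and helper names here are illustrative, not the repo's actual code.

```python
# Illustrative, hand-curated alias list (the real one is longer).
ALIASES = {
    "Copper": ["copper"],
    "Microsoft Dynamics": ["microsoft dynamics", "dynamics 365"],
    "HubSpot": ["hubspot"],
}


def mentions(answer: str) -> set:
    """Brands whose aliases appear verbatim (case-insensitive) in one answer."""
    text = answer.lower()
    return {brand for brand, aliases in ALIASES.items()
            if any(alias in text for alias in aliases)}


def share_of_voice(answers: list) -> dict:
    """SOV = fraction of answers that mention the brand at least once."""
    counts = {brand: 0 for brand in ALIASES}
    for answer in answers:
        for brand in mentions(answer):
            counts[brand] += 1
    return {brand: n / len(answers) for brand, n in counts.items()}


def relative_delta(sov_original: dict, sov_variant: dict, brand: str) -> float:
    """Shift between wordings as a fraction of the original SOV (the table's Δ %)."""
    return (sov_variant[brand] - sov_original[brand]) / sov_original[brand]
```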

Two of the five vendors in a typical answer were different

Aggregate SOV smooths out a lot of per-answer churn. A more honest picture is what happens at the level of a single response.

For each of the 100 prompt pairs, I compared the three original-wording responses against the three variant-wording responses, generating 900 apples-to-apples comparisons of what a single buyer would see. A typical answer recommended about 5 vendors. About 2 of them changed identity between the two wordings.

That's a 33% turnover in the vendor set, caused by changing a single word the buyer wouldn't consider meaningful.

Only 16 of the 100 prompt pairs produced identical vendor sets across the two wordings. The other 84 introduced or dropped at least one vendor. 35 of them introduced or dropped three or more.
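One plausible way to score that per-answer churn, reusing the `mentions` helper from the sketch above, is to pair each original-wording answer with each variant-wording answer for the same prompt and count the vendors that appear on one side but not the other (100 pairs × 3 × 3 = 900 comparisons). The actual analysis script may score it differently; this is just the shape of the calculation.

```python
from itertools import product


def per_pair_churn(original_answers: list, variant_answers: list) -> tuple:
    """Compare every original answer to every variant answer for one prompt pair.

    Returns (avg vendors per answer, avg vendors that differ per comparison).
    """
    sizes, changed, comparisons = [], 0.0, 0
    for a, b in product(original_answers, variant_answers):  # 3 x 3 = 9 per pair
        set_a, set_b = mentions(a), mentions(b)
        sizes.append((len(set_a) + len(set_b)) / 2)
        # Symmetric difference counts vendors added plus dropped; halving it
        # approximates how many "slots" in the recommendation changed identity.
        changed += len(set_a ^ set_b) / 2
        comparisons += 1
    return sum(sizes) / comparisons, changed / comparisons
```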

A few of the biggest swings

The swings were specific and reproducible. Three representative pairs:

  • "What CRM has the best workflow automation?" vs. "…the strongest workflow automation?". Freshsales, Freshworks, Insightly, and Keap appeared only in the "best" version. Microsoft Dynamics appeared only in the "strongest" version.

  • "Which CRM is best for high-velocity sales?" vs. "…for high-volume sales?". Apollo, Close, and Outreach showed up only for "high-velocity." Insightly and Zoho only for "high-volume."

  • "Best CRM for scaling a sales team?" vs. "Top CRM for scaling a sales team?". Swapping "best" for "top" pulled in Copper, Microsoft Dynamics, Monday.com, and Outreach, and dropped Insightly.

Within each pair, the two prompts mean the same thing. A buyer typing either version wouldn't notice a difference between them. The model does.

What this actually shows

The dashboard-level SOV number on a tracking tool looks stable because averaging 100 prompts together hides the per-prompt noise. For any buyer actually typing a query, that noise is the experience. Shifting a single word shifts the answer.

The underlying system is not stable enough for a small, fixed prompt set to serve as a reliable snapshot of it. The movement between two semantically equivalent prompts is on the same order as the movement a brand is trying to manufacture through months of PR and content work. I suspect that when the brands currently investing in SOV lift do the honest math, they'll find that most of the movement they're paying to create is the same size as the movement a buyer generates for free by rephrasing their question.

Reproducing this

The prompts, code, raw responses, and analysis live in the `research/sov-fragility` directory of the Unusual growth repo. A full run takes about 8 minutes with 8 workers and costs roughly $6 in API calls. Results reproduce with `python sov.py`.

Caveats

  • Single model, single industry. The structural argument about sampling-driven fragility doesn't depend on which model is probed, but cross-model and cross-industry runs would strengthen the claim.

  • Brand detection is a substring match against a hand-curated alias list, which catches explicit mentions cleanly but misses oblique references (illustrated below).

  • Temperature contributes variance of its own, but the three-replicate design averages that out on each side, so the original-vs-variant comparison isolates the effect of the word change.
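To make the detection caveat concrete, the substring approach sketched earlier behaves roughly like this (again using the illustrative alias list, not the repo's real one):

```python
# Explicit mention: caught cleanly by the alias substring match.
mentions("HubSpot and Copper are both solid picks for small teams.")
# -> {"HubSpot", "Copper"}

# Oblique reference: missed, because no alias string appears verbatim.
mentions("Microsoft's CRM offering is a common enterprise default.")
# -> set()  (Microsoft Dynamics is the intended brand, but no alias matches)
```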