INSIGHTS

Keller Maloney
Unusual - Founder
Apr 20, 2026

Share of voice and adjacent AI brand tracking metrics are built by running a fixed prompt set through a model and counting what comes back. In an experiment with 100 buyer-style CRM prompts, swapping a single meaning-preserving word per prompt shifted a brand's share of voice by as much as 16.7% in relative terms and changed roughly a third of the vendors recommended in a typical answer. Only 16 of the 100 prompt pairs produced identical vendor sets across the two wordings. The implication is that prompt-set tracking metrics largely describe the specific prompts a vendor chose, not the underlying model behavior they claim to measure.
AEO and GEO tools have organized around prompt-set tracking, and brands are making real budget decisions off the resulting numbers. Whether those metrics are stable enough to support those decisions is an empirical question. This experiment was designed to test it on the best-known of the metrics: share of voice.
Methodology
I hand-wrote 100 buyer-style prompts about CRMs, covering a spread of company sizes, use cases, and industries. "What's the best CRM for an early-stage startup?" "Which CRM is easiest to use for a 50-person sales team?" "Recommend a CRM with strong email automation." Then I rewrote each one with a single meaning-preserving synonym swap. "Best" for "top". "Affordable" for "inexpensive". "Tracking" for "observability". The intent was to hold the meaning of each prompt constant while changing the exact wording.
I ran both versions through GPT-5.4 with default settings, three times per wording, for 600 total API calls (100 prompts × 2 wordings × 3 replicates). The only variable between the two runs was the single swapped word.
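For anyone who wants to rerun the setup, the collection step is small enough to sketch. The snippet below assumes the OpenAI Python SDK; the prompt pair and the model identifier string are illustrative placeholders rather than the exact data and configuration behind this run.

```python
# A minimal sketch of the collection step, assuming the OpenAI Python SDK.
# The prompt pair shown and the "gpt-5.4" model string are illustrative
# placeholders, not the exact data or API identifier behind this run.
from openai import OpenAI

client = OpenAI()

# Each pair holds the original wording and its single-synonym variant.
PROMPT_PAIRS = [
    ("What's the best CRM for an early-stage startup?",
     "What's the top CRM for an early-stage startup?"),
    # ... 99 more hand-written pairs
]

REPLICATES = 3  # three responses per wording -> 100 x 2 x 3 = 600 calls

def collect(prompt: str) -> list[str]:
    """Collect REPLICATES responses for one prompt at default settings."""
    responses = []
    for _ in range(REPLICATES):
        resp = client.chat.completions.create(
            model="gpt-5.4",  # placeholder model identifier
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(resp.choices[0].message.content)
    return responses

results = [
    {"original": collect(original), "variant": collect(variant)}
    for original, variant in PROMPT_PAIRS
]
```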
Aggregate share of voice moved by up to 17% in relative terms
Across the 13 CRM vendors with an established baseline (mentioned in at least 5% of original-version responses), share of voice shifted by an average of 6.1% in relative terms between the two versions. Five of them shifted by 10% or more. Copper moved 16.7%, climbing from a 10.0% share of voice to 11.7%, purely from the synonym swap.
| Vendor | SOV (original) | SOV (variant) | Δ (relative) |
|---|---|---|---|
| Copper | 10.0% | 11.7% | +16.7% |
| Microsoft Dynamics | 25.3% | 28.3% | +11.8% |
| Apollo | 5.7% | 5.0% | −11.8% |
| ActiveCampaign | 6.0% | 5.3% | −11.1% |
| Insightly | 16.3% | 17.3% | +6.1% |
| Freshsales | 39.3% | 37.0% | −5.9% |
| Freshworks | 32.7% | 31.0% | −5.1% |
The three brands that barely moved were the top three: HubSpot, Salesforce, and Zoho, each mentioned in more than 78% of responses across both versions. Below that line, the numbers moved.
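The tabulation behind the table is also small. The sketch below assumes each response has already been reduced to the set of vendors it mentions (the detection step is sketched under Limitations) and reads share of voice as the fraction of responses that mention a vendor, which is the reading consistent with the mention rates quoted above.

```python
# Tabulation sketch. Assumes each response has already been reduced to the
# set of vendors it names (see the detection sketch under Limitations), and
# reads SOV as "fraction of responses mentioning the vendor at least once".
from collections import Counter

def share_of_voice(vendor_sets: list[set[str]]) -> dict[str, float]:
    """Per-vendor fraction of responses mentioning that vendor."""
    counts = Counter(brand for vendors in vendor_sets for brand in vendors)
    return {brand: n / len(vendor_sets) for brand, n in counts.items()}

def relative_delta(original: float, variant: float) -> float:
    """The delta column above: relative change, not percentage points."""
    return (variant - original) / original * 100

# A move from a 10.0% share to 11.7% is 1.7 points but roughly a 17% relative
# shift, which is the sense in which the swings above are reported.
```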
Two of the five vendors in a typical answer were different
Aggregate SOV smooths out a lot of per-answer churn. A more honest picture is what happens at the level of a single response.
For each of the 100 prompt pairs, I compared the three original-wording responses against the three variant-wording responses, generating 900 apples-to-apples comparisons of what a single buyer would see. A typical answer recommended about five vendors; on average, nearly two of them changed identity between the two wordings.
That's a 33% turnover of the vendor set, caused by swapping one word the buyer didn't think was meaningful.
Only 16 of the 100 prompt pairs produced identical vendor sets across the two wordings. The other 84 introduced or dropped at least one vendor. 35 of them introduced or dropped three or more.
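For readers replicating the per-answer comparison, one reasonable way to score the churn looks like the sketch below; the symmetric-difference-over-union turnover shown here is an illustrative choice, not necessarily the exact formula behind the numbers above.

```python
# Per-answer comparison sketch: each prompt pair yields three original-wording
# and three variant-wording vendor sets, giving 3 x 3 = 9 cross-wording
# comparisons per pair and 900 overall.
def turnover(a: set[str], b: set[str]) -> float:
    """Fraction of the combined vendor set that appears under only one wording."""
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0

def compare_pair(original_sets: list[set[str]],
                 variant_sets: list[set[str]]) -> tuple[float, bool]:
    """Mean cross-wording turnover, plus whether the pooled vendor sets match."""
    scores = [turnover(a, b) for a in original_sets for b in variant_sets]
    identical = set().union(*original_sets) == set().union(*variant_sets)
    return sum(scores) / len(scores), identical

# Example: {"HubSpot", "Zoho", "Copper"} vs {"HubSpot", "Zoho", "Insightly"}
# scores 2/4 = 0.5: half of the combined vendor set changed identity.
```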
A few of the biggest swings
The swings were specific and reproducible. Three representative pairs:
"What CRM has the best workflow automation?" vs. "…the strongest workflow automation?". Freshsales, Freshworks, Insightly, and Keap appeared only in the "best" version. Microsoft Dynamics appeared only in the "strongest" version.
"Best CRM for scaling a sales team?" vs. "Top CRM for scaling a sales team?". Swapping "best" for "top" pulled in Copper, Microsoft Dynamics, Monday.com, and Outreach, and dropped Insightly.
"Which CRM is best for high-velocity sales?" vs. "…for high-volume sales?". Apollo, Close, and Outreach showed up only for "high-velocity." Insightly and Zoho only for "high-volume."
By design, each prompt pair held meaning constant. The model's responses did not.
What this actually shows
The dashboard-level SOV number on a tracking tool looks stable because averaging 100 prompts together hides the per-prompt noise. For any buyer actually typing a query, that noise is the experience. Shifting a single word shifts the answer.
The underlying system isn't stable enough for a small, fixed prompt set to serve as a reliable snapshot of it. The movement between two semantically equivalent prompts is on the same order as the movement a brand is typically trying to manufacture through months of PR and content work. SOV deltas attributed to brand investment may therefore be of the same magnitude as the variance a buyer introduces by rephrasing the same question.
What's worth measuring
The experiment pins down what prompt tracking can't do. It also points at what a more useful measurement looks like.
The part of the system that is stable across rewordings is the model's underlying picture of the market. The patterns in what it associates with each brand, the contexts in which it recommends one vendor over another, the qualities it consistently attaches to each player: none of that swings on a synonym. Those beliefs move when something changes in the evidence the model is reading, which is exactly the layer a brand can actually influence.
The measurement worth building a strategy on is a map of how the model thinks about you: what it believes, what it emphasizes, where it's right, where it's wrong, and how those beliefs change as you change the buyer context around the question. That's the layer underneath share of voice, and the layer worth tracking over time.
Limitations
Single model, single industry. The structural argument about sampling-driven fragility doesn't depend on which model is probed, but cross-model and cross-industry runs would strengthen the claim. Brand detection is a substring match against a hand-curated alias list, which catches explicit mentions cleanly but misses oblique references. Temperature contributes variance of its own, but the three-replicate design averages much of that out on each side, so the original-vs-variant comparison mostly reflects the effect of the word change rather than sampling noise.
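Concretely, the detection step amounts to something like the following sketch; the alias entries are illustrative rather than the full curated list.

```python
# Brand detection sketch: substring match against a hand-curated alias list.
# The alias entries here are illustrative, not the full curated list.
ALIASES: dict[str, list[str]] = {
    "HubSpot": ["hubspot"],
    "Microsoft Dynamics": ["microsoft dynamics", "dynamics 365"],
    "Copper": ["copper"],
    # ... one entry per tracked vendor
}

def detect_brands(response: str) -> set[str]:
    """Vendors explicitly named in a response; oblique references are missed."""
    text = response.lower()
    return {brand for brand, aliases in ALIASES.items()
            if any(alias in text for alias in aliases)}

# Caught: "HubSpot and Copper both fit a team that size."
# Missed: "Microsoft's CRM suite" when no product name is spelled out.
```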

