Prompt sensitivity is the property that makes AI search hard to measure. The same buyer asks the same question in two slightly different phrasings, and the model returns two materially different lists of recommended products. In published measurements, semantically equivalent prompts have produced 7 to 19 percent differences in brand mentions, and synonym replacements alone have swung mention likelihood by up to 78 percent.
Why this matters for visibility
If you only test one phrasing of a buyer question, you are measuring one slice of a much larger surface. Your real visibility for the question is the average across the natural distribution of how buyers actually ask it. A serious AEO programme tests three to five phrasings of every priority question and aggregates them.
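Concretely, aggregation is a weighted average over the phrasing distribution. Here is a minimal sketch of that arithmetic; the phrasings, weights, and mention rates below are illustrative numbers, not measurements.

```python
# Minimal sketch of aggregating visibility across phrasings.
# All figures are illustrative, not real data.

# mention_rate: fraction of sampled answers that mentioned your brand.
# weight: estimated share of buyers who phrase the question this way.
phrasings = [
    {"text": "best crm for small teams",               "weight": 0.40, "mention_rate": 0.62},
    {"text": "what crm should a 5-person startup use", "weight": 0.25, "mention_rate": 0.31},
    {"text": "cheapest crm with email sync",           "weight": 0.20, "mention_rate": 0.12},
    {"text": "crm that integrates with slack",         "weight": 0.15, "mention_rate": 0.45},
]

# Real visibility is the weighted average over the phrasing
# distribution, not the score of the one phrasing you tested.
visibility = sum(p["weight"] * p["mention_rate"] for p in phrasings)
print(f"aggregate visibility: {visibility:.0%}")  # -> 42%
```

The single-phrasing trap is visible in the numbers: testing only the first phrasing would report 62 percent visibility when the buyer-weighted figure is 42 percent.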
What drives prompt sensitivity
Two things, mostly: the corpus the model was trained on (which phrasings it has seen most often), and which sources the answer engine retrieves for that specific phrasing. If ChatGPT pulls in a Reddit thread for a colloquial phrasing and a vendor comparison article for a formal one, you get two different answers. You cannot control the corpus, but you can influence which evidence wins the retrieval by being the page that is hardest to ignore on the question.
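To make the retrieval side concrete, here is a toy sketch in which a simple bag-of-words overlap score stands in for a real retrieval system. The corpus snippets and queries are invented for the example; the point is only that small phrasing shifts flip which document ranks first.

```python
# Toy illustration of phrasing-dependent retrieval, assuming a
# bag-of-words Jaccard overlap as a stand-in for real retrieval.
# Corpus snippets and queries are invented for this example.

def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

corpus = {
    "reddit_thread": "honestly what crm do you guys actually use day to day",
    "vendor_comparison": "a comparison of leading CRM platforms by feature and price",
}

colloquial = "what crm do you actually use"
formal = "comparison of CRM platforms by price"

for query in (colloquial, formal):
    best = max(corpus, key=lambda doc: overlap(query, corpus[doc]))
    print(f"{query!r} -> retrieves {best}")
# 'what crm do you actually use' -> retrieves reddit_thread
# 'comparison of CRM platforms by price' -> retrieves vendor_comparison
```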
How to test for it
Take a priority buyer question and write five phrasings: the literal one, the colloquial one, a price-led one, a feature-led one, and a workflow-led one. Run all five through your sampling pipeline. The gap between your highest and lowest mention rate across the five is your prompt sensitivity for that question. If it is wide, your visibility is fragile.
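A minimal sketch of that loop follows, assuming a hypothetical ask_engine function that wraps whatever answer engine your sampling pipeline queries. The brand name, phrasings, and sample count are placeholders; the stub fabricates answers so the script runs end to end, and real numbers come only from wiring in your actual pipeline.

```python
import random

BRAND = "ExampleCRM"   # hypothetical brand name, replace with yours
SAMPLES = 20           # repeated runs per phrasing, since answers vary

phrasings = [
    "What is the best CRM for small teams?",                              # literal
    "what crm do people actually rate for a small team",                  # colloquial
    "Cheapest CRM that still works for a small sales team?",              # price-led
    "Which CRM has the best pipeline reporting for small teams?",         # feature-led
    "We run outbound from spreadsheets, what CRM fits that workflow?",    # workflow-led
]

def ask_engine(prompt: str) -> str:
    """Stand-in for your real pipeline call. Replace with an actual
    API client; this stub just fabricates answers for the demo."""
    brands = ["ExampleCRM", "OtherCRM", "ThirdCRM"]
    return "Top picks: " + ", ".join(random.sample(brands, 2))

def mention_rate(prompt: str) -> float:
    """Fraction of sampled answers that mention the brand."""
    hits = sum(BRAND.lower() in ask_engine(prompt).lower() for _ in range(SAMPLES))
    return hits / SAMPLES

rates = {p: mention_rate(p) for p in phrasings}
sensitivity = max(rates.values()) - min(rates.values())

for prompt, rate in rates.items():
    print(f"{rate:.0%}  {prompt}")
print(f"prompt sensitivity (max - min): {sensitivity:.0%}")
```

The last line is the number to track per question: a wide max-minus-min spread means your visibility depends on which phrasing the buyer happens to use.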