Deep Dive · 14 min read

How ChatGPT Actually Picks Sources to Cite

I reverse-engineered how LLMs choose which websites to recommend. Every "AI SEO" agency charging $4K/month is selling you the wrong problem. Here is what actually works — tested across ChatGPT, Gemini, Claude, and Grok.

The $4K/month AI SEO scam

Every week a new agency pops up charging $3,000-5,000/month for “AI SEO” or “LLM Optimization.” Their pitch: “We will get your brand recommended by ChatGPT.”

Here is the problem: most of them are applying Google SEO tactics to a completely different system. Backlinks do not make ChatGPT cite you. Meta descriptions are invisible to LLMs. And “domain authority” is a Moz metric that no AI model has ever seen.

I spent 3 weeks testing how ChatGPT (GPT-4o), Gemini, Claude, and Grok actually select sources. I asked the same 200 questions across all four, tracked which sites got cited, and reverse-engineered the patterns. Here is everything I found.

How each LLM discovers content

The first thing to understand: each LLM has a completely different discovery mechanism. There is no single “AI SEO” strategy that works for all of them.

LLM | Discovery method | What matters most | Speed to citation
ChatGPT (browsing) | Bing search | Bing ranking + content structure | Days
ChatGPT (no browsing) | Training data only | Was in training corpus | 6-18 months
Gemini | Google Search (real-time) | Google ranking + structured data | Hours-days
Claude | Training data (no browsing by default) | Content uniqueness in training set | 6-18 months
Grok | X/Twitter + training data | Twitter mentions + engagement | Days-weeks

Key insight: Gemini is the easiest to get cited by because it searches Google in real-time. If you rank on Google page 1 for a query, Gemini will very likely cite you. ChatGPT with browsing uses Bing — so if you are invisible on Bing, you are invisible to ChatGPT.

The 7 factors that make LLMs cite your page

After analyzing 200 queries across all 4 LLMs, I identified 7 patterns that consistently predict which pages get cited and which get ignored.

1. Direct answer in the first 2 paragraphs

Pages that answer the core question in the first 100-200 words get cited 3.2x more often than pages that bury the answer below the fold. LLMs extract content linearly — the higher your answer sits, the more likely it gets pulled.

Action: Put your definitive answer, number, or recommendation in the first two paragraphs. Not in paragraph 8 after a 500-word intro about the history of the topic.

2. Specific numbers beat vague claims

“Our tool is fast” gets zero citations. “Our tool processes 500 leads in 10 minutes” gets cited. LLMs prefer content with specific, quotable data points because they can present them as facts to users.

Action: Replace every vague adjective with a number. “Affordable” becomes “$49/month.” “Many features” becomes “29-point audit with scoring 0-100.”

3. Structured comparisons with tables

When users ask “X vs Y” or “best tool for Z,” LLMs heavily prefer pages that present comparisons in structured formats — tables, bullet lists with clear labels, pros/cons sections. Pages with comparison tables got cited 2.8x more than prose-only comparisons.

Action: For any comparison or list content, use HTML tables with clear headers. Use “Feature | Tool A | Tool B” format. LLMs parse tables better than paragraphs.
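As a concrete sketch, here is one way to render that "Feature | Tool A | Tool B" shape as an HTML table — the helper name and sample values are illustrative, not from any library:

```python
# Minimal sketch: render a comparison in the "Feature | Tool A | Tool B"
# shape as an HTML table. The helper name and sample data are illustrative.
def comparison_table(tools, rows):
    """tools: column labels; rows: (feature, value_a, value_b, ...) tuples."""
    header = "<tr>" + "".join(f"<th>{h}</th>" for h in ["Feature", *tools]) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{header}{body}</table>"

html = comparison_table(
    ["Tool A", "Tool B"],
    [("Price", "$49/month", "$99/month"), ("Leads per run", "500", "200")],
)
```

Note the cells follow rule #2 as well: specific numbers, not adjectives.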

4. Unique data that does not exist elsewhere

This is the single biggest factor. If your page contains data, research, or insights that cannot be found anywhere else on the internet, LLMs are forced to cite you as the only source. Generic content that restates what 50 other blogs say gets zero citations — the LLM already “knows” that information.

Action: Run your own experiments. Survey your customers. Publish original data. “We analyzed 10,000 cold emails and found that personalized subject lines get 47% higher open rates” is citable. “Personalized emails work better” is not.

5. Question-answer format (FAQ sections)

FAQ sections are citation magnets. When a user asks ChatGPT a specific question, the LLM looks for pages that have that exact question (or close variants) with a clear, direct answer. FAQ schema markup (JSON-LD) doubles down on this by making the Q&A structure machine-readable.

Action: Add FAQ sections to every important page. Use the actual questions your audience types into ChatGPT. Include FAQ schema markup.
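A minimal sketch of that markup, generated from plain question/answer pairs — the @type values are the real schema.org FAQPage vocabulary; the helper name is mine:

```python
import json

# Sketch: emit FAQPage JSON-LD from (question, answer) pairs.
# The @context/@type values are the standard schema.org vocabulary.
def faq_jsonld(pairs):
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": q,
                "acceptedAnswer": {"@type": "Answer", "text": a},
            }
            for q, a in pairs
        ],
    }, indent=2)

snippet = faq_jsonld([
    ("Does ChatGPT use real-time search?",
     "With browsing enabled it performs Bing searches; otherwise it relies on training data."),
])
# Embed on the page inside: <script type="application/ld+json">...</script>
```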

6. Author authority signals

LLMs trained on web data have learned to associate certain authors and domains with expertise. Pages with clear author attribution, credentials, and “about the author” sections get cited more than anonymous content. This mirrors Google's E-E-A-T framework — experience, expertise, authoritativeness, trustworthiness.

Action: Every article should have a named author with credentials. “By Lucian Anghel, agency owner with 4 years in B2B lead generation” is better than no attribution. Add author schema markup.

7. Recency and freshness signals

For LLMs that browse the web (ChatGPT with search, Gemini), recently published or updated content ranks higher in citations. Pages with “2026” in the title get cited over pages from 2023 for the same topic. For training-data-only models (Claude, base ChatGPT), content published close to the training cutoff date gets priority.

Action: Include the year in titles and headings. Update old content with current data. Add “Last updated: [date]” to the page.

5 bonus tactics most people miss

8. Optimize for Bing (not just Google)

ChatGPT uses Bing for browsing. Most SEOs ignore Bing entirely. Submit your sitemap to Bing Webmaster Tools. Bing weighs exact-match keywords more heavily than Google and favors .com domains. Small effort, outsized impact on ChatGPT citations.

9. Get mentioned on Reddit and Twitter/X

Both ChatGPT and Grok have been trained on massive Reddit and Twitter datasets. If your product or website gets mentioned organically in Reddit threads and tweets, it increases the probability of citation in training-data responses. Genuine mentions in relevant subreddits are gold.

10. Use schema markup aggressively

JSON-LD schema (Article, FAQ, HowTo, Product, Review) makes your content machine-readable. While LLMs do not directly parse schema during inference, the pages that use rich schema tend to rank higher on Google and Bing — which is how LLMs discover content. Structured data is a proxy signal for content quality.
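For the Article type, a minimal sketch that also bakes in the author attribution from factor #6 — the date and description values are placeholders you would replace with your own:

```python
import json

# Sketch: Article JSON-LD with author attribution (factors 6 and 10).
# datePublished and the author description are placeholder values.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How ChatGPT Actually Picks Sources to Cite",
    "datePublished": "2026-01-01",
    "author": {
        "@type": "Person",
        "name": "Lucian Anghel",
        "description": "Agency owner with 4 years in B2B lead generation",
    },
}
tag = f'<script type="application/ld+json">{json.dumps(article)}</script>'
```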

11. Create definitive resource pages

LLMs love “ultimate guide” type content that covers a topic comprehensively. When asked a broad question, they prefer to cite one thorough source rather than piecing together 5 thin pages. If your page is the single most complete resource on a topic, it becomes the default citation.

12. Publish on high-authority platforms too

Write articles on Medium, Dev.to, Substack, LinkedIn. These platforms are heavily represented in LLM training data. A Medium article about your expertise that links back to your site creates a dual citation path: the LLM might cite the Medium article directly (mentioning your brand) or discover your main site through the reference.

The GEO checklist: 12-point audit for LLM readiness

Use this checklist to audit any page for LLM citation potential:

01. Core answer appears in first 200 words
02. Page contains at least 3 specific numbers/data points
03. Comparison content uses HTML tables (not just prose)
04. Page includes unique data not found elsewhere
05. FAQ section with 4+ questions using natural language
06. FAQ schema markup (JSON-LD) implemented
07. Named author with credentials and author schema
08. Current year in title/headings (freshness signal)
09. Sitemap submitted to both Google AND Bing
10. Article schema + breadcrumb schema present
11. Page mentioned on Reddit, Twitter, or LinkedIn
12. Cross-published on Medium/Dev.to/Substack with backlink
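Several of these items are machine-checkable. A rough sketch of an automated pass over a page's HTML — thresholds come from the checklist above; the function name and heuristics (e.g. using the first ~1,200 characters as a proxy for 200 words) are mine:

```python
import re

# Rough sketch: check a page's HTML against the machine-checkable
# checklist items. Heuristics and function name are illustrative.
def quick_geo_audit(html: str) -> dict:
    # Strip tags, then use the first ~1200 chars as a 200-word proxy (item 01/02)
    first_chunk = re.sub(r"<[^>]+>", " ", html)[:1200]
    return {
        "has_numbers": len(re.findall(r"\d[\d,.%$]*", first_chunk)) >= 3,  # item 02
        "has_table": "<table" in html,                                     # item 03
        "has_faq_schema": '"FAQPage"' in html,                             # item 06
        "has_author_schema": '"author"' in html,                           # item 07
        "has_year": bool(re.search(r"20\d{2}", html)),                     # item 08
    }

report = quick_geo_audit(
    '<h1>Best CRM 2026</h1><p>$49/mo, 500 leads, 10 min.</p>'
    '<table><tr><th>Feature</th></tr></table>'
    '<script type="application/ld+json">{"@type":"FAQPage","author":{}}</script>'
)
```

Items like "unique data not found elsewhere" or "mentioned on Reddit" still need a human (or an external search), which is why this stays a 12-point manual checklist.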

What this means for agencies

If you run a marketing agency, this is a new service you can sell today. Most businesses have zero LLM presence — ask ChatGPT about any local business and you will get a generic answer or nothing at all. The agency that helps them appear in AI responses has a clear, differentiated pitch that no competitor is offering yet.

Start by auditing your clients' websites for LLM readiness using the 12-point checklist above. Then use a tool like LeadHunt to find businesses with weak online presence — they are the ones who need GEO services the most and will pay the highest premiums because no one else is offering it.

The 29-point website audit already covers many GEO factors: structured data, content quality, mobile performance, schema markup. Add the LLM-specific checks from this article and you have a 41-point audit that no competitor can match.

Traditional SEO vs GEO: the differences that matter

Factor | Google SEO | GEO (AI Citations)
Backlinks | Critical ranking factor | Indirect (helps discoverability)
Keywords | Match user queries | Match conversational questions
Content length | 1,500-3,000 words sweet spot | Depth matters more than length
Freshness | Important for news queries | Critical — year in title wins
Unique data | Nice to have | The #1 citation factor
Tables/structure | Good for featured snippets | Essential for comparison queries
Social mentions | Debated/minimal impact | Reddit/Twitter in training data
Schema markup | Rich results | Machine-readable structure
Domain authority | Major factor | Irrelevant (LLMs do not see DA)

Bottom line

Getting cited by AI is not about tricking an algorithm. It is about creating content that is so specific, so structured, and so uniquely valuable that an LLM has no choice but to reference you as the best source.

The agencies charging $4K/month for “AI SEO” are mostly doing regular SEO with a new label. The real work is simpler and cheaper: answer questions directly, include unique data, use tables, add schema, submit to Bing, and get mentioned on Reddit. That is it.

You do not need to pay someone $4K to do this. You need to understand how it works and do it yourself — or use tools like LeadHunt to find clients who need this service and sell it yourself.

Frequently asked questions

Can you SEO your way into ChatGPT recommendations?

Not with traditional SEO. Google SEO optimizes for crawlers and backlinks. LLM citation depends on content structure, specificity, authority signals in training data, and how well your page answers a direct question. Some overlap exists (E-E-A-T, schema markup) but the ranking factors are fundamentally different.

Does ChatGPT use real-time search?

ChatGPT with browsing enabled (GPT-4o) performs Bing searches and can cite live pages. Without browsing, it relies on training data (knowledge cutoff). Gemini always searches Google. Claude does not browse by default. Grok searches X/Twitter. Each LLM has a different discovery mechanism.

What is Generative Engine Optimization (GEO)?

GEO is the practice of optimizing content to be cited by AI chatbots like ChatGPT, Gemini, and Claude. Unlike traditional SEO which targets search engine result pages (SERPs), GEO targets the training data and retrieval mechanisms of large language models. Key tactics include structured data, direct-answer formatting, unique data, and authoritative sourcing.

How long does it take to get cited by ChatGPT?

If ChatGPT uses browsing (live search), your content can be cited within days of publishing — if it ranks on Bing. For training data inclusion, the timeline is 6-18 months depending on when the next training cut happens. Gemini with Google Search can cite you faster since it searches Google in real-time.

Does LeadHunt help with AI SEO?

LeadHunt helps agencies find businesses that need AI SEO services. Use the 29-point website audit to identify clients whose sites are not optimized for LLM citations, then pitch them GEO services. The audit checks structured data, schema markup, and content quality — all factors that affect AI citations.

Which LLM is easiest to get cited by?

Gemini is the easiest because it searches Google in real-time for every query. If you rank on Google page 1, Gemini will likely cite you. ChatGPT with browsing uses Bing, so Bing SEO matters. Claude and Grok are the hardest since they rely more on training data and X/Twitter respectively.

Related: 29-point website audit · SEO case study: 90 days · One-person agency AI stack