The Search Session

Log Files, AI Bots, and the Real Mechanics of AI Search | Metehan Yeşilyurt

Gianluca Fiorelli · 53m
ai-search · log-analysis · ai-bots · perplexity · embedding-similarity · seo-technical · llm-vulnerabilities · retrieval-augmented-generation

Summary

Metehan Yeşilyurt, a Turkish SEO practitioner known for breaking open AI search systems, walks through the operational mechanics of optimizing for AI Overviews, AI Mode, and Perplexity. His core thesis: we've shifted from chasing clicks in deterministic SERPs to chasing citations in machine-generated text; the retrieval and re-ranking layers are where the real optimization now happens.

On the divergence between Google's AI products: AI Overviews, AI Mode, and Web Guide are different products with different ranking behaviors, even though all run on Gemini. Informational queries suffer most (recipe sites are losing traffic), while commercial queries behave differently. The "great decoupling" is visible in Search Console: impressions flat or rising while clicks drop, the classic crocodile chart.

Yeşilyurt's most actionable contribution is his log file analysis methodology for AI bots. Rather than generating synthetic prompts to measure visibility (which he considers unreliable because of personalization and temperature settings), he analyzes server logs to see which pages LLM crawlers actually visit. His key question to clients: "What are your top pages for LLMs vs your top pages for humans?" The mismatch reveals optimization gaps. Google's layout parser is the most sophisticated; other LLM user agents use similar but less capable systems.

On Perplexity specifically: it achieves lower hallucination rates by using high embedding similarity thresholds and running multiple fact-check passes, and it collects diversified results from multiple search engines. The first citation result influences which follow-up questions are shown, a feedback loop that compounds early citation advantage.
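The log-file methodology above boils down to segmenting server access logs by user agent and comparing the top pages per segment. A minimal sketch against Apache/Nginx combined-format logs; the listed crawlers (GPTBot, PerplexityBot, ClaudeBot, Google-Extended, CCBot) are real, documented AI bot user agents, but the exact set to match is an assumption you should tune to your own logs.

```python
import re
from collections import Counter

# Substrings identifying known AI crawler user agents. Extend this list
# with whatever actually appears in your own access logs.
AI_BOTS = ("GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended", "CCBot")

# Matches the request path and user-agent field of a "combined" log line.
LOG_RE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def top_pages(log_lines, n=10):
    """Return (top pages hit by AI bots, top pages hit by everyone else)."""
    bots, humans = Counter(), Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        path, ua = m.group("path"), m.group("ua")
        (bots if any(b in ua for b in AI_BOTS) else humans)[path] += 1
    return bots.most_common(n), humans.most_common(n)
```

Diffing the two rankings answers Yeşilyurt's client question directly: pages that rank high for humans but never appear in the bot counter are the optimization gap.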
The conversation surfaces four current LLM vulnerabilities that reveal how these systems actually work: (1) recency bias: recent dates in documents influence re-ranking; (2) the lost-in-the-middle problem: LLMs lose attention mid-content, mitigated by placing FAQs at the end and tables in the middle; (3) data poisoning: roughly 250 documents can influence any LLM's training data; (4) prompt injection: alternative sentences in web pages can still influence citations. These aren't recommendations to exploit; they're architectural tells about how retrieval-augmented generation actually weights signals.

Yeşilyurt's workflow starts with a technical audit, then topical cluster mapping, then embedding analysis (how the brand is represented in different embedding models; Google, OpenAI, and Claude all differ), then user intent research drawn from real questions (PAA, Reddit, Quora). He has explicitly moved from keyword research to user intent research as the primary planning input.
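The high-similarity-threshold mechanic attributed to Perplexity can be sketched as a retrieval filter: passages whose embedding similarity to the query falls below a cutoff never become citation candidates, which trades recall for a lower hallucination rate. A minimal sketch assuming precomputed embeddings; the 0.75 threshold and the function names are illustrative, not Perplexity's actual values.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_candidates(query_vec, candidates, threshold=0.75):
    """Keep only passages whose similarity to the query clears the
    threshold; everything below it is never eligible for citation.
    `candidates` is a list of (passage_id, embedding) pairs."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in candidates]
    return sorted(
        (s for s in scored if s[1] >= threshold),
        key=lambda s: s[1],
        reverse=True,
    )
```

A second fact-check pass, as described in the summary, would then re-verify each surviving passage against the generated answer before it is cited.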

Key points

  • Log file analysis for AI bots > synthetic prompt tracking. Ask: what are your top pages for LLMs vs for humans? The mismatch reveals optimization gaps
  • AI Overviews, AI Mode, and Web Guide are three different Google products with different retrieval and ranking behaviors despite all using Gemini
  • Perplexity uses high embedding similarity + multi-source diversification + multiple fact-check passes to reduce hallucination — first citation influences follow-up questions (compounding advantage)
  • Four LLM vulnerabilities as architectural tells: recency bias, lost-in-the-middle, data poisoning (~250 docs sufficient), prompt injection still works on citation selection
  • Lost-in-the-middle fix: put FAQs at end of content, tables in middle — this recovers LLM attention in the zone where it naturally drops
  • Embedding representation differs across models (Google, OpenAI, Claude) — a brand can be well-represented in one and invisible in another
  • Keyword research → user intent research: collect real questions from PAA, Reddit, Quora rather than relying on keyword tools
  • Google has 20+ years of spam fighting experience; OpenAI and Perplexity need to build in-house spam/quality teams for the retrieval layer
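The embedding-analysis step in the points above can be sketched as a per-model similarity check: embed the brand and a target topic with each provider's model and compare the cosine similarities. The `embedders` mapping and model names are placeholders; in practice each callable would wrap a real embedding API (Google, OpenAI, Anthropic), and a low score for one model flags where the brand is "invisible".

```python
import numpy as np

def brand_topic_similarity(brand_text, topic_text, embedders):
    """Compare how closely each embedding model associates a brand with
    a topic. `embedders` maps model name -> callable returning a vector;
    the callables are assumed to wrap real embedding APIs."""
    out = {}
    for name, embed in embedders.items():
        a = np.asarray(embed(brand_text), dtype=float)
        b = np.asarray(embed(topic_text), dtype=float)
        out[name] = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return out
```

Running this for the same brand/topic pair across providers makes the cross-model divergence in the bullet above directly measurable.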

Notable quotes

"We are shifting our focus from deterministic search engines to probabilistic systems. I believe this is the hardest to explain part at the moment."
"What are your top pages in LLMs and what are your top pages in humans? If your brand is mentioned in LLMs, it doesn't guarantee you will drive traffic."
"250 documents are enough to poison any LLM data at the moment."
"You can use more FAQs at the end and tables in the middle of your content and you can see the difference."