How to structure your website for AI

You already know about Generative Engine Optimization (GEO), or you probably wouldn’t be reading this post. But you’re here to learn more: how LLMs (like ChatGPT, Gemini, Perplexity, and others) crawl your website, how to structure your website so your content and brand show up in LLM answers, and more.

In this article you’ll learn how to organize your website to make it AI-friendly, along with some context on what’s happening behind the scenes and why structuring your website for AI matters.

What isn’t in this post

I’m not going to cover what content to write or how to write content. Sure, LLMs pull website content that provides clear answers. And, yes, LLMs gather all sorts of content. (But not all content is the same. Content that will stand out includes facts, comparisons, original insights, and many other types of non-generic information.) Still, On Marketing has lots of information on content for LLMs. This just ain’t it.

Just the facts

A couple of relevant pieces of information before the deep dive:

  • LLMs do best at crawling sites with clear structure.
  • It’s possible to build websites that serve human visitors (your customers), traditional web crawling bots (like Googlebot), and AI bots (LLM scrapers) all at once.

LLMs Are Crawling My Website?

Yes. Large language models rely on web crawlers much like search engines do, but for different purposes. OpenAI’s GPTBot scans publicly available content to improve model accuracy and capabilities. Similarly, ClaudeBot, used by Anthropic, crawls the web while (sometimes) respecting robots.txt directives. In addition to their own proprietary bots, many LLMs are trained on massive datasets from tools like Common Crawl, which operates like a traditional web crawler but gathers broad swaths of internet data specifically for AI training.
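If you want to see, or control, which of these bots visit your site, robots.txt is the usual lever. Here’s a minimal sketch using the user-agent tokens these crawlers publish; the /private/ path is a hypothetical example:

    # OpenAI's crawler: allowed everywhere except a private area
    User-agent: GPTBot
    Disallow: /private/

    # Anthropic's crawler
    User-agent: ClaudeBot
    Disallow: /private/

    # Common Crawl's crawler (feeds many AI training datasets)
    User-agent: CCBot
    Disallow: /private/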

What is Common Crawl and why is it important?

Common Crawl is a non-profit organization that regularly crawls billions of web pages and makes the data freely available to the public. Its massive datasets include raw HTML, metadata, link graphs, and extracted text from across the internet, stored in WARC (Web ARChive) format. These datasets are hosted on Amazon S3 and are widely used by researchers, developers, and AI companies to train models, analyze web trends, or build custom tools.

For LLMs, Common Crawl is a foundational resource. Many LLMs are trained on filtered and cleaned subsets of its data due to its vast scale, diversity of content, and open accessibility. While the raw data includes a mix of high- and low-quality content, it offers an invaluable, language-rich snapshot of the internet. As a result, Common Crawl plays a central role in shaping the capabilities of today’s most powerful AI systems.

What happens after an LLM bot crawls my site?

Maybe we’re getting ahead of ourselves, but it’s good to understand the big picture.

The bots mentioned above have frameworks and processes for collecting and then digesting information. Here’s how it works:

  1. First, as mentioned above, the AI bots collect raw data (e.g. text, HTML, metadata).
  2. Then, this data is processed (filtered and structured) during a pre-processing and cleaning phase. (Maybe I’ll cover this in a future post.) The key is that this is the phase where a bunch of junk gets removed: useless HTML like nav bars, CSS, and that sort of thing, along with low-quality content. So be sure to read more about content for GEO.
  3. Next, all your site content gets tokenized by the AI model, which uses those tokens for training (also a really cool topic, but one for another day).
  4. And finally, the model embeds your content into its internal representation, which is what it draws on when retrieving and surfacing answers.

The takeaway from all this: you want to make it easy for AI bots to access and make sense of your content. And how your website is built is a key part of that.

Structured Data

You want Common Crawl, ClaudeBot, GPTBot, and the others to not just crawl your website but also “understand” what they’re gathering. (No, LLMs don’t actually understand, but that’s a different discussion.) Structured data is a standardized way of tagging content on your website so that search engines and AI models can better understand what each element means, like identifying a product’s price, an article’s author, or the steps in a how-to guide.

While it doesn’t directly improve rankings, structured data enhances how your content appears in search results (think rich snippets) and makes it easier for AI systems to categorize and summarize your site.

You or your web developer are probably familiar with structured data and the JSON-LD format (the format recommended by Google). But if not, you can read more at json-ld.org.
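To make that concrete, here’s a minimal JSON-LD sketch for a blog article; the author name, date, and description are hypothetical placeholders:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "How to structure your website for AI",
      "author": { "@type": "Person", "name": "Jane Doe" },
      "datePublished": "2025-06-01",
      "description": "How to organize a website so AI crawlers can find and understand its content."
    }
    </script>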

Structure and Internal Links (Basics)

I’ve already said that a well-structured website makes it easier for both users and search engines (and AI bots) to understand and navigate your content. To achieve this, organize your pages into clear categories and subcategories, keep the navigation menu simple, and use logical internal links to connect related content. For example, a blog about running shoes should link to related posts like choosing the right running shoes, not to unrelated topics. This is pretty basic stuff, and not really GEO-specific. But it’s a good reminder.

You should also submit a sitemap through Google Search Console so that search engines can index your pages more efficiently. Regularly check for and fix broken links, as they harm both the user experience and your rankings. Overall, a clean structure and smart internal linking improve your site’s discoverability and boost its position in both traditional search results and AI search.
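In case you haven’t seen one lately, a sitemap is just an XML file listing the URLs you want indexed. A minimal sketch (the URLs and dates are hypothetical):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/blog/choosing-running-shoes/</loc>
        <lastmod>2025-06-01</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/blog/running-shoes-flat-feet/</loc>
        <lastmod>2025-05-20</lastmod>
      </url>
    </urlset>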

Structure (More Advanced)

Ok, that was the basics. Here are some more advanced ideas:

Use Semantic HTML and Schema Markup

To help AI understand your content better, use semantic HTML tags (like <article> and <section>) and structured data from Schema.org. It’s like labeling the stuff in your closet: it tells search engines exactly what each piece of content is about. If you’ve got a product review, mark it up as a Review, and if you’re answering common questions, use FAQPage markup. It makes your content easier for LLMs to parse and feature.
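Here’s a minimal sketch of both ideas together: semantic tags for the page skeleton, plus FAQPage markup for a question-and-answer block (the content is a hypothetical example):

    <article>
      <h1>Choosing the Right Running Shoes</h1>
      <section>
        <h2>Frequently asked questions</h2>
        <!-- FAQPage markup tells crawlers this section is Q&A content -->
        <script type="application/ld+json">
        {
          "@context": "https://schema.org",
          "@type": "FAQPage",
          "mainEntity": [{
            "@type": "Question",
            "name": "How often should I replace running shoes?",
            "acceptedAnswer": {
              "@type": "Answer",
              "text": "Most runners replace their shoes every 300 to 500 miles."
            }
          }]
        }
        </script>
      </section>
    </article>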

Build Smart Internal Link Structures

Instead of randomly linking pages together, group them into topics. Have one main page (the hub) that links to all the in-depth pages (the spokes) and make sure they link back to each other in a way that makes sense. AI tools love this kind of structure because it shows you’re a real expert in that topic, not just tossing out scattered info.
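A sketch of what the hub side of that looks like in plain HTML (the URLs and titles are hypothetical):

    <!-- Hub page: /running-shoes/ -->
    <nav aria-label="Running shoe guides">
      <ul>
        <li><a href="/running-shoes/flat-feet/">Best running shoes for flat feet</a></li>
        <li><a href="/running-shoes/trail/">Trail running shoes</a></li>
        <li><a href="/running-shoes/care/">How to care for running shoes</a></li>
      </ul>
    </nav>
    <!-- And each spoke page links back to the hub:
         <a href="/running-shoes/">All running shoe guides</a> -->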

Use Navigation That Feels Like a Q&A Journey

Think about how people search for stuff, usually starting broad, then getting more specific. Your site should work the same way. Start with a general category, then lead users (and AI) down a trail of more detailed content. If someone lands on a page about running shoes, link them to “best for flat feet,” then to “how to care for them,” and so on.

Break Up Your Sitemap by Topic

If you’ve got a lot of content, don’t dump it all into one sitemap. Split it up by topic, like gear, blog posts, reviews, etc. This helps search engines index your site more efficiently and shows AI engines how your site is organized. It’s also great for updating sections without refreshing everything at once.
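A common way to do this is a sitemap index file that points to one sitemap per topic (the URLs are hypothetical):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap><loc>https://www.example.com/sitemap-gear.xml</loc></sitemap>
      <sitemap><loc>https://www.example.com/sitemap-blog.xml</loc></sitemap>
      <sitemap><loc>https://www.example.com/sitemap-reviews.xml</loc></sitemap>
    </sitemapindex>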

Don’t Confuse the Bots with Weird Layouts

Keep your menus and links clear and consistent. Avoid hiding important links in buttons or loading them only with JavaScript. Also, make sure your breadcrumbs match your URLs. The easier it is for AI to follow your structure, the more likely it is to surface your content in answers and summaries.
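For instance, here’s a breadcrumb trail for a hypothetical URL like /gear/running-shoes/, shown both as visible links and as matching BreadcrumbList markup:

    <nav aria-label="Breadcrumb">
      <a href="/">Home</a> &gt; <a href="/gear/">Gear</a> &gt; Running Shoes
    </nav>
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://www.example.com/" },
        { "@type": "ListItem", "position": 2, "name": "Gear", "item": "https://www.example.com/gear/" },
        { "@type": "ListItem", "position": 3, "name": "Running Shoes" }
      ]
    }
    </script>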



On the topic of how to structure your website for AI, there’s lots more to say. If you want to connect with us, it’s easy. Whether you’re saying hi or reaching out about an opportunity, we’re here and always excited to talk about GEO, AI search, and how to help marketing teams in a new world of LLMs.