Optimise your Content for Inclusion in LLM Public Datasets

In the modern world of web content marketing one thing is becoming clear – we no longer write just for humans. Like it or not, today’s most powerful audience is machine-based – Large Language Models (LLMs) like ChatGPT, Gemini, Claude and Grok. These models don’t just summarise the web. They define what gets seen, cited, and surfaced across billions of user interactions.

Therefore, if your content isn’t being sourced by these models, it’s effectively invisible to the next generation of search and discovery. The rules of SEO still matter, but there’s a new game in town: Artificial Intelligence Optimisation (AIO) or LLMO – Large Language Model Optimisation.

This guide breaks down how to get your content into the datasets that LLMs are trained on, how to structure it so it gets picked up and cited, and how to future-proof your work for an AI-first content landscape. If your content strategy isn’t adapting, it’s dying.

Let’s fix that.

CC BY and Open Licences

By applying a permissive licence such as Creative Commons’ CC BY, you send a clear signal that your content may be reused and referenced without legal friction.

Because dataset curators tend to skip over content whose copyright status is restrictive or ambiguous, applying an open licence greatly improves the chances of your content being sourced, since the licence explicitly permits reuse and redistribution.

The permissive licence acts like a green flag for AI trainers, signalling that your content is safe to use and that it is technically and legally accessible for inclusion in AI training pipelines.

Data Scraping

Next, you need to make sure your content is eligible for scraping and inclusion in LLM training datasets. That means your content needs to be public, crawlable and hosted on popular platforms that leading AI trainers already visit. Here are some of the most effective ways to structure and distribute your content:

  • Always use public platforms that are well known and crawl-friendly. GitHub (for code), ArXiv (for research), Stack Overflow (for technical Q&A), Medium, Quora, Reddit and Wikipedia remain some of the most popular and readily accessible destinations.
  • Avoid ‘content-gating’: ensure none of your content sits behind paywalls or log-ins, or is exposed to restrictive terms of service. Your content should always be free to read and easy to access.
  • Enable crawling by making sure the site hosting your content allows indexing by search engines via a permissive robots.txt file (see the sketch after this list).
  • Always use clear content structure such as headings, alt text and metadata to improve machine readability.
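
For illustration, here is a minimal robots.txt that welcomes both search engines and the main AI crawlers. The user-agent names below (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google’s AI training and ClaudeBot for Anthropic) are the publicly documented ones at the time of writing; check each vendor’s documentation for the current list:

    # Allow all crawlers by default
    User-agent: *
    Allow: /

    # Explicitly allow the main AI training crawlers
    User-agent: GPTBot
    Allow: /

    User-agent: CCBot
    Allow: /

    User-agent: Google-Extended
    Allow: /

    User-agent: ClaudeBot
    Allow: /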

By following these steps you increase the chances of your content being included in the public datasets that LLMs draw on during training.

Best Practices for Discoverability

There are a number of key technical processes that you should adopt to make sure your content is being seen. Let’s take a closer look.

Use Clean HTML

Firstly, it’s imperative that you use clean HTML and semantic markup. This means structuring your content so that it is easily understandable by both humans and machines. To do this, make sure you incorporate the following (a minimal example follows the list):

  • Always use heading tags properly, e.g. <h1> for the main title followed by <h2>, <h3> etc. for subsequent subheadings.
  • Make use of semantic tags, e.g. <article>, <section>, <nav> and <footer>, to indicate the role of each block of content.
  • Include a descriptive <title>, a <meta name="description"> tag and structured data to help search engines and data crawlers understand the context.
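
As a minimal sketch, a clean article page might look like this (the title, description and copy are placeholders):

    <html>
      <head>
        <title>How to Optimise Content for LLMs</title>
        <meta name="description" content="A practical guide to making your content visible to AI crawlers.">
      </head>
      <body>
        <article>
          <h1>How to Optimise Content for LLMs</h1>
          <section>
            <h2>Why it matters</h2>
            <p>Body copy goes here.</p>
          </section>
          <footer>Author, publication date and licence information.</footer>
        </article>
      </body>
    </html>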

Remember, clean and well-structured HTML increases the likelihood of AI crawlers parsing and indexing your content accurately, making it more likely to be used in AI training and retrieval systems.

Schema tags for articles, products, reviews, etc.

Secondly, by using schema.org tags you can help AI understand the meaning behind your content rather than treating it as just words on a page. For example, an article that uses the Article schema explicitly defines the author, publication date, headline and body copy.

Furthermore, content that uses the Product schema can communicate data such as list price, availability, reviews and ratings. You can even define your own schema types for niche classifications.
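
For example, an Article marked up in schema.org’s JSON-LD format might look like this (all values are placeholders):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "How to Optimise Content for LLMs",
      "author": { "@type": "Person", "name": "Jane Author" },
      "datePublished": "2025-01-15",
      "description": "A practical guide to making your content visible to AI crawlers."
    }
    </script>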

Minimise clutter (popups, excessive JS, gated forms)

Thirdly, it is imperative that your pages keep clutter (popups, excessive JavaScript, gated forms) to an absolute minimum. By doing so, you keep your content easy to crawl and ensure that search engines and AI scrapers can access and ingest it quickly and smoothly.

Use Canonical URLs to Avoid Duplication Issues

Finally, be sure to use canonical URLs to avoid duplication issues. Canonical URLs tell search engines and AI crawlers which version of a page is the original or preferred version, which is especially helpful if you have duplicate or very similar content across numerous URLs.

By using canonical URLs you ensure the right version of your content is used, rather than being overlooked or ignored completely.
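
In practice this is a single tag in the <head> of each duplicate page, pointing at the preferred URL (the address below is a placeholder):

    <link rel="canonical" href="https://www.yourdomain.com/original-article/">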

SEO Meets LLM Optimisation

LLMs typically draw on high-ranking content rather than pages buried further down the search results. Making sure your content ranks highly in traditional SEO terms therefore remains very important.

Furthermore, clear, concise, factually correct and well-structured content is just as palatable to LLMs as it is to humans. The use of natural, question-based language (FAQs, how-tos, ‘what is…?’) also tends to be more effective than bland, beige or overly colourful writing.

Finally, by favouring evergreen topics you can ensure your content remains relevant over a much longer period. This kind of content is more likely to be crawled, indexed and used in LLM training because of its consistency and long-term value, which reduces the need for frequent updates.

Building Credibility and Authority

It goes without saying that humans take greater notice of information that comes from credible sources. The same can be said of machines. It is therefore important to ensure your content carries weight and is cited by reputable and authoritative sources.

There are many ways you can achieve this, but one of the most efficient is to get cited and referenced by sites already known to be ‘high-authority’, e.g. the BBC, Reuters, The New York Times, The Guardian or The Verge. LLMs tend to favour content that comes from such sites.

Another technique is to publish research-backed or thought-leadership content, with links and quotes, in well-known and crawlable publications. Some of the most popular, accessible and well-regarded platforms available today include Medium, Dev.to, Substack and HackerNoon.

Finally, recent reverse-engineering research from Neil Patel has identified five core factors that determine whether LLMs like ChatGPT, Gemini and Grok recommend your brand:

  • Brand Mentions – the more your brand is mentioned in forums, blogs and reviews, the better.
  • Reviews – third-party reviews help to build trust and reputation.
  • Relevancy – good SEO still counts and carries as much weight as it ever did.
  • Age – LLMs prefer established companies, so if you are relatively new, showcasing your historical experience will carry weight.
  • Recommendations – being listed in round-ups, ‘best of’ lists and reviews can directly influence LLM output.

Link Building and Cross-Publishing

To increase the chances of your content being used in LLM training datasets, focus on strengthening your visibility and credibility signals. Earning more inbound links from reputable sites boosts your domain authority, which makes your content more discoverable and more likely to be prioritised by web crawlers.

To build credibility further, you should also syndicate or cross-publish your content on AI-friendly platforms such as the aforementioned GitHub (for code), ArXiv (for academic work) and Medium (for general articles). By doing so you make sure your content lives exactly where AI trainers are already looking.

Finally, by having your content quoted or, even better, published in high-traffic newsletters or major blogs, you can extend its reach and improve the chances of it being used in future LLM training updates.

Using AI-Specific Distribution Channels

To optimise your content even further in the quest for LLM recognition, you should also consider listing your work in public dataset hubs such as Papers with Code, Kaggle or GitHub repositories. These platforms are frequently used by AI developers and model trainers, and if your work is readily accessible there, there is a higher chance it will be absorbed by LLMs.

You can also go one step further by contributing your content to wikis, open-source knowledge bases and collaborative forums like Stack Exchange. Even taking part in Reddit AMAs helps your content become part of the active, crowd-sourced data that AI models use for reference.

Finally, you can also make your content available to dataset-focused projects such as LAION or Common Crawl, which aggregate large amounts of publicly available data that is then used to train LLMs.
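
Incidentally, because Common Crawl’s index is public, you can check whether your pages have already been captured by querying its CDX index API. The crawl ID below is an example (current IDs are listed at index.commoncrawl.org) and yourdomain.com is a placeholder:

    curl "https://index.commoncrawl.org/CC-MAIN-2024-33-index?url=yourdomain.com%2F*&output=json"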

Monitoring and Feedback

As anyone working in the modern age will tell you, the monitoring and feedback loop is essential to the success of any business, and the same applies to AI-friendly content. Whilst tools that tell you definitively whether your content was used in AI training are not yet available, there are a number of shortcuts you can use.

For example, you can test AI models by asking specific questions that you know should surface your data. The most efficient way to do this is to ask about distinctive phrases or novel, niche subjects that only your content covers.

You can also use tools such as Perplexity AI or You.com, which display citations, and monitor those citations to see whether your content is being sourced.

Finally, you can also set up alerts for backlinks or specific mentions to see if any AI-generated content is referencing your original work.

Advanced Optimisation for AI-Enhanced Discovery

As we can see, AI models have a direct influence on how users discover content. Therefore it’s worth going beyond just the basics. 

Let’s take a look at a few additional techniques that further improve your chance of being selected, cited, or surfaced by LLM AI systems like ChatGPT, Gemini and Grok.

Optimise for Featured Snippets and Direct Answers

LLMs often draw on content that ranks in Google’s featured snippets or ‘People also ask’ boxes. Structuring your content using Q&A formats, numbered lists and concise summaries will help improve visibility in both search engines and AI interfaces.
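
If you adopt a Q&A format, you can reinforce it with schema.org’s FAQPage markup, which makes each question-and-answer pair explicit to crawlers (the copy below is a placeholder):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "FAQPage",
      "mainEntity": [{
        "@type": "Question",
        "name": "What is LLM optimisation?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The practice of structuring content so it can be discovered, parsed and cited by AI models."
        }
      }]
    }
    </script>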

Use Behaviour Insights to Refine Structure

Tools like Microsoft Clarity or Smartlook help analyse how users engage with content. Heatmaps and scroll-depth tracking can also reveal which areas hold attention and can help you to improve clarity, formatting, and relevance.

Promote and Share Strategically

The more eyes on your content, the more likely it is to be linked, indexed and discovered by both humans and machines. B2B marketing consultant Tom Pick recommends leveraging social media, cross-posting and digital PR to widen your reach and keep your content relevant.

Verifying and Indexing Your Website in ChatGPT’s Knowledge Graph

A crucial step toward making your content discoverable by LLMs is verifying your website within the OpenAI ecosystem. Through ChatGPT’s Custom GPT Builder feature (available to all ChatGPT Plus users) you can link and validate your domain, effectively signalling to OpenAI’s infrastructure that your site is an authoritative source of information.

The verification process is straightforward:

  1. Access the Custom GPT Builder: In ChatGPT, navigate to your profile > Settings > Builder Profile. This area allows you to create custom GPTs and associate them with your identity and online presence.
  2. Add Your Domain: Within the Builder Profile, select “Verify New Domain”. Enter your domain name, and the system will generate a TXT record, a verification string you will add to your DNS settings (an example record is shown after these steps).
  3. Configure Your DNS Settings: Log into your domain registrar (e.g., GoDaddy, Google Domains, Hostinger) and locate the DNS management section. Add a new TXT record using the value provided by OpenAI. Leave other fields such as TTL at their default settings.
  4. Confirm Verification: Return to ChatGPT and click “Check” to verify the domain. If successful, your website will be indexed in OpenAI’s Knowledge Graph.
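
For illustration, the finished DNS entry typically looks something like the zone-file line below. The token shown is a placeholder; paste the exact value generated in your Builder Profile:

    ; placeholder token; substitute the value OpenAI generates for your domain
    yourdomain.com.   3600   IN   TXT   "openai-domain-verification=dv-xxxxxxxxxxxx"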

This establishes a direct connection between your owned content and the LLM’s underlying data infrastructure.

It’s a foundational move for any brand, publication, or creator looking to increase their visibility in AI-powered environments. Once your domain is verified, OpenAI’s models are more likely to reference your content in both custom GPTs and general usage.

Future-Proofing Your Content

Staying relevant in an AI-driven content landscape is crucial. By future-proofing your work with intentional strategies it is possible to keep your content relevant and accessible. Let’s take a look at how you can achieve this.

First and foremost, you should focus on creating unique, high-value content. Deep analysis, original research and expert insights all stand out because AI models tend to prioritise authoritative, distinctive sources over generic material.

Next, it is essential that your content is structured for AI understanding. Always use clear, semantic HTML headings, well-organised sections, and Schema.org markup. This not only helps you rank well with search engines but it also makes it easier for AI systems to parse and accurately source your content.

Always prioritise evergreen topics with lasting relevance; they attract attention over time and retain higher value in AI training datasets. It is therefore good practice to regularly revisit and update your content so that it stays fresh and competitive without becoming static.

It also pays to be aware of how AI may summarise or repurpose your work. Always break complex ideas into shorter sections that can be easily extracted and reassembled. This increases the likelihood of your content being used in AI applications.

Be sure to leverage analytics and AI tools in order to monitor how your content performs. These tools can help identify any knowledge gaps you may have missed whilst spotting emerging trends. Continuous iteration based on data-driven insights will ensure your content evolves alongside AI capabilities.

Building strong brand and domain authority is also very important. AI models favour sources with high credibility, so invest in backlink strategies, consistent branding, and active community engagement to reinforce your authority and visibility.

Finally, it pays to stay informed and proactive about legal and ethical developments. Be sure to keep track of evolving copyright laws, licensing options and industry best practices to ensure your content remains eligible for AI inclusion and that you retain control over its use.

Overall, it is best practice to prepare for an age of AI-native content marketing where content isn’t just read by humans, but it is also interpreted, summarised and distributed by machines.

Conclusion

As you can see, LLMs have the power to become the default interface for accessing, interpreting and distributing knowledge across the world. The difference this time is that the rules of engagement are being written in real time.

One thing is becoming clear though – visibility is no longer just about search rankings. Visibility in the AI driven world is about being included in datasets, retrieval engines and the generative outputs that billions of users now rely on.

As you can see, we are entering an era where online content is more likely to be read by machines than by humans. Furthermore, the influence of content is determined not by clicks but by how AI cites, paraphrases and reuses your insights.

This means a strategic shift in thinking is needed, not only for content writers but also for brands, thought leaders and publishers. It is therefore crucial to understand how content is sourced and used by LLMs.

Just as SEO defined the last ten years, AIO (Artificial Intelligence Optimisation) is set to shape the next. Ensuring your content aligns with and takes advantage of how LLMs operate is key not just to the success but, more importantly, to the survival of your content marketing function.

Here at The Bubble Co., we help forward-thinking brands optimise their content marketing for both humans and intelligent systems. So, whether you are building LLM-friendly architecture or navigating the grey ethical areas of content discovery, we’ll help you on your journey.

Reach out to us today to help your project optimise for an AI driven world.

This monumental shift is happening right now. Make sure you don’t get left behind.

Source: this blog was originally published on the Take3 website at take3.io
