Layer 1 - Infrastructure

robots.txt: the file that decides which AI systems can crawl you.

Most businesses have never looked at their robots.txt file. Many are unintentionally blocking OAI-SearchBot, GPTBot, PerplexityBot, ClaudeBot, or related agents - or publishing no clear crawl policy at all. A blocked crawler cannot evaluate or cite a business from the pages it cannot access.

Get Your Free Audit ->

The audit identifies whether this service is needed for your site.

The Problem

You may be blocking ChatGPT without knowing it.

robots.txt is a plain text file at yourdomain.com/robots.txt. It tells web crawlers which parts of a site they are allowed to access. Most business websites have a robots.txt created automatically by their CMS on launch - and many of those defaults either block all crawlers with Disallow: / under User-agent: *, or leave AI bots governed by a restrictive wildcard rule written before those bots existed.

OpenAI documents multiple agents with different roles: OAI-SearchBot for search discovery, GPTBot for training, and ChatGPT-User for user-triggered fetches. Anthropic and Perplexity document similar agent families. If these relevant agents are disallowed in robots.txt, access becomes inconsistent and the business is harder to crawl, verify, or cite.
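Whether a given agent is actually admitted can be spot-checked offline with Python's standard library. This is an illustrative sketch, not VERIS tooling: the agent list mirrors the vendors' published names, and yourdomain.com is a placeholder.

```python
# Sketch: test which AI user-agents a robots.txt policy admits,
# using only the standard library.
from urllib import robotparser

AI_AGENTS = ["OAI-SearchBot", "GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot"]

def agent_access(robots_txt: str, url: str = "https://yourdomain.com/") -> dict:
    """Map each AI user-agent to whether it may fetch the given URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, url) for agent in AI_AGENTS}

# A restrictive CMS default blocks every agent through the wildcard:
print(agent_access("User-agent: *\nDisallow: /"))  # every value is False
```

Feeding it a permissive file (User-agent: * with Allow: /) flips every value to True, which is the state the audit aims for.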

Public search discovery is a separate but related issue. Google and Bing index coverage are useful proxy checks because AI systems often cite pages that are already discoverable in major web indexes. That is not a guarantee of inclusion in any one AI product, but total absence from public indexes is a warning sign that should be fixed.

Removing crawler blocks and publishing a clean sitemap improves discovery signals. It does not guarantee placement in ChatGPT, Gemini, Claude, or Perplexity on a fixed timeline.

What Gets Implemented

Full AI crawler access configuration.

1. robots.txt full rewrite

All relevant directives reviewed and added explicitly where appropriate: OAI-SearchBot, GPTBot, ChatGPT-User, PerplexityBot, Google-Extended, ClaudeBot, Claude-SearchBot, Claude-User, anthropic-ai, cohere-ai, CCBot, and Bingbot. Wildcard conflicts corrected.
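The wildcard-conflict point can be demonstrated with the standard library's parser: a group naming a specific user-agent takes precedence over User-agent: *, so listing each agent explicitly removes ambiguity. A minimal, hypothetical policy:

```python
from urllib import robotparser

# Hypothetical file: the wildcard blocks everything, but an explicit
# group re-admits GPTBot. Specific groups override the wildcard.
POLICY = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(POLICY.splitlines())

print(rp.can_fetch("GPTBot", "https://yourdomain.com/page"))        # True - its own group wins
print(rp.can_fetch("SomeOtherBot", "https://yourdomain.com/page"))  # False - wildcard applies
```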

2. Sitemap.xml validation and submission

sitemap.xml checked for validity, completeness, and correct lastmod values. Submitted to Google Search Console and Bing Webmaster Tools.
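A basic validity pass over a sitemap can be sketched with the standard library. The element names come from the sitemaps.org schema; the audit's real checks are broader than this.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(xml_text: str) -> list:
    """Return (url, problem) pairs for entries with a missing or
    unparseable lastmod value."""
    issues = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            issues.append((loc, "missing lastmod"))
            continue
        try:
            # W3C datetime: the first 10 chars are the YYYY-MM-DD date part.
            date.fromisoformat(lastmod.strip()[:10])
        except ValueError:
            issues.append((loc, "invalid lastmod"))
    return issues
```

Run against a live sitemap, this flags each URL needing a corrected lastmod before submission to Google Search Console and Bing Webmaster Tools.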

3. Canonical tag audit

3-5 key pages checked for correct self-referencing canonical tags. HTTP/HTTPS and www/non-www consistency verified.
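The self-referencing check amounts to comparing the page URL against the href of the page's link rel="canonical" tag. A stdlib sketch, assuming a trailing-slash-insensitive comparison is enough (a real audit also normalizes scheme and host):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and (a.get("rel") or "").lower() == "canonical":
            if self.canonical is None:
                self.canonical = a.get("href")

def is_self_canonical(html_text: str, page_url: str) -> bool:
    finder = CanonicalFinder()
    finder.feed(html_text)
    return (finder.canonical is not None
            and finder.canonical.rstrip("/") == page_url.rstrip("/"))
```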

4. Public discovery checks

Google and Bing site: searches used as public discovery proxies. If coverage is weak, sitemap submission and crawl follow-up steps are documented.

5. Redirect chain check

HTTP -> HTTPS redirect verified as single-hop. Chains of 2+ hops flagged.
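Hop counting can be separated from the HTTP client, which keeps the logic testable. In this sketch, fetch is any callable returning (status_code, location_or_None); in practice it would wrap a HEAD request.

```python
def redirect_hops(start_url, fetch, limit=5):
    """Follow Location headers and return the chain of URLs visited.

    `fetch(url)` must return (status_code, location_header_or_None).
    """
    chain = [start_url]
    while len(chain) <= limit:
        status, location = fetch(chain[-1])
        if status not in (301, 302, 307, 308) or not location:
            break
        chain.append(location)
    return chain

# Simulated responses: a clean single-hop HTTP -> HTTPS redirect.
responses = {
    "http://yourdomain.com/": (301, "https://yourdomain.com/"),
    "https://yourdomain.com/": (200, None),
}
chain = redirect_hops("http://yourdomain.com/", lambda u: responses[u])
print(len(chain) - 1)  # 1 hop: nothing to flag
```

A chain of 2+ hops (for example HTTP to www to HTTPS) shows up as len(chain) - 1 >= 2, which is the condition the check flags.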

6. Accidental noindex correction

HTML and response headers checked for unintentional noindex directives on important pages. Corrected if found.
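Both noindex carriers (the X-Robots-Tag response header and the robots meta tag) can be checked in a few lines. A sketch, assuming header names are already normalized and the meta tag lists name before content:

```python
import re

def has_noindex(headers: dict, html_text: str) -> bool:
    """True if the response header or a <meta name="robots"> tag
    carries a noindex directive. Assumes normalized header names."""
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html_text, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())
```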

ROBOTS.TXT EXAMPLE
# VERIS - AI Crawler Configuration
# Updated: 2026-03-31

User-agent: *
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Scope Boundary

What this service does not include.

  • Content production or page creation to be indexed
  • Custom sitemap architecture (standard sitemap generation and validation are included)
  • Google Search Console account creation
  • Bing Webmaster Tools account creation (setup is guided; account requires client email)
  • Paid search indexation services
  • Any changes to page content, design, or structure

How To Verify This

Check your robots.txt in 10 seconds.

Open your browser and navigate to yourdomain.com/robots.txt. Look for lines containing OAI-SearchBot, GPTBot, PerplexityBot, ClaudeBot, and any Google-Extended policy you choose to publish. If you see "Disallow: /" under any of the relevant agents, that crawler is blocked. If those user-agent lines do not appear at all, the crawlers are falling back to the wildcard rule, which may or may not allow them depending on what that wildcard says.
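The same 10-second check can be scripted: does the file name each agent explicitly, or does the agent fall back to the wildcard rule? A stdlib sketch (the fetch helper needs network access; the mention check runs on any string):

```python
import urllib.request

def fetch_robots(domain: str) -> str:
    # Network call; pass your real domain here.
    with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", "replace")

def mentioned_agents(robots_txt: str, agents) -> dict:
    """For each agent, True if the file has an explicit User-agent
    line for it, False if it falls back to the wildcard rule."""
    named = [line.split(":", 1)[1].strip().lower()
             for line in robots_txt.splitlines()
             if line.lower().startswith("user-agent:")]
    return {agent: agent.lower() in named for agent in agents}

print(mentioned_agents("User-agent: *\nAllow: /", ["GPTBot"]))  # {'GPTBot': False}
```

A False here is not necessarily a block - it means the agent's access is whatever the wildcard says, which is exactly the ambiguity the rewrite removes.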

Verification Tool
Direct browser URL check
https://yourdomain.com/robots.txt

Step 1: Open the URL. Step 2: Find OAI-SearchBot, GPTBot, PerplexityBot, ClaudeBot, and any Google-Extended policy. Look for Allow: / rather than Disallow: / under each agent.

Google or Bing (site: search)
https://www.google.com or https://www.bing.com

Search: site:yourdomain.com. Zero results suggests a crawl or indexation problem worth fixing. This check is a public discovery proxy, not a standalone guarantee of AI visibility.

FAQ

Common questions about AI crawler configuration.

Does allowing AI crawlers create a security or privacy risk?

No. robots.txt is a voluntary protocol - it instructs well-behaved crawlers. It has no effect on malicious scrapers. Allowing AI crawlers does not expose private pages, admin areas, or sensitive data. Those pages should be protected by authentication, not robots.txt.

Can this be implemented on WordPress and other CMS platforms?

Yes. Most WordPress SEO plugins let you add custom directives manually. VERIS implements the AI crawler rules through the appropriate method for your CMS without conflicting with plugin settings.

What is the difference between Googlebot and Google-Extended?

Googlebot is Google's standard search crawler for Google Search. Google-Extended is a separate publisher control Google documents for certain Gemini and Vertex AI generative uses. It does not replace Googlebot, so VERIS reviews normal search crawl access separately from any Google-Extended policy you want to publish.

How quickly do these changes take effect?

There is no guaranteed timeline. Search crawlers may revisit in days or weeks, while model refreshes and citation behavior can take longer. VERIS treats Bing and Google index checks as public discovery proxies, not as a fixed promise that any one AI product will start citing a site on a specific date.

What if a site has no robots.txt at all?

An absent robots.txt is generally treated as allowed by default. VERIS still recommends publishing an explicit file because it makes crawler policy auditable, reduces ambiguity for site owners, and includes the sitemap directive that improves crawl hygiene.

Find out which AI crawlers can and cannot access your site.

The audit checks your robots.txt against published AI and search agents and reviews your public discovery signals in major web indexes.
