Robots.txt

By Shahid Maqbool
On Apr 19, 2023

What is Robots.txt?

The robots.txt file is a simple text document placed in the root directory of a website that provides instructions to search engine crawlers about which pages or sections of the site they are allowed to crawl.

Its primary function is to manage crawler access and reduce unnecessary server load by blocking bots from accessing specific content.

What Robots.txt Does and Doesn’t Do

While robots.txt tells crawlers where they can and cannot go, it does not prevent indexing on its own. To ensure a page isn’t indexed, you’ll need to use the noindex meta tag, apply HTTP headers, or secure the page with authentication.

Search engines may still index pages disallowed in robots.txt if those pages are linked externally and not protected by other means.
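For reference, the HTTP-header alternative mentioned above is the X-Robots-Tag response header; exactly how you set it depends on your server or CDN configuration, so treat this as a sketch:

X-Robots-Tag: noindex

Like the noindex meta tag, this only works if crawlers are able to fetch the page, so the URL must not also be disallowed in robots.txt.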

How Robots.txt Fits into the Robots Exclusion Protocol (REP)

Robots.txt is part of the Robots Exclusion Protocol (REP), a standard that defines how web crawlers interact with websites. It is not mandatory—while most major search engines like Google and Bing respect robots.txt rules, malicious or lesser-known bots may ignore them entirely.

Robots.txt File Format: Syntax Overview

A basic robots.txt file follows this structure:

User-agent: [name of crawler]

Disallow: [URL path you want to block]

You can define multiple sets of rules for different bots. For example:

User-agent: *

Allow: /wp-content/uploads/

Disallow: /wp-admin/

User-agent: msnbot

Disallow: /

The asterisk (*) is a wildcard that applies the rules to all crawlers. A bot with its own named group, like msnbot above, follows that group instead of the wildcard rules.

Limitations of Robots.txt

Despite its usefulness, robots.txt has several limitations:

  • Not enforced: Crawlers are not legally required to follow it.

  • Not universal: Some bots disregard robots.txt rules altogether.

  • Not foolproof for indexing: Disallowed URLs can still appear in search results if linked elsewhere.

  • Syntax interpretation varies: Different crawlers may interpret rules differently.

  • Conflicting directives: Improper combinations of allow/disallow can lead to unexpected results.

For complete control, combine robots.txt with meta tags or server-level restrictions.

Be careful when combining crawling and indexing directives. The most common conflict is disallowing a URL in robots.txt while also relying on a noindex tag on that page: because compliant crawlers can no longer fetch the page, they never see the noindex instruction, and the URL can still be indexed from external links. Structure your crawling and indexing rules so they do not work against each other.
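A minimal illustration of that conflict (the path is a placeholder):

User-agent: *

Disallow: /private-page/

And inside /private-page/ itself:

<meta name="robots" content="noindex">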

Technical robots.txt syntax

User Agents

A user agent is any piece of software that accesses websites on a user's behalf, such as a web browser or a search engine crawler. Each crawler identifies itself with a distinct user-agent string, which lets webmasters write rules for specific bots in the robots.txt file.

Although there are hundreds of user agents, some of the most commonly encountered include:

  • Googlebot (Google)

  • Bingbot (Bing)

  • MSNBot (Microsoft)

  • Slurp (Yahoo)

  • DuckDuckBot (DuckDuckGo)

  • Baiduspider (Baidu)

To target all user agents with a single rule, use the asterisk symbol (*) as a wildcard:

User-agent: *

Allow: /

If you want to block all crawlers except a specific one (e.g., Slurp), you can structure your file like this:

User-agent: *

Disallow: /

User-agent: Slurp

Allow: /

You can also provide custom directives for multiple user agents in the same file. Each crawler follows only the group of rules addressed to it, so rules written for one bot do not affect the others.
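For instance, a file with independent per-bot groups might look like this (the paths are placeholders):

User-agent: Googlebot

Disallow: /internal-search/

User-agent: Bingbot

Disallow: /archive/

User-agent: *

Disallow: /tmp/

Googlebot reads only its own group here; the wildcard group applies to any crawler that has no group of its own.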

Directives

Directives are instructions provided in the robots.txt file that guide user agents on how to interact with your site. Here are the most commonly used directives supported by various search engines:

Disallow: This directive prevents a crawler from accessing specific pages, directories, or files.

Example: Block Slurp from accessing your blog section:

User-agent: slurp

Disallow: /blog

Allow: The Allow directive grants permission to crawl specific content, even within a disallowed directory.

Example: Block all blog posts from Slurp, except for one specific post:

User-agent: slurp

Disallow: /blog

Allow: /blog/example-post

Sitemap: This directive points crawlers to your website’s XML sitemap, which lists the URLs you want indexed. It is typically added at the top or bottom of the robots.txt file.

Example for a WordPress site:

User-agent: *

Disallow: /wp-admin/

Sitemap: https://www.yourwebsite.com/post-sitemap.xml

Crawl-delay: This tells crawlers to wait a specified number of seconds between requests to prevent server overload.

Example: Apply a 5-second delay for all bots:

User-agent: *

Crawl-delay: 5

Google does not support the Crawl-delay directive. However, Bing and Yandex do. For Google, crawling frequency should be managed via Google Search Console.

Noindex: This directive was used to prevent Google from indexing certain pages:

User-agent: Googlebot

Noindex: /example-blog/

However, as of September 1, 2019, Google no longer supports the Noindex directive in robots.txt. To prevent a page from being indexed, you should use the <meta name="robots" content="noindex"> tag within the HTML of the page instead.
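For example, placing the tag in the page’s <head> keeps that page out of the index once it has been crawled (the markup below is a minimal illustration):

<head>
  <meta name="robots" content="noindex">
  <title>Example blog post</title>
</head>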

Robots.txt vs. Meta robots tags

Robots.txt                      | Meta Robots Tag
Located in root directory       | Placed in <head> of HTML
Blocks crawling (not indexing)  | Controls indexing and link following
Easier for site-wide rules      | Better for page-specific instructions

For deindexing, Google recommends using meta robots tags rather than disallowing URLs in robots.txt. Both mechanisms belong to the Robots Exclusion Protocol, but they serve different purposes: robots.txt controls crawling, while meta robots tags control indexing and link following.

Robots.txt is added to the root directory of a website, while meta robots tags are placed in the <head> section of each individual page.

Robots.txt Examples

Block all crawlers from entire site:

User-agent: *

Disallow: /

Allow all bots full access:

User-agent: *

Disallow:

Block one bot (e.g., msnbot):

User-agent: msnbot

Disallow: /

Allow Googlebot only:

User-agent: Googlebot

Disallow:

User-agent: *

Disallow: /

Block PDF file for all crawlers:

User-agent: *

Disallow: /file.pdf
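To block every PDF on the site rather than a single file, Google and Bing also support * and $ wildcards inside paths:

User-agent: *

Disallow: /*.pdf$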

How to create a robots.txt file?

Method 1: Manually via FTP or File Manager

  1. Open a text editor.

  2. Enter your user-agent and disallow/allow rules.

  3. Save the file as robots.txt.

  4. Upload to the root directory (e.g., /public_html/).

Method 2: Via WordPress with Yoast SEO

  1. Install the Yoast SEO plugin.

  2. Go to Tools > File Editor.

  3. Add your rules and save the file.

Example:

User-agent: *

Disallow: /wp-admin/

Sitemap: https://www.yourwebsite.com/sitemap.xml

Where to Place Robots.txt

The robots.txt file must be accessible at:

https://www.yourdomain.com/robots.txt

Each subdomain (e.g., blog.yoursite.com) requires its own file.


How to Test Robots.txt

To validate your robots.txt file:

Use Google Search Console > Crawl > robots.txt Tester

Or test manually by visiting:

https://www.yourdomain.com/robots.txt

Ensure the file loads and is readable in the browser.
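If you prefer a scripted check, Python’s standard library includes a robots.txt parser; the sketch below (domain and path are placeholders) reports whether a given crawler may fetch a URL under your live rules:

from urllib import robotparser

# Load and parse the live robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url("https://www.yourdomain.com/robots.txt")
rp.read()

# True if the rules allow Googlebot to crawl this URL
print(rp.can_fetch("Googlebot", "https://www.yourdomain.com/wp-admin/"))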

Fixing Robots.txt Errors

Common mistakes include:

Blocking entire site unintentionally:

User-agent: *

Disallow: /

Blocking important content:

Disallow: /product-page/

To fix:

  • Remove the unintended Disallow rules, or override them with more specific Allow rules (see the example after this list).

  • Retest with the Google robots.txt tester.

  • Use proper formatting and syntax.
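For example, if the whole site was blocked by mistake, narrowing the rule restores crawling while still protecting the admin area (paths are illustrative):

Before:

User-agent: *

Disallow: /

After:

User-agent: *

Disallow: /wp-admin/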

Pros and Cons of Using Robots.txt

Advantages:

  • Controls crawler access to reduce server load

  • Helps manage crawl budget for large sites

  • Blocks private or duplicate content

  • Prevents crawling of admin areas or scripts

Disadvantages:

  • Doesn’t stop indexing of disallowed pages if externally linked

  • Link equity (PageRank) may be lost if pages are blocked

  • Rules are not always respected by all bots

Robots.txt and Sitemaps: A Combined Strategy

Linking your XML sitemap inside robots.txt improves discoverability and crawl efficiency.

Example:

Sitemap: https://www.example.com/sitemap.xml

This synergy helps search engines better understand your site’s architecture and index it more effectively.
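If your sitemaps are split by content type, you can list several of them in the same file (URLs are placeholders):

Sitemap: https://www.example.com/post-sitemap.xml

Sitemap: https://www.example.com/page-sitemap.xml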

Role of Robots.txt in Bot Management

While robots.txt helps filter well-behaved bots, malicious bots may ignore it. For stronger bot control, consider using tools like:

  • Cloudflare Bot Management

  • DataDome

  • SpamTitan

These services help block abusive or harmful crawling activities.

Final Thoughts

The robots.txt file is a powerful yet simple tool that helps you guide how search engine bots interact with your website. Used correctly, it can improve crawl efficiency, protect sensitive content, and support your SEO strategy.

However, use caution—misconfiguration can harm your site’s visibility. For best results, combine robots.txt with meta tags and professional SEO guidance to control both crawling and indexing effectively.
