What is Robots.txt?
The robots.txt file is a simple text document placed in the root directory of a website that provides instructions to search engine crawlers about which pages or sections of the site they are allowed to crawl.
Its primary function is to manage crawler access and reduce unnecessary server load by blocking bots from accessing specific content.
What Robots.txt Does and Doesn’t Do
While robots.txt tells crawlers where they can and cannot go, it does not prevent indexing on its own. To ensure a page isn’t indexed, you’ll need to use the noindex meta tag, apply the X-Robots-Tag HTTP header, or secure the page with authentication.
Search engines may still index pages disallowed in robots.txt if those pages are linked externally and not protected by other means.
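For example, either of the following keeps a page out of the index (a minimal sketch; the header variant is useful for non-HTML files such as PDFs):
<meta name="robots" content="noindex">
X-Robots-Tag: noindex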
How Robots.txt Fits into the Robots Exclusion Protocol (REP)
Robots.txt is part of the Robots Exclusion Protocol (REP), a standard that defines how web crawlers interact with websites. It is not mandatory—while most major search engines like Google and Bing respect robots.txt rules, malicious or lesser-known bots may ignore them entirely.
Robots.txt File Format: Syntax Overview
A basic robots.txt file follows this structure:
User-agent: [name of crawler]
Disallow: [URL path you want to block]
You can define multiple sets of rules for different bots. For example:
User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin/
User-agent: msnbot
Disallow: /
The asterisk (*) is a wildcard indicating that the rules apply to all crawlers.
Limitations of Robots.txt
Despite its usefulness, robots.txt has several limitations:
Not enforced: Crawlers are not legally required to follow it.
Not universal: Some bots disregard robots.txt rules altogether.
Not foolproof for indexing: Disallowed URLs can still appear in search results if linked elsewhere.
Syntax interpretation varies: Different crawlers may interpret rules differently.
Conflicting directives: Improper combinations of allow/disallow can lead to unexpected results.
For complete control, combine robots.txt with meta tags or server-level restrictions.
Be careful when combining crawling and indexing directives. Mixing multiple rules in your robots.txt file can sometimes create conflicts, causing search engines to misinterpret your intentions. This may result in certain pages being crawled or indexed unexpectedly. To prevent such issues, it’s essential to understand how to properly structure and coordinate your crawling and indexing instructions.
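As a rough illustration of how such conflicts are resolved, Google documents that the longest matching rule wins and that ties go to the less restrictive Allow rule. In the sketch below (the paths are hypothetical), /blog/press/ remains crawlable even though the rest of /blog/ is blocked:
User-agent: *
Disallow: /blog/
Allow: /blog/press/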
Technical Robots.txt Syntax
User Agents
User agents are browsers, bots, plugins, or software applications that access websites and deliver content to users through web technologies. Each browser or crawler has a distinct user agent string, allowing webmasters to define specific instructions for them in the robots.txt file.
Although there are hundreds of user agents, some of the most commonly encountered include:
Googlebot (Google)
Bingbot (Bing)
MSNBot (Microsoft)
Slurp (Yahoo)
DuckDuckbot (DuckDuckGo)
Baiduspider (Baidu)
To target all user agents with a single rule, use the asterisk symbol (*) as a wildcard:
User-agent: *
Allow: /
If you want to block all crawlers except a specific one (e.g., Slurp), you can structure your file like this:
User-agent: *
Disallow: /
User-agent: Slurp
Allow: /
You can also provide custom directives for multiple user agents in the same file. Each user agent's rules operate independently and won’t impact others unless specified.
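For instance, a single file might give Googlebot and Bingbot their own rules while every other crawler falls back to the wildcard group (the paths below are hypothetical):
User-agent: Googlebot
Disallow: /archive/

User-agent: Bingbot
Disallow: /archive/
Disallow: /search/

User-agent: *
Disallow: /private/
Note that major crawlers follow only the most specific group that matches their name; Googlebot, for example, ignores the wildcard group here because a dedicated Googlebot group exists.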
Directives
Directives are instructions provided in the robots.txt file that guide user agents on how to interact with your site. Here are the most commonly used directives supported by various search engines:
Disallow: This directive prevents a crawler from accessing specific pages, directories, or files.
Example: Block Slurp from accessing your blog section:
User-agent: slurp
Disallow: /blog
Allow: The Allow directive grants permission to crawl specific content, even within a disallowed directory.
Example: Block all blog posts from Slurp, except for one specific post:
User-agent: slurp
Disallow: /blog
Allow: /blog/example-post
Sitemap: This directive points crawlers to your website’s XML sitemap, which lists the URLs you want indexed. It’s typically added at the top or bottom of the robots.txt file.
Example for a WordPress site:
User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.yourwebsite.com/post-sitemap.xml
Crawl-delay: This tells crawlers to wait a specified number of seconds between requests to prevent server overload.
Example: Apply a 5-second delay for all bots:
User-agent: *
Crawl-delay: 5
Google does not support the Crawl-delay directive. However, Bing and Yandex do. For Google, crawling frequency should be managed via Google Search Console.
Noindex: This directive was used to prevent Google from indexing certain pages:
User-agent: Googlebot
Noindex: /example-blog/
However, as of September 1, 2019, Google no longer supports the Noindex directive in robots.txt. To prevent a page from being indexed, you should use the <meta name="robots" content="noindex"> tag within the HTML of the page instead.
Robots.txt vs. Meta robots tags
| Robots.txt | Meta Robots Tag |
|---|---|
| Located in the root directory | Placed in the <head> section of a page |
| Blocks crawling (not indexing) | Controls indexing and link following |
| Easier for site-wide rules | Better for page-specific instructions |
For deindexing, Google recommends using meta robots tags rather than disallowing with robots.txt. Meta robots tags and robots.txt differ in placement and scope, but both are used to give instructions to search engine bots.
Robots.txt is added to the root directory of a website, while meta robots tags are placed in the <head> section of individual pages.
Robots.txt Examples
Block all crawlers from entire site:
User-agent: *
Disallow: /
Allow all bots full access:
User-agent: *
Disallow:
Block one bot (e.g., msnbot):
User-agent: msnbot
Disallow: /
Allow Googlebot only (block all other bots):
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Block PDF file for all crawlers:
User-agent: *
Disallow: /file.pdf
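To block every PDF rather than a single file, the * and $ pattern characters supported by major search engines can be used (a minimal sketch):
User-agent: *
Disallow: /*.pdf$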
How to Create a Robots.txt File
Method 1: Manually via FTP or File Manager
Open a text editor.
Enter your user-agent and disallow/allow rules.
Save the file as robots.txt.
Upload it to the root directory (e.g., /public_html/).
Method 2: Via WordPress with Yoast SEO
Install the Yoast SEO plugin.
Go to Tools > File Editor.
Add your rules and save the file.
Example:
User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.yourwebsite.com/sitemap.xml
Where to Place Robots.txt
The robots.txt file must be accessible at:
https://www.yourdomain.com/robots.txt
Each subdomain (e.g., blog.yoursite.com) requires its own file.
How to Test Robots.txt
To validate your robots.txt file:
Use the robots.txt report in Google Search Console (the replacement for the legacy robots.txt Tester)
Or test manually by visiting:
https://www.yourdomain.com/robots.txt
Ensure the file loads and is readable in the browser.
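If you prefer to test programmatically, a quick sketch using Python’s standard-library urllib.robotparser (the domain and paths are placeholders) looks like this:
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt file (placeholder domain)
rp = RobotFileParser()
rp.set_url("https://www.yourdomain.com/robots.txt")
rp.read()

# Check whether a given crawler may fetch a given URL
print(rp.can_fetch("*", "https://www.yourdomain.com/wp-admin/"))      # False if disallowed
print(rp.can_fetch("Googlebot", "https://www.yourdomain.com/blog/"))  # True if allowed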
Fixing Robots.txt Errors
Common mistakes include:
Blocking entire site unintentionally:
User-agent: *
Disallow: /
Blocking important content:
Disallow: /product-page/
To fix:
Replace overly broad Disallow rules with more specific ones (or Allow rules), or delete them.
Retest with the Google robots.txt tester.
Use proper formatting and syntax.
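For example, an accidental site-wide block could be corrected by narrowing the rule to the directory you actually want to hide (a sketch using the admin path from earlier examples):
Before:
User-agent: *
Disallow: /

After:
User-agent: *
Disallow: /wp-admin/
Then retest the file to confirm that important pages are crawlable again.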
Pros and Cons of Using Robots.txt
Advantages:
Controls crawler access to reduce server load
Helps manage crawl budget for large sites
Blocks private or duplicate content
Prevents crawling of admin areas or scripts
Disadvantages:
Doesn’t stop indexing of disallowed pages if externally linked
Link equity (PageRank) may be lost if pages are blocked
Rules are not always respected by all bots
Robots.txt and Sitemaps: A Combined Strategy
Linking your XML sitemap inside robots.txt improves discoverability and crawl efficiency.
Example:
Sitemap: https://www.example.com/sitemap.xml
This synergy helps search engines better understand your site’s architecture and index it more effectively.
Role of Robots.txt in Bot Management
While robots.txt helps filter well-behaved bots, malicious bots may ignore it. For stronger bot control, consider using tools like:
Cloudflare Bot Management
DataDome
SpamTitan
These services help block abusive or harmful crawling activities.
Final Thoughts
The robots.txt file is a powerful yet simple tool that helps you guide how search engine bots interact with your website. Used correctly, it can improve crawl efficiency, protect sensitive content, and support your SEO strategy.
However, use caution—misconfiguration can harm your site’s visibility. For best results, combine robots.txt with meta tags and professional SEO guidance to control both crawling and indexing effectively.