What are Crawl Directives?
Crawl directives are instructions you can give to search engines to control how they interact with your website.
These directives allow you to do several things:
Tell a search engine not to crawl a specific page on your website
Instruct a search engine not to use a page in its index after it has been crawled
Control whether search engines should follow or ignore (nofollow) the links on a particular page
Set various other "minor" directives, such as how long search engines should wait before crawling a page again or whether certain types of content should be ignored
Types of crawl directives
Crawl directives can be categorized into two types:
Robots meta directives (also known as meta tags)
Robots.txt file directives
Robots meta directives
Robots meta directives, also known as "meta tags," are snippets of code that tell web crawlers how to handle a web page's content.
These directives are used to provide guidance to search engine bots on how to interact with a website's pages.
Unlike robots.txt file directives, which crawlers may choose to ignore, robots meta directives give explicit instructions on how a page should be indexed and how its links should be treated.
There are two types of robots meta directives:
Meta robots tags
The meta robots tag, also referred to as "meta robots" or "robots tag," is an HTML code element that is typically placed in the <head> section of a web page.
It provides instructions to web robots or search engine crawlers on how to interact with the web page.
It is commonly used to control the indexing and crawling behaviour of search engine bots, which can affect how the web page appears in search engine results.
Example:
Let's say you have a web page that contains sensitive information, such as a login page or a page with confidential data that you don't want to be indexed by search engines.
You can use the meta robots tag to prevent search engine bots from indexing that page.
Here's an example of how the meta robots tag might be used in the HTML code of such a page:
<!DOCTYPE html>
<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
    <title>Login Page</title>
    <!-- other head elements go here -->
  </head>
  <body>
    <!-- login page content goes here -->
  </body>
</html>
In this example, the meta robots tag is included in the <head> section of the web page with the content attribute set to "noindex, nofollow".
This tells search engine bots not to index the page and not to follow any links on it. As a result, the page is kept out of search engine results, and crawlers do not follow the links it contains.
X-robots tags
The X-Robots-Tag is part of the HTTP header response that is sent when a URL is requested, and it can be used to control indexing for an entire page or specific elements on that page.
The X-Robots-Tag is more complex to implement than the relatively simple meta robots tag, but it offers more flexibility and functionality.
There are certain situations where using the X-Robots-Tag is recommended. The two most common scenarios are:
when you want to control how non-HTML files (such as documents, PDFs, or videos) are indexed
when you want to apply directives site-wide instead of on a page level
For example, if you want to block a specific image or video from being crawled, you can easily do so using the X-Robots-Tag header in the HTTP response.
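For instance, here is a minimal sketch of how this could be set up on an Apache server (assuming the mod_headers module is enabled; the PDF file pattern is just an illustration):

# Apache configuration: attach an X-Robots-Tag header to every PDF the server delivers
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

With a rule like this in place, every matching file is served with the X-Robots-Tag header, so crawlers that honour it will keep those files out of their index without you having to touch each file individually.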
Another advantage of using the X-Robots-Tag is that it allows you to combine multiple tags within an HTTP response or use a comma-separated list of directives to specify instructions to search engine bots.
For instance, if you don't want a certain page to be cached and want it to be unavailable after a certain date, you can use a combination of "noarchive" and "unavailable_after" tags in the X-Robots-Tag to convey these instructions to search engine bots.
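As an illustrative sketch, the HTTP response for such a page might include a header like the one below (the date is just a placeholder):

HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noarchive, unavailable_after: 25 Jun 2025 15:00:00 PST

Here "noarchive" asks search engines not to show a cached copy of the page, and "unavailable_after" asks them to stop showing it in results after the given date.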
Both types of directives can use the same parameters, such as "noindex" and "nofollow" to instruct crawlers.
The difference lies in how these parameters are communicated to the crawlers: meta robots directives are embedded within the page's HTML code, while the X-Robots-Tag is sent as an HTTP header by the web server.
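To illustrate, both of the following sketches convey the same noindex, nofollow instruction; only the delivery mechanism differs:

In the page's HTML:
<meta name="robots" content="noindex, nofollow">

In the HTTP response sent by the server:
X-Robots-Tag: noindex, nofollow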
Robots.txt directives
Robots.txt directives guide search engine robots or crawlers as they navigate a website, controlling which parts of the site they are allowed or disallowed to access and crawl.
These directives apply to entire sections or directories of a website, controlling crawler access at a broader level.
Unlike meta robots tags, which sit in the HTML of individual pages, the robots.txt file containing these directives is placed in the root directory of the website.
While robots.txt directives can block crawling, they don't prevent a page from being indexed if it is linked from other indexed pages. Also, not all crawlers obey the rules set in the robots.txt file.
To put it simply, the robots.txt file provides guidance to search engine bots on how to crawl your website.
It enables you to specify which sections of your site are open for crawling and which ones are not.
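For example, here is a minimal robots.txt sketch (the /private/ and /tmp/ directories are hypothetical) that blocks two sections of a site for all crawlers while leaving everything else open:

User-agent: *
Disallow: /private/
Disallow: /tmp/

The asterisk means the rules apply to every crawler; any path not matched by a Disallow line remains open for crawling.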
Note: If a web page is blocked from crawling through the robots.txt file, search engines won't be able to discover any indexing or serving rules specified in robots meta tags or X-Robots-Tag HTTP headers. So, if you want search engines to follow indexing or serving rules, you cannot block the URLs containing those rules from crawling in the robots.txt file.
An overview of the differences between crawl directives
Feature | robots.txt | Meta Robots Tag | X-Robots-Tag |
---|---|---|---|
Purpose | To provide directives for web crawlers on how to interact with a website or specific pages for crawling | To control the indexing and following of links on a specific webpage | To control the indexing and following of links on a specific webpage or non-HTML files (like PDFs) |
Location | In the website's root directory as a separate text file | In the HTML head section of each individual webpage | In the HTTP header of each individual webpage or file |
Syntax | User-agent:, Disallow:, and Allow: lines | <meta name="robots" content="directive1,directive2"> | X-Robots-Tag: directive1, directive2 |
Control granularity | Website or directory level | Page level | Page level, including non-HTML files |
Blocking web crawlers | Yes, by specifying user agents and disallowed paths | Yes, by using "noindex" and "nofollow" directives | Yes, by using "noindex" and "nofollow" directives |
Crawl-delay directive | Yes, for some web crawlers | No | No |
Supported by all major search engines | Yes | Yes | Yes |
Can apply indexing directives to non-HTML files | No | No | Yes |
Takeaway
Crawl directives are a crucial aspect of website optimization. By using robots meta directives and robots.txt file directives, website owners can control how search engines interact with their sites.
It is essential to use crawl directives effectively to ensure that search engines crawl and index your website correctly.