What are Crawl Directives?
Crawl directives are instructions you can give to search engines to control how they interact with your website.
These directives allow you to do several things:
Tell a search engine not to crawl a specific page on your website
Instruct a search engine not to use a page in its index after it has been crawled
Control whether search engines should follow or not follow (nofollow) the links on a particular page
Set many other "minor" directives such as specifying how often a page should be crawled, whether search engines should ignore certain types of content, or how long search engines should wait before crawling a page again
Types of crawl directives
Crawl directives can be categorized into two types: robots meta directives and robots.txt directives.
Robots meta directives
Websites have special "meta tags" that give instructions to search engine crawlers. These tags are called "robots meta directives."
They tell the search bots how to handle and interact with the website's pages. The directives provide guidance on crawling and indexing pages.
Unlike robots.txt directives, which crawlers treat as suggestions, robots meta directives act as firmer commands, so they give you more control over how search bots handle a website's content.
There are two types of robots meta directives: meta robots tags and the X-Robots-Tag.
Meta robots tags
The meta robots tag, also referred to as "meta robots" or "robots tag", is an HTML element that goes in the head section of a web page.
This tag gives instructions to search engine crawlers and tells them how to interact with the page.
It is commonly used to control the indexing and crawling behaviour of search engine bots, which can affect how the web page appears in SERPs.
Let's say you have a web page that contains sensitive information. It can be a login page or a page with confidential data that you don't want to be indexed.
You can use the meta robots tag to prevent search engine bots from indexing that page.
Here's an example of how the meta robots tag might be used in the HTML code of such a page:
<head>
  <meta name="robots" content="noindex, nofollow">
  <!-- other head elements go here -->
</head>
<body><!-- login page content goes here --></body>
This tells search engine bots not to index the page and not to follow any links on the page.
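To see which directives a page actually declares, the tag can be read back out of the HTML. The following is a minimal sketch using only Python's standard library; the class name RobotsMetaParser, the helper robots_directives, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                part = part.strip().lower()
                if part:
                    self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
page = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body></body></html>"
)
directives = robots_directives(page)
print(sorted(directives))  # ['nofollow', 'noindex']
```

A check like this can be useful in a site audit to confirm that pages meant to stay out of the index really carry "noindex".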
X-Robots-Tag
The X-Robots-Tag is part of the HTTP header response that is sent when a URL is requested, and it can be used to control indexing for an entire page or specific elements on that page.
Compared to using meta robots tags, which are relatively simple, the X-Robots-Tag is more complex.
However, it offers more flexibility and functionality.
There are certain situations where using the X-Robots-Tag is recommended. The two most common scenarios are:
when you want to control how non-HTML files (documents, PDFs, videos) are crawled and indexed
when you want to apply directives site-wide instead of on a page level
For example, if you want to keep a specific image or video out of the index, you can do so by sending the X-Robots-Tag header in the HTTP response for that file.
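As a minimal sketch of the non-HTML case, the Python snippet below (standard library only) adds an "X-Robots-Tag: noindex" header to responses for certain file types. The extension list, the helper name x_robots_tag_for, and the handler class are assumptions made for illustration, not a recommended policy.

```python
from http.server import SimpleHTTPRequestHandler
from pathlib import PurePosixPath

# Hypothetical policy: keep these file types out of the index.
NOINDEX_EXTENSIONS = {".pdf", ".doc", ".docx", ".mp4"}


def x_robots_tag_for(path):
    """Return an X-Robots-Tag value for the path, or None if no header is needed."""
    # Drop any query string or fragment before looking at the extension.
    clean = path.split("?", 1)[0].split("#", 1)[0]
    suffix = PurePosixPath(clean).suffix.lower()
    return "noindex" if suffix in NOINDEX_EXTENSIONS else None


class NoIndexFilesHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds X-Robots-Tag to selected file types."""

    def end_headers(self):
        value = x_robots_tag_for(self.path)
        if value:
            self.send_header("X-Robots-Tag", value)
        super().end_headers()
```

To try it locally, the handler could be passed to http.server.HTTPServer. On a production site the same header is usually set in the web server's own configuration instead.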
Another advantage of using the X-Robots-Tag is that it allows you to combine multiple tags within an HTTP response or use a comma-separated list of directives to specify instructions to search engine bots.
For instance, if you don't want a certain page to be cached and want it to be unavailable after a certain date, you can use a combination of "noarchive" and "unavailable_after" tags in the X-Robots-Tag to convey these instructions to search engine bots.
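The combined header value described above could be assembled as in the sketch below. The function name build_x_robots_tag is hypothetical, and the ISO 8601 date format is an assumption; check each search engine's documentation for the date formats it accepts.

```python
from datetime import date


def build_x_robots_tag(directives, unavailable_after=None):
    """Join directives into one comma-separated X-Robots-Tag value."""
    parts = list(directives)
    if unavailable_after is not None:
        # Assumption: an ISO 8601 date (YYYY-MM-DD) is used for the cutoff.
        parts.append(f"unavailable_after: {unavailable_after.isoformat()}")
    return ", ".join(parts)


header_value = build_x_robots_tag(["noarchive"], unavailable_after=date(2030, 1, 31))
print(f"X-Robots-Tag: {header_value}")
# X-Robots-Tag: noarchive, unavailable_after: 2030-01-31
```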
Both types of directives can use the same parameters, such as "noindex" and "nofollow", to instruct crawlers.
The difference is in how these parameters are communicated to the crawlers: meta robots directives are embedded within the HTML code, while the X-Robots-Tag is sent as an HTTP header by the web server.
Robots.txt directives
Robots.txt directives guide search engine crawlers as they navigate a website, controlling which parts of the site they are allowed or disallowed to access and crawl.
These directives apply to entire sections or directories of a website, controlling crawler access at a broader level.
Unlike meta robots tags, which live in individual pages, the robots.txt file that contains these directives is placed in the root directory of a website.
The robots.txt file can block search bots from crawling certain pages, but it does not stop those pages from being indexed if other sites link to them.
Also, not all crawlers follow the robots.txt rules completely. Some may still crawl and index blocked pages.
To put it simply, the robots.txt file provides guidance to search engine bots on how to crawl your website.
It enables you to specify which sections of your site are open for crawling and which ones are not.
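A quick way to see how such rules are interpreted is Python's built-in urllib.robotparser module. The robots.txt content, the example.com URLs, and the helper may_crawl below are hypothetical and only serve as a sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /login
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="*"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_crawl("https://example.com/blog/post-1"))     # True
print(may_crawl("https://example.com/admin/settings"))  # False
```

This only reports what a well-behaved crawler would do with these rules; it does not enforce anything.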
Note: If a web page is blocked from crawling through the robots.txt file, search engines won't be able to discover any indexing or serving rules specified in robots meta tags or X-Robots-Tag HTTP headers. So, if you want search engines to follow indexing or serving rules, you cannot block the URLs containing those rules from crawling in the robots.txt file.
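The conflict described in this note can be checked mechanically. The sketch below, again using urllib.robotparser, lists URLs that carry a "noindex" rule but are blocked from crawling, so the rule would never be seen. The function name hidden_noindex_urls and the sample data are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex_urls(robots_txt, noindex_urls, user_agent="*"):
    """Return URLs whose noindex rule crawlers cannot see because robots.txt blocks them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules and pages used only for this example.
robots_txt = "User-agent: *\nDisallow: /private/\n"
noindex_urls = [
    "https://example.com/private/report.html",
    "https://example.com/thank-you.html",
]
conflicts = hidden_noindex_urls(robots_txt, noindex_urls)
print(conflicts)  # ['https://example.com/private/report.html']
```

Any URL in the result would need its robots.txt rule removed before search engines could read and honor its noindex directive.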
An overview of the differences between crawl directives
| Aspect | Robots.txt | Meta Robots Tag | X-Robots-Tag |
| --- | --- | --- | --- |
| Purpose | To provide directives for web crawlers on how to interact with a website or specific pages for crawling | To control the indexing and following of links on a specific webpage | To control the indexing and following of links on a specific webpage or non-HTML files (like PDFs) |
| Location | In the website's root directory as a separate text file | In the HTML head section of each individual webpage | In the HTTP header of each individual webpage or file |
| Syntax | | <meta name="robots" content="directive1,directive2"> | X-Robots-Tag: directive1, directive2 |
| Scope | Website or directory level | | Page level, and non-HTML files |
| Blocking web crawlers | Yes, by specifying user agents and disallowed paths | Yes, by using "noindex" and "nofollow" directives | Yes, by using "noindex" and "nofollow" directives |
| Crawler support | Yes, for some web crawlers | Supported by all major search engines | |
| Influence on non-HTML files | | | |
Crawl directives are important for website optimization. They control how search engines access and scan sites.
Used properly, crawl directives help search engines index a site correctly: owners can steer crawlers toward important pages and keep sensitive ones out of the index.
This can improve search visibility and rankings.