Crawl Directives

By Shahid Maqbool
On Apr 27, 2023

What are Crawl Directives?

Crawl directives are instructions you can give to search engines to control how they interact with your website.

These directives allow you to do several things:

  • Tell a search engine not to crawl a specific page on your website

  • Instruct a search engine not to use a page in its index after it has been crawled

  • Control whether search engines should follow or nofollow the links on a particular page

  • Set other, more granular directives, such as how often a page should be recrawled, whether certain types of content should be ignored, or how long search engines should wait before crawling a page again

Types of crawl directives

Crawl directives can be categorized into two types:

Robots meta directives

Websites have special "meta tags" that give instructions to search engine crawlers. These tags are called "robots meta directives."

They tell the search bots how to handle and interact with the website's pages. The directives provide guidance on crawling and indexing pages.

Unlike robots.txt directives, which crawlers may treat more as suggestions, these meta directives give firmer, page-level commands, so they provide more control over how search bots handle a website's content.

There are two types of robots meta directives:

Meta robots tags

The meta robots tag, also referred to as "meta robots" or the "robots tag", is an HTML tag that goes in the head section of a web page.

This tag gives instructions to search engine crawlers and tells them how to interact with the page.

It is commonly used to control the indexing and crawling behaviour of search engine bots, which can affect how the web page appears in SERPs.

Example:

Let's say you have a web page that contains sensitive information - a login page, for example, or a page with confidential data - that you don't want to be indexed.

You can use the meta robots tag to prevent search engine bots from indexing that page.

Here's an example of how the meta robots tag might be used in the HTML code of such a page:

<!DOCTYPE html>
<html>
<head>
    <meta name="robots" content="noindex, nofollow">
    <title>Login Page</title>
    <!-- other head elements go here -->
</head>
<body>
    <!-- login page content goes here -->
</body>
</html>

In this example, the meta robots tag is included in the <head> section of the web page with the content attribute set to "noindex, nofollow".

This tells search engine bots not to index the page and not to follow any links on the page.

X-robots tags

The X-Robots-Tag is part of the HTTP header response that is sent when a URL is requested, and it can be used to control indexing for an entire page or specific elements on that page.

Compared to meta robots tags, which are relatively simple to add to a page, the X-Robots-Tag is more complex to implement.

However, it offers more flexibility and functionality.

There are certain situations where using the X-Robots-Tag is recommended. The two most common scenarios are:

  • when you want to control how non-HTML files (docs, PDFs, videos) are crawled and indexed

  • when you want to apply directives site-wide instead of on a page level

For example, if you want to block a specific image or video from being crawled, you can easily do so using the X-Robots-Tag header in the HTTP response.
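
For illustration, here is roughly what the relevant part of an HTTP response for such a file might look like, followed by a minimal Apache sketch (assuming the mod_headers module is enabled; the PDF rule is just an example) that applies the same header to every PDF on a site:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow

# Apache sketch (requires mod_headers): add the header to all PDF responses
<Files ~ "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
</Files>

With a rule like this, every PDF the server returns carries the noindex, nofollow instruction without any change to the files themselves.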

Another advantage of using the X-Robots-Tag is that it allows you to combine multiple tags within an HTTP response or use a comma-separated list of directives to specify instructions to search engine bots.

For instance, if you don't want a certain page to be cached and want it to be unavailable after a certain date, you can use a combination of "noarchive" and "unavailable_after" tags in the X-Robots-Tag to convey these instructions to search engine bots.
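
As a sketch, such a combined header could look like the following (the date is only a placeholder and should be written in a commonly accepted date format):

X-Robots-Tag: noarchive, unavailable_after: 31 Dec 2025 23:59:59 PST

Here, noarchive asks search engines not to show a cached copy of the page, and unavailable_after asks them to stop serving the page in results after the given date.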

Both types of directives can use the same parameters, such as "noindex" and "nofollow", to instruct crawlers.

The difference is in how these parameters are communicated: meta robots directives are embedded within the HTML code, while the X-Robots-Tag is sent as an HTTP header by the web server.

Robots.txt directives

Robots.txt directives guide search engine crawlers as they navigate a website, controlling which parts of the site they are allowed or disallowed to access and crawl.

These directives apply to entire sections or directories of a website, controlling crawler access at a broader level than page-level tags.

Unlike meta robots tags, the robots.txt file - which contains these directives - is placed in the root directory of a website.

The robots.txt file can block search bots from crawling certain pages, but it doesn't stop those pages from being indexed if other sites link to them.

Also, not all search engine crawlers follow robots.txt rules completely; some may still access and index blocked pages.

To put it simply, the robots.txt file provides guidance to search engine bots on how to crawl your website.

It enables you to specify which sections of your site are open for crawling and which ones are not.
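
As a minimal sketch, a robots.txt file could look like this (the /admin/ and /private/ directories are placeholder examples):

# Rules for all crawlers
User-agent: *
# Keep crawlers out of these example sections
Disallow: /admin/
Disallow: /private/
# Everything else remains open for crawling
Allow: /

Crawlers that respect these rules will skip the two disallowed directories and crawl the rest of the site as usual.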

Note: If a web page is blocked from crawling through the robots.txt file, search engines won't be able to discover any indexing or serving rules specified in robots meta tags or X-Robots-Tag HTTP headers. So, if you want search engines to follow indexing or serving rules, you cannot block the URLs containing those rules from crawling in the robots.txt file.

An overview of the differences between crawl directives

Purpose

  • robots.txt: provides directives for web crawlers on how to interact with a website or specific paths when crawling

  • Meta robots tag: controls the indexing and link following of a specific webpage

  • X-Robots-Tag: controls the indexing and link following of a specific webpage or non-HTML files (like PDFs)

Location

  • robots.txt: a separate text file in the website's root directory

  • Meta robots tag: the HTML head section of each individual webpage

  • X-Robots-Tag: the HTTP header of each individual webpage or file

Syntax

  • robots.txt: User-agent:, Disallow:, Allow:

  • Meta robots tag: <meta name="robots" content="directive1,directive2">

  • X-Robots-Tag: X-Robots-Tag: directive1, directive2

Control granularity

  • robots.txt: website or directory level

  • Meta robots tag: page level

  • X-Robots-Tag: page level, and non-HTML files

Blocking web crawlers

  • robots.txt: yes, by specifying user agents and disallowed paths

  • Meta robots tag: yes, by using "noindex" and "nofollow" directives

  • X-Robots-Tag: yes, by using "noindex" and "nofollow" directives

Crawl-delay directive

  • robots.txt: yes, for some web crawlers

  • Meta robots tag: no

  • X-Robots-Tag: no

Supported by all major search engines

  • robots.txt: yes

  • Meta robots tag: yes

  • X-Robots-Tag: yes

Influence on non-HTML files

  • robots.txt: no

  • Meta robots tag: no

  • X-Robots-Tag: yes

Takeaway

Crawl directives are important for website optimization. They control how search engines access and scan sites.

Proper use of crawl directives ensures search engines index sites correctly. Owners can target important pages and block anything sensitive.

This optimization helps improve search visibility and rankings.
