Robots.txt

By Shahid Maqbool
On Apr 19, 2023

What is Robots.txt?

Robots.txt is a text file that assists web spiders or crawlers in determining which pages to crawl on a website.

Its main purpose is to prevent search engine crawlers from overloading a website with excessive requests.

However, it is important to note that using robots.txt alone does not prevent a web page from being indexed by Google or other search engines.

To prevent a web page from being indexed by Google, webmasters need to use additional methods, such as adding the "noindex" meta tag to the <head> section of the page or password-protecting the page.

These methods directly communicate to search engines not to index the page and keep it out of search engine results.

Robots.txt is part of the Robots Exclusion Protocol (REP), a set of standards that govern how crawlers access and index web content. The file is placed in the root directory of a website.

It's important to note that not all bots will obey the rules specified in the robots.txt file, as it is a voluntary protocol.

However, most major search engines and crawlers will respect the rules set in the file, making it a valuable tool for website owners to control how their content is crawled.

The basic format of Robots.txt

The basic format of robots.txt is as follows:

User-agent: [user-agent name]

Disallow: [URL not to be crawled]

These two lines already form a complete robots.txt file. A file can also contain several groups, each made up of one or more user-agent lines followed by their directives, with the groups separated by a blank line.

For example:

User-agent: *

Allow: /wp-content/uploads/

Disallow: /wp-admin/

User-agent: msnbot

Disallow: /

In the example provided, the first section applies to all user agents (search engine crawlers), as it uses the wildcard symbol "*" to represent all agents.

The directive "Allow" is used to allow the search engine crawler to access files within the "/wp-content/uploads/" directory, while "Disallow" is used to prevent access to the "/wp-admin/" directory.

The second section applies only to the search engine crawler "msnbot" and directs it not to access any pages on the website.

It's important to note that a robots.txt file can contain multiple lines of directives, each separated by a line break.

This allows website owners to set rules for different user agents or to specify different access rules for different parts of the website.
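As a rough illustration of how a compliant crawler might read the example above, the following sketch uses Python's standard-library urllib.robotparser module. The www.example.com address and the build_parser helper are assumptions made for this illustration. The rules are written without blank lines inside a group because this parser treats a blank line as the end of a group.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, one group per user agent
EXAMPLE_RULES = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin/

User-agent: msnbot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_RULES)
base = "https://www.example.com"  # hypothetical site used only for illustration

# Any crawler other than msnbot falls under the "*" group
print(parser.can_fetch("Googlebot", base + "/wp-content/uploads/logo.png"))
print(parser.can_fetch("Googlebot", base + "/wp-admin/options.php"))

# msnbot has its own group, which blocks the whole site
print(parser.can_fetch("msnbot", base + "/wp-content/uploads/logo.png"))
```

Under these assumptions the three checks print True, False, and False: Googlebot may fetch the uploads folder but not the admin area, and msnbot may fetch nothing.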

Limitations of Robots.txt

It is important to understand the limitations of robots.txt before creating or editing it.

Not all search engines may support robots.txt rules

While the crawlers of major search engines, such as Googlebot, generally follow the instructions in a robots.txt file, other crawlers may not support it.

This means that some bots may still crawl and index your content even if you have specified otherwise in your robots.txt file.

Robots.txt instructions are not enforced

The instructions in a robots.txt file are not enforceable by website owners. It's up to the web crawler to obey them, and while most respectable crawlers follow the rules, others may not.

Therefore, if you want to ensure that certain information is not accessible to web crawlers, it's better to use other methods such as password protection for private files on your server or using a noindex meta tag.

Different crawlers interpret syntax differently

Not all web crawlers interpret robots.txt syntax in the same way. Different crawlers may have different interpretations of the rules, so it's important to understand the proper syntax for addressing different web crawlers to avoid any confusion or misinterpretation.

Disallowed pages can still be indexed if linked from other sites

Even if you disallow a page in your robots.txt file, it may still get indexed by search engines if it is linked from other websites.

This means the URL and other publicly available information, such as anchor text in links, may still appear in search results.

To properly prevent a URL from appearing in search results, you may need to use additional methods such as password protection, the noindex meta tag or response header, or remove the page entirely.

Caution is needed when combining crawling and indexing rules

If you combine multiple crawling and indexing rules in your robots.txt file, some rules may counteract each other, leading to unexpected results. It's important to learn how to properly combine crawling and indexing rules to avoid conflicts and ensure that your intended instructions are followed.

Technical robots.txt syntax

User agents

User agents are browsers, plugins, applications, or other software that retrieve and present web content to end users. In robots.txt, a user agent is the name a crawler uses to identify itself.

Each crawler has its own user agent, for which robots.txt instructions can be defined.

Hundreds of user agents exist, but the most common are Googlebot, Bingbot, MSNBot, Slurp (Yahoo), DuckDuckbot, and Baiduspider.

You can use the wildcard star (*) to assign directives to all user agents.

User-agent: *

Allow: /

This is how you would do it if you want to block all bots other than Slurp.

User-agent: *

Disallow: /

User-agent: Slurp

Allow: /

You can add directives to multiple user agents in the robots.txt file. The other user agents won't be affected when you add instructions to a new user agent.

Directives

Directives are the instructions that user agents are asked to follow. The most common directives are listed below. As noted in the descriptions, Google does not support all of them.

Disallow: This directive tells Google or other search engines to avoid accessing particular files, pages, or posts.

For example, if you want Slurp to stay out of the blog section of a website, your robots.txt will look like this:

User-agent: slurp

Disallow: /blog

Allow: This directive allows search engines to crawl a specific page or post on a website even if it sits inside a disallowed directory.

For example, if you want to block Slurp from every blog post except one, your robots.txt file might look like this:

User-agent: slurp

Disallow: /blog

Allow: /blog/example-post
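How a conflict between Allow and Disallow is resolved depends on the crawler. Google documents that the most specific matching rule (the one with the longest path) wins, and that Allow wins a tie. The sketch below is a simplified illustration of that longest-match idea, not any crawler's actual code. The is_allowed function and the SLURP_RULES list are names invented for this example, and wildcards such as * and $ are deliberately left out.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled under longest-match precedence.

    `rules` is a list of (directive, prefix) pairs such as
    ("disallow", "/blog"). Only plain prefixes are handled here.
    """
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_goes_to_allow = len(prefix) == best_length and is_allow
        if longer or tie_goes_to_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed


# The Slurp rules from the example above
SLURP_RULES = [("disallow", "/blog"), ("allow", "/blog/example-post")]

print(is_allowed(SLURP_RULES, "/blog/example-post"))  # longer Allow rule wins
print(is_allowed(SLURP_RULES, "/blog/another-post"))  # only Disallow matches
print(is_allowed(SLURP_RULES, "/about"))              # no rule matches
```

A parser that simply applies the first matching rule would block /blog/example-post here because Disallow comes first, which is one reason to list the more specific rule first when in doubt.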

Sitemap: You can direct the search engines to your XML sitemap using this directive.

The sitemap typically lists the pages you want search engines to crawl. The Sitemap line is usually placed at the top or bottom of a robots.txt file and looks like this (for a WordPress site):

User-agent: *

Disallow: /wp-admin/

Sitemap: https://www.yourwebsite.com/post-sitemap.xml
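To see how a crawler might pick up the Sitemap line, here is a small sketch using Python's standard-library urllib.robotparser, whose site_maps() method is available from Python 3.8. The variable names are assumptions for this illustration, and the file is parsed from a string rather than fetched.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.yourwebsite.com/post-sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of Sitemap URLs, or None when the file has none
sitemap_urls = sitemap_parser.site_maps() or []
for url in sitemap_urls:
    print(url)
```

The Sitemap line is read independently of the user-agent groups, so it applies to every crawler that supports it.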

Crawl-delay: This directive specifies a delay in search engine crawling. Its purpose is to discourage search engines from overloading the servers and slowing down the websites. It looks like this:

User-agent: *

Crawl-delay: 5

Google does not support this directive, but Bing and Yandex do. However, if you want to do this for Google bots, you may use the Google Search Console.
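For crawlers that do honor Crawl-delay, the value can be read and used as a pause between requests. The sketch below uses Python's standard-library urllib.robotparser. The seconds_between_requests helper, the ExampleBot name, and the one-second fallback are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

delay_parser = RobotFileParser()
delay_parser.parse(["User-agent: *", "Crawl-delay: 5"])


def seconds_between_requests(parser, agent, default=1.0):
    """Return the Crawl-delay for this agent, or a fallback when none is set."""
    delay = parser.crawl_delay(agent)  # None when the file gives no value
    return float(delay) if delay is not None else default


wait = seconds_between_requests(delay_parser, "ExampleBot")
print(wait)
```

A polite crawler could then call time.sleep(wait) between two requests to the same site.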

Noindex: Use this directive to stop the Google bot from indexing a specific page on your website. It typically looks like this:

User-agent: Googlebot

Noindex: /example-blog/

However, Google made it clear that starting September 1, 2019, this directive would not be officially supported. If you want to exclude a page from appearing in search results, you need to use the meta robots noindex tag.

Robots.txt vs. Meta robots tags

Robots.txt and meta robots tags serve different purposes: robots.txt controls which URLs a crawler may request, while meta robots tags control how a crawled page is indexed and displayed.

Robots.txt is added to the root directory of a website, while meta robots tags are placed in the <head> section of an individual page.

Google suggests using the meta robots tags if you want to deindex a page from search results.

Blocking a URL in robots.txt does not remove it from the index and may still require a URL removal request, so the meta robots noindex directive is the safer way to exclude a URL from SERPs. For the noindex tag to be seen, the page must not be blocked by robots.txt.

A typical robots meta tag with noindex looks like this and is placed in the head section of a website:

<meta name="robots" content="noindex">

Meta robots tags also make sure to preserve link equity. Some of the most common commands of meta robots tags include:

Index: tells search engines to index the page.

Noindex: tells search engines not to index the page.

Nofollow: tells crawlers not to follow any links on the page.

Follow: tells crawlers to follow all the links on the page.

Noimageindex: prevents indexing of images on the page.

Noarchive: prevents a cached version of the page from showing up in SERPs.

Notranslate: prevents search engines from offering a translation of the page in SERPs.

Nosnippet: tells search engines not to show a snippet of the page.
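To check which of these commands a page actually carries, the meta robots tag can be read from the page's HTML. The following is a minimal sketch using Python's standard-library html.parser. The RobotsMetaParser class, the robots_directives function, and the sample HTML are assumptions made for this illustration, and the X-Robots-Tag response header is not covered.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of commands
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of meta robots commands found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


SAMPLE_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(robots_directives(SAMPLE_PAGE)))
```

For the sample page this prints ['nofollow', 'noindex']. A page without a meta robots tag yields an empty set, which search engines treat as the default of index and follow.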

Robots.txt Examples

Here are examples of what different directives in robots.txt look like:

Directive to block all the search engines from crawling your entire website.

User-agent: *

Disallow: /

To allow complete access to all the crawlers on a website.

User-agent: *

Disallow:

To disallow a single bot from accessing a website.

User-agent: msnbot

Disallow: /

To allow a single bot to access a complete website.

User-agent: Googlebot

Disallow:

To block a specific bot from crawling a specific page of a website.

User-agent: Googlebot

Disallow: /page/

To block one subdirectory for a specific bot.

User-agent: Googlebot

Disallow: /folder/

Block all the bots from crawling a particular file

User-agent: *

Disallow: /file.pdf

Prevent all the bots from crawling all the files in a directory except one

User-agent: *

Allow: /docs/file.jpeg

Disallow: /docs/

These examples assume a particular website layout, so adjust the paths to match your own site. It is good to make these changes with the help of a professional to avoid any problems.
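Two of the examples above differ by a single character: "Disallow: /" blocks the whole site, while an empty "Disallow:" blocks nothing. The sketch below checks both with Python's standard-library urllib.robotparser. The can_crawl helper and the www.example.com address are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text from a string and test one URL for one agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"
page = "https://www.example.com/page/"  # hypothetical URL

print(can_crawl(BLOCK_ALL, "Googlebot", page))
print(can_crawl(ALLOW_ALL, "Googlebot", page))
```

The first check prints False and the second prints True, so it is worth double-checking for a stray slash before uploading a file.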

How does a robots.txt work?

A robots.txt file is stored like any other file on a website, but it is plain text with no HTML markup. You can view it by typing the URL of a website and then adding /robots.txt.

https://www.seodebate.com/robots.txt

A subdomain of a website will have its own robots.txt file.

For instance:

blog.yourwebsite.com/robots.txt

community.yourwebsite.com/robots.txt
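Because each host has its own file, the robots.txt location can be derived from any page URL by keeping the scheme and host and replacing the rest. Here is a minimal sketch using Python's standard-library urllib.parse. The robots_txt_url function name and the sample URL are assumptions made for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain), drop path and query
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://blog.yourwebsite.com/posts/hello?ref=1"))
```

For the sample URL this prints https://blog.yourwebsite.com/robots.txt, not the file of the main www host.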

When a search engine crawler visits a website, it will first look for the robots.txt file. If the file exists, the crawler will read it to determine which pages it should not crawl.

The file includes one or more "Disallow" directives, which instruct the search engine not to crawl specific pages, directories, or entire sections of the website.

By using a robots.txt file, website owners can control which pages and sections of their site are crawled by search engines.

This can help keep crawlers away from duplicate or low-value content and improve crawl efficiency, which in turn can support the site's search performance.

Robots.txt is primarily for trustworthy crawlers, as it specifies which pages they may crawl.

Before accessing any other pages on a website, these crawlers first look at this file.

On the other hand, bad bots typically ignore the instructions and crawl pages that are disallowed.

How to create a robots.txt file?

To create a robots.txt file, follow these steps:

  1. Open a text editor such as Notepad or TextEdit.

  2. Type "User-agent: *" on the first line. The User-agent line specifies which search engine bots the following directives apply to, and the "*" symbol means that the rules apply to all of them.

  3. On the next line, add a "Disallow" directive for each page, directory, or section of your website that you want to block search engine crawlers from accessing. For example, to block all crawlers from accessing the "/admin" directory, add the following line:



    Disallow: /admin/

  4. Repeat step 3 for any additional pages or directories you want to block.

  5. Save the file with the name "robots.txt".

  6. Upload the file to the root directory of your website. This is typically the main directory where your website's index.html or index.php file is located.

Here is an example of a simple robots.txt file that blocks all search engine crawlers from accessing the "/admin" directory:

User-agent: *

Disallow: /admin/
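The same file can also be generated with a short script instead of a text editor. The sketch below uses Python's standard-library pathlib. The write_robots_txt function and its parameters are assumptions made for this illustration, and the example writes into a temporary folder rather than a real website root.

```python
import tempfile
from pathlib import Path


def write_robots_txt(site_root, disallowed_paths, user_agent="*"):
    """Write a one-group robots.txt file into `site_root` and return its path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    target = Path(site_root) / "robots.txt"  # the name must be exactly robots.txt
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


# Demonstration in a throwaway folder; point site_root at your real root to use it
with tempfile.TemporaryDirectory() as site_root:
    created = write_robots_txt(site_root, ["/admin/"])
    print(created.read_text(encoding="utf-8"))
```

The printed content matches the two-line example above. Uploading the resulting file to the root directory is still a separate step.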

Alternatively, you can go to your website control panel and create the robots.txt file there.

Follow these steps:

  • Go to the website control panel and locate the file manager.

  • Click on it and browse to the root directory of your website, which is usually public_html.

  • Click on + file in the top left corner to create a new file.

  • Make sure to name the file robots.txt.

  • Right-click on it and click Edit.

  • Type your robots.txt file with your preferences and save it.

Using Yoast

Creating a robots.txt is simple, especially if you are a WordPress user. Install the Yoast SEO plugin, then go to Tools and open the File editor.

You will find the robots.txt file there. You can edit it according to your preferences. A typical file will look like this:

User-agent: *

Disallow: /wp-admin/

Sitemap: https://yourwebsite/sitemap.xml

Don’t forget to save the changes before leaving.

After creating your robots.txt file, you need to check it. Google provides a free tester in Google Search Console. Go to Google Search Console > Crawl > robots.txt tester. Google Search Console only tests the file you provide - it does not alter the original file on your website.

Where to add the robots.txt file?

This file is always added to the root of your website domain. Suppose your website is www.website.com - the bot will find it at https://www.website.com/robots.txt.

Keep in mind that the file name is case-sensitive: it must be robots.txt in lowercase, or crawlers won't find it.

How to test Robots.txt?

To test the markup of your robots.txt file, follow these steps:

  • Upload your robots.txt file to your website's root directory.

  • Open a browsing window.

  • Navigate to the location of your robots.txt file using the appropriate URL format, such as https://seodebate.com/robots.txt.

  • If you can see the contents of your robots.txt file displayed in the browser, it means your robots.txt file is publicly accessible and ready for testing.

There are two options provided by Google for testing robots.txt markup:

Robots.txt Tester in Search Console: You can use this tool if your robots.txt file is already accessible on your website.

Simply log in to your Google Search Console account, go to the Robots.txt Tester tool, and enter the URL of your robots.txt file to test its markup.

Google's open-source robots.txt library: If you are a developer, you can check out and build Google's open-source robots.txt library on your computer.

This tool allows you to test robots.txt files locally on your computer before uploading them to your website.
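A similar local check can be sketched without building anything, using Python's standard-library urllib.robotparser. This is not Google's library, and its results may differ from Googlebot's behavior, for instance on wildcards or rule precedence. The check_expectations function, the DRAFT rules, and the www.example.com URLs are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser


def check_expectations(robots_txt, expectations):
    """Return the (agent, url) pairs whose result differs from what was expected.

    `expectations` is a list of (agent, url, should_be_allowed) tuples.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for agent, url, expected in expectations:
        if parser.can_fetch(agent, url) != expected:
            failures.append((agent, url))
    return failures


# A draft file that has not been uploaded yet
DRAFT = """\
User-agent: *
Disallow: /admin/
"""

EXPECTATIONS = [
    ("Googlebot", "https://www.example.com/", True),
    ("Googlebot", "https://www.example.com/admin/login", False),
]

print(check_expectations(DRAFT, EXPECTATIONS))
```

An empty list means every URL behaved as expected. Any entry in the list points to a rule worth reviewing before the file goes live.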

How do I fix the robots.txt errors?

Sometimes, due to robots.txt restrictions, Google cannot crawl certain pages on your website.

These errors are a clear indication that you have blocked the crawlers from crawling your pages.

If you are experiencing this problem, your robots.txt file may look like this:

User-agent: *

Disallow: /

It will stop all search engines from crawling your entire website.

User-agent: *

Disallow: /example-page/

It stops the search engines from crawling a particular page.

In some instances, your page may appear in the search results even if it is blocked with robots.txt, but you will see a message that reads "no information is available".

To fix all these errors, you need to update your robots.txt file. If you want all the search engines to crawl your website, your robots.txt file should be in this format.

User-agent: *

Allow: /

If you only want Googlebot to crawl your website, give Googlebot its own group and block all other bots:

User-agent: Googlebot

Allow: /

User-agent: *

Disallow: /

Your robots.txt file will look like this if you want search engines to crawl a specific page:

User-agent: *

Allow: /example-page/

Robots.txt tester in Google Search Console will help you test changes on your website without affecting the live robots.txt file.

After validating the changes, apply them to your live robots.txt file.

However, keep a few things in mind to avoid robots.txt errors:

  • Make sure that the file is correctly formatted and that there are no syntax errors. You can use a robots.txt testing tool, such as the tester in Google Search Console, to check the syntax of the file.

  • Make sure that the robots.txt file is located in the correct place.

  • Check that the file is accessible to search engine robots. If the file is blocked or inaccessible, this could be causing errors.

  • Ensure that you are using the correct syntax for each directive in the robots.txt file. This will vary depending on the search engine you are targeting.
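Some of these checks can be automated before the file is uploaded. The following is a rough sketch of a syntax check in plain Python. The lint_robots_txt function and the KNOWN_DIRECTIVES set are assumptions made for this illustration, limited to the directives discussed in this article. It does not replace a search engine's own tester.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) for lines that look wrong."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{name}'"))
        elif name == "user-agent":
            seen_user_agent = True
        elif name != "sitemap" and not seen_user_agent:
            problems.append((number, f"'{name}' appears before any User-agent line"))
        elif name in {"allow", "disallow"} and value and not value.startswith(("/", "*")):
            problems.append((number, "path should start with '/'"))
    return problems


DRAFT_WITH_ERRORS = "User-agent: *\nDisalow: /x\nDisallow: admin\nNoindex /y"
for number, message in lint_robots_txt(DRAFT_WITH_ERRORS):
    print(f"line {number}: {message}")
```

For the sample draft this reports a misspelled directive on line 2, a path without a leading slash on line 3, and a missing colon on line 4.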

Pros and cons of Robots.txt

Pros

  • Robots.txt helps manage the crawl budget, the time a bot spends crawling your website pages. By telling crawlers which pages to skip, you can save that budget for the pages that matter.

  • Robots.txt helps avoid the crawling of temporary files. It also keeps crawlers away from back-end pages such as admin and management interfaces.

  • You can also keep crawlers out of sections that are not meant for search results. Some common examples of paths that you may want to block include:

/cart/

/cgi-bin/

/scripts/

/wp-admin/

  • Configuring robots.txt is also crucial for websites with a lot of pages and content. Without it, an aggressive crawler may put a lot of load on the site and even affect its normal functioning.

Cons

  • Although you can use robots.txt to tell a crawler which pages to visit, that does not necessarily stop a search engine from indexing a URL and showing it in search results. It may list the page or post without knowing what is on it. If you want to keep a page out of the search results, use a meta robots noindex tag instead.

  • If you block a page with robots.txt, search engines cannot crawl its links, so the page may not pass link juice on to the pages it links to.

How to check the robots.txt file of my website?

If you are uncertain whether you have a robots.txt file or not, you can type your website domain name and then add /robots.txt. It will open the robots.txt file of your website if it has any.

www.yourwebsitedomain.com/robots.txt

Is a robots.txt file necessary?

No, the robots.txt file is not necessary for a website. The bot will still crawl and index the pages even if a website doesn't have a robots.txt file.

The main purpose of a robots.txt file is to avoid the load on servers by preventing the crawlers from visiting unnecessary pages. It gives you better control over what is being crawled by the bots.

It helps keep bots from crawling duplicate content and keeps well-behaved crawlers out of certain sections of your website.

Is robots.txt safe?

The Robots.txt file itself is not a threat to a website.

Although improving the experience for web crawlers is a good idea, keep in mind that not all web bots are created equal.

Some bots are malicious and search for a way to get unauthorized access. Because robots.txt is publicly readable and such bots ignore it, do not rely on it to hide sensitive information; protect that content with authentication instead.

How are sitemap protocols linked with robots.txt?

XML sitemaps contain a list of all the pages on your website and provide additional information in the form of metadata.

Robots.txt and XML sitemaps together form a perfect combo and help search engines understand your website in a better way.

In 2006, Google, Yahoo, and Microsoft joined together to support a common sitemap protocol.

The goal was to assist the webmasters who wanted to provide more accurate information about their websites.

At first, webmasters were required to submit their sitemaps through Google Search Console and the webmaster tools of other search engines.

Later, in 2007, the search engines added support for discovering XML sitemaps through robots.txt.

The purpose was to pave the way for all the bots to discover all the pages on a website.

Sitemaps and robots.txt are often linked together by including a reference to the sitemap in the robots.txt file.

This helps search engines discover the sitemap and understand how the website is organized, which can improve crawl efficiency and ensure that all pages are indexed correctly.

Together, XML sitemaps and a robots.txt file help search engines understand the content and structure of a website.

By using them, you can enhance the chances of your website's visibility and indexing.

How does robots.txt help in bot management?

Bot management is the process of identifying and filtering bot activity on a website. It is crucial to handle the activity of bots to prevent them from slowing down the website.

A well-built robots.txt file helps in managing the bot activity, which also supports the SEO of a site.

However, you cannot always control the activity of bad bots by using a robots.txt file.

It is good to use a reliable bot management system, such as Cloudflare Bot Management or DataDome, to block malicious web crawlers.

The bottom line

Robots.txt controls how different crawlers access your website. It is a simple way to tell search engines how to crawl your website or its pages effectively.

Used carefully, this file will benefit the SEO of your website. If you are unsure, seek expert help before uploading it to the root directory of your website to prevent errors.
