GPTBot: OpenAI's New Web Crawler Unveiled

By Shahid Maqbool
On Aug 8, 2023

Key Takeaways

  • OpenAI has introduced a new web crawler named GPTBot.

  • Its purpose is to collect data to improve AI models.

  • Users can control GPTBot's access using robots.txt.

OpenAI has revealed details about its web crawler, GPTBot. The crawler moves from website to website, collecting data from the pages it visits.

GPTBot's purpose is to gather information from across the internet, which OpenAI will use to improve its artificial intelligence models.

AI models need large amounts of data to learn from when they are trained and updated.

Website owners can control whether GPTBot is allowed to visit and collect data from their site by editing their robots.txt file.

OpenAI has provided comprehensive documentation for GPTBot, which can be accessed on its official website.

User-agent token: GPTBot

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot
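
Because every GPTBot request carries this token, a site can recognize the crawler server-side by matching the User-Agent header. Below is a minimal Python sketch of such a check; the header value is the one OpenAI documents, and keep in mind that User-Agent strings can be spoofed, so the token alone is not proof of identity.

# Minimal sketch: recognize GPTBot from the User-Agent header.
def is_gptbot(user_agent: str) -> bool:
    # OpenAI's token appears verbatim in the full user-agent string.
    return "GPTBot" in user_agent

# The full user-agent string published by OpenAI:
ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
      "compatible; GPTBot/1.0; +https://openai.com/gptbot")
print(is_gptbot(ua))  # True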

If website owners want to block GPTBot from their entire site, they can add the following lines to the robots.txt file, telling GPTBot that it is not allowed anywhere on the site.

User-agent: GPTBot

Disallow: /

If you only want GPTBot to avoid certain sections or pages, you can disallow just those paths. For example, the rules below let GPTBot crawl one directory while blocking another.

User-agent: GPTBot

Allow: /category-1/

Disallow: /category-2/
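
To confirm that rules like these behave as intended, you can test them with Python's standard-library robots.txt parser before deploying them. A minimal sketch, using hypothetical example.com URLs:

import urllib.robotparser

# The same rules shown above, parsed directly from a string.
rules = """
User-agent: GPTBot
Allow: /category-1/
Disallow: /category-2/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot may fetch category-1 but not category-2.
print(parser.can_fetch("GPTBot", "https://example.com/category-1/page"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/category-2/page"))  # False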

Right now, GPTBot crawls websites from a documented range of Internet Protocol (IP) addresses: 40.83.2.64/28.

The IP range GPTBot uses could change over time, so website owners should check OpenAI's documentation periodically for updates.
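
Since the User-Agent header can be spoofed, checking the source IP against the published range is a stronger verification. A minimal sketch using Python's ipaddress module, with an illustrative sample address; always consult OpenAI's documentation for the current ranges.

import ipaddress

# The CIDR block OpenAI published for GPTBot at launch.
GPTBOT_RANGE = ipaddress.ip_network("40.83.2.64/28")

def is_gptbot_ip(addr: str) -> bool:
    # True if the visiting address falls inside the published block.
    return ipaddress.ip_address(addr) in GPTBOT_RANGE

print(is_gptbot_ip("40.83.2.70"))   # True: inside 40.83.2.64-40.83.2.79
print(is_gptbot_ip("203.0.113.5"))  # False: outside the block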

OpenAI has disclosed that GPTBot's primary purpose is to enhance future AI models by aggregating data from pages it crawls.

OpenAI has published detailed documentation explaining how GPTBot works and how website owners can manage or limit its crawling.

However, some websites have already had issues with GPTBot crawling their pages excessively, and some webmasters have expressed dissatisfaction and taken steps to block its access.
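
Before blocking the crawler outright, it can help to measure how heavily it is actually hitting a site. A minimal sketch that counts GPTBot requests per hour from a server access log; it assumes the common/combined log format and a hypothetical access.log path.

from collections import Counter

hits_per_hour = Counter()

with open("access.log") as log:
    for line in log:
        if "GPTBot" in line:
            # Timestamps look like [08/Aug/2023:14:02:31 +0000];
            # the first 14 characters after "[" give the date plus hour.
            start = line.find("[") + 1
            hits_per_hour[line[start:start + 14]] += 1

for hour, count in sorted(hits_per_hour.items()):
    print(hour, count)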

Google and other major companies are now working on new protocols and technical standards as alternatives to robots.txt.

These methods would be designed specifically for controlling access by AI-powered crawlers and search engines, since the current robots.txt protocol may not work as well for the latest AI systems.
