OpenAI has introduced GPTBot, a new web crawler whose activity site owners can monitor and manage.
GPTBot collects data from the web pages it crawls to improve future AI models.
Site owners can control GPTBot's access via robots.txt, while Google explores AI-focused alternatives to the standard protocol.
OpenAI has revealed details about its latest web crawler, GPTBot. Because the crawler identifies itself with a distinct user-agent, website owners can track and manage OpenAI's crawling activity on their sites.
Using the robots.txt protocol, website administrators can now allow or restrict OpenAI's access and control which parts of a site the crawler may reach.
To facilitate this process, OpenAI has provided comprehensive documentation for GPTBot, which can be accessed on its official website.
User-agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
As with other web crawlers, site owners can disallow GPTBot from crawling specific sections of a site or the entire site.
To prevent GPTBot's access to your entire site, you can include the following lines in your site's robots.txt:
User-agent: GPTBot
Disallow: /
If you prefer to grant GPTBot access to certain sections while restricting others, you can modify your site's robots.txt in the following manner:
User-agent: GPTBot
Allow: /category-1/
Disallow: /category-2/
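Rules like these can be sanity-checked before deployment. The sketch below uses Python's standard-library robots.txt parser against the example above; the site and page paths are hypothetical.

```python
from urllib import robotparser

# The robots.txt rules from the example above (hypothetical paths).
rules = """\
User-agent: GPTBot
Allow: /category-1/
Disallow: /category-2/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot may fetch pages under /category-1/ but not /category-2/.
print(parser.can_fetch("GPTBot", "https://example.com/category-1/page.html"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/category-2/page.html"))  # False
```

This is the same matching logic a well-behaved crawler applies, so it is a quick way to confirm the rules say what you intend.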
Currently, GPTBot operates from the IP range 22.214.171.124/28. This range may change, so site owners should periodically check the file OpenAI publishes for updates.
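Because the user-agent string can be spoofed, a site can cross-check the source IP of a request claiming to be GPTBot against the published range. A minimal sketch using Python's standard `ipaddress` module, treating the CIDR block quoted above as illustrative (always consult OpenAI's published file for current ranges):

```python
import ipaddress

# The CIDR block quoted in the article; may change, so treat as illustrative.
# strict=False accepts a host address inside the block as shorthand for it.
GPTBOT_RANGE = ipaddress.ip_network("22.214.171.124/28", strict=False)

def is_gptbot_ip(addr: str) -> bool:
    """Return True if the address falls inside the published GPTBot range."""
    return ipaddress.ip_address(addr) in GPTBOT_RANGE

# A request presenting the GPTBot user-agent can be verified by source IP:
print(is_gptbot_ip("22.214.171.125"))  # an address inside the /28 block
```

Requests that present the GPTBot user-agent but originate outside the range can then be treated as impostors.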
OpenAI has disclosed that GPTBot's primary purpose is to enhance future AI models by aggregating data from web pages it crawls.
The organization also provides guidance on how to block GPTBot's access to a site for those who choose to do so.
Recently, a webmaster on WebmasterWorld expressed dissatisfaction with GPTBot's activity, reporting over a thousand hits from the bot against individual pages.
Fortunately, the site's automated system recognized GPTBot's activity and issued a 403 response, denying access because the bot was not on the site's whitelist and could not pass its "human" verification test.
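The allow-list behavior described above can be sketched as a simple user-agent check. This is a hypothetical illustration, not the webmaster's actual setup: crawlers on the allow list pass, GPTBot is denied with a 403, and other traffic falls through.

```python
# Hypothetical allow list of crawler tokens permitted to access the site.
ALLOWED_BOTS = {"Googlebot", "Bingbot"}

def crawler_status(user_agent: str) -> int:
    """Return the HTTP status an incoming request would receive."""
    for bot in ALLOWED_BOTS | {"GPTBot"}:
        if bot in user_agent:
            # Known crawlers pass only if whitelisted; others get 403.
            return 200 if bot in ALLOWED_BOTS else 403
    return 200  # ordinary (non-crawler) traffic passes through

ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
      "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
print(crawler_status(ua))  # 403: GPTBot is not on the allow list
```

Real deployments typically do this at the web server or firewall layer (and combine it with the IP check above), but the decision logic is the same.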
Previously, site owners could only restrict access for ChatGPT plugins.
Meanwhile, Google and other companies are actively developing an alternative to the conventional robots.txt protocol, designed specifically for the requirements of AI-driven search.