Crawl Budget

By Shahid Maqbool
On Apr 12, 2023

What is Crawl Budget?

The crawl budget is the number of pages on a website that Googlebot can and wants to crawl within a given period of time.

The actual number of pages Googlebot crawls on any given day changes.

It depends on things like how big the website is, how complex it is, how important Google thinks it is, and the availability of server resources.

Why is the crawl budget important?

The crawl budget matters for SEO because it determines how easily search engine crawlers can access, crawl, and index the pages of your website.

This then impacts how visible your site is in search results, and how much traffic and money it can potentially make.

If a website has a large crawl budget, Googlebot can rapidly crawl and index more of its pages. So new content is more likely to be found and included, improving visibility in search.

But if the crawl budget is small, bots may have trouble crawling all pages. Some pages might then not get indexed or updated. This can reduce rankings and visibility.

In short, a generous crawl budget helps new content get discovered and indexed quickly, while a limited one means missed opportunities for traffic and visibility.

How does the crawl budget work?

Google decides a website's crawl budget using two key things: the crawl rate limit and crawl demand.

The crawl rate limit is the maximum rate at which Googlebot can fetch pages from a site without overloading its server.

The crawl demand is how often and urgently Googlebot needs to crawl that site's pages.

Crawl demand

Crawl demand is a key factor in determining how frequently and urgently Googlebot crawls a website's pages.

It can have a significant impact on a site's crawl budget, which is the number of URLs that Googlebot can and wants to crawl on a site.

Here are some important considerations related to the crawl demand:

Popularity

The popularity of a URL is reflected in the traffic and social media engagement it receives.

The more popular a website is, the more frequently it will need to be crawled to ensure it stays fresh in Google's index.

This is where crawl demand comes into play, as websites with high traffic and engagement are likely to have a higher crawl demand than their less popular counterparts.

Staleness

The freshness of URLs is also a key factor in Google's indexing process.

Google's systems work hard to prevent URLs from becoming stale in the index, which means that pages containing time-sensitive information or those that are updated frequently may have higher crawl demand to keep the index up-to-date.

Site-wide events

Big changes to a website can prompt search engines to crawl it more.

For example, if a website gets a new design or adds many new pages, Google will recrawl it to pick up the changes, which temporarily increases crawl demand.

Website owners working on SEO should keep this in mind: they want their content to show up properly in search results, and updates or redesigns can temporarily increase how much Google crawls the site.

Crawl rate limit

The crawl rate limit controls how fast Googlebot visits a website. It sets limits on:

  • How many connections Googlebot can use to crawl the site at the same time

  • How long Googlebot has to wait between visits

The crawl rate limit is an important setting because it directly impacts how quickly your site gets indexed and updated in Google's search results.

It's worth noting that there are a few factors that can impact the crawl rate limit. 

Crawl Health

If a website loads fast and works well for a long time, Google may raise the crawl rate limit. This allows Googlebot to visit and scan more pages on that site.

But if a website starts running slower or giving server errors, Google may lower the crawl rate limit. This causes Googlebot to visit the struggling site less often.

The limit set in the Search Console

Another factor is the limit website owners set themselves: through Search Console, they can cap how fast Googlebot crawls their site.

However, it's essential to note that setting higher limits doesn't always lead to increased crawling.

History of crawl budget

The idea of a crawl budget started in 2009 when Google said it had limits on its ability to crawl the internet.

Google explained that its Googlebot could only crawl and index a small part of all the content available online.

Google encouraged website owners to optimize their sites to work within Googlebot's crawl budget limitations. This meant making sure the most important pages were easiest for Google to find and crawl since it couldn't crawl every page on the internet.

“The Internet is a big place; new content is being created all the time. Google has a finite number of resources, so when faced with the nearly-infinite quantity of content that's available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we've crawled, we're only able to index a portion.”

Google

Over time, SEOs and webmasters began to pay more attention to the crawl budget, recognizing its importance for ensuring that their websites were properly indexed by search engines. 

In response to this growing interest, Google published a post in 2017 titled "What crawl budget means for Googlebot," which clarified how Google calculates crawl budget and how it thinks about the concept.

“First, we'd like to emphasize that crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they're published, crawl budget is not something webmasters need to focus on. Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”

Google

What factors does Google consider for crawl budget allocation?

Google considers several factors to determine the crawl budget allocated to a website. 

Some of the main factors include:

  1. Site size: Bigger sites require more crawl budget.

  2. Server setup: A site's performance and load times may affect the crawl budget.

  3. Update frequency: Google prioritizes content that gets updated regularly.

  4. Links: The internal linking structure and the number of dead links.

  5. Faceted navigation and infinite filtering combinations: Faceted navigation can generate new URLs based on parameters selected, wasting the crawl budget.

  6. Session identifiers and tracking IDs: Parameters used for analytics or user preferences through the URL may create duplicate pages.

  7. On-site duplicate content, soft error pages, hacked pages, infinite spaces and proxies, low-quality, and spam content: Having many low-value-add URLs can negatively affect crawling and indexing.

What crawl budget encompasses other than pages?

Although we often talk about crawl budget in terms of pages, it's important to note that it actually encompasses any document that search engines crawl.

This includes not only HTML pages, but also JavaScript and CSS files, mobile page variants, hreflang variants, and even PDF files.

All of these documents can affect the overall crawl budget of a website, so it's important to consider them when optimizing for the crawl budget.

Who should care about the crawl budget?

As quoted earlier from Google:

“...if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”

Websites with fewer pages are typically easier for Googlebot to crawl and index completely. Google is less likely to have issues crawling smaller sites within its resource limits.

But crawl budget is more of a concern for large websites that have many pages or complex architectures.

The owners of big websites with a lot of content may need to optimize how efficiently Google can crawl and index their pages. This is because Googlebot has a harder time getting to all the content given the expansive size and complexity.

So large sites have to actively manage crawl budget to help Googlebot prioritize indexing the most important pages.

For owners of large sites, Google publishes a comprehensive guide to managing crawl budget, which can help such websites manage it more efficiently.

How to check the crawl activity?

Website owners can check crawl activity for their websites.  

Use Google Search Console

You can use your Google Search Console account to check the stats of your daily crawl budget.

  • Log in to your Google Search Console account and select the website you want to check.

  • Open "Settings" from the menu on the left side of the screen.

  • Under the "Crawling" section, open the "Crawl stats" report.

  • In this section, you can view the number of pages Google crawls daily, as well as other useful metrics like page download time and response codes.

The average number of pages crawled per day gives you a rough indication of your daily crawl budget.

For example, let's say the average crawl budget for your website is 100 pages per day.

In theory, Google would crawl 3,000 pages on your website per month (100 pages x 30 days). 

However, if your website has a large number of low-quality or duplicate pages, Google may prioritize crawling higher-quality pages, leading to a lower actual crawl rate.

On the other hand, if your website has grown significantly over time and now offers a wealth of high-quality content, Google may increase your crawl budget to keep up with the demand. 

Check your server logs

Checking your website's server logs can also provide insight into how much Google is crawling your site. Server logs record every request to your site, including from Googlebot.

By accessing these log files from your web host or server, you can see:

  • Which of your pages Googlebot crawls most often

  • Which pages are crawled less frequently

  • Pages Googlebot doesn't crawl at all

Analyzing these server logs allows you to better understand how Googlebot interacts with your site. You can then optimize your crawl budget accordingly.

For example, you may want to add caching for heavily crawled pages to reduce server load. Or, improve internal links to pages that aren't being crawled enough to make them more visible.
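
As a rough illustration, the hypothetical sketch below counts Googlebot requests per URL in a standard "combined" access log. The file name, the log format, and the simple user-agent check are assumptions to adapt to your own server; a thorough audit would also verify Googlebot hits by reverse DNS rather than by user-agent string alone.

    # Sketch: count Googlebot requests per URL in an access log.
    # Path and log format are assumptions -- adjust for your own server.
    import re
    from collections import Counter

    LOG_PATH = "access.log"                      # hypothetical log file
    REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP')

    hits = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
        for line in log:
            if "Googlebot" not in line:          # crude user-agent filter
                continue
            match = REQUEST_RE.search(line)
            if match:
                hits[match.group(1)] += 1

    # URLs Googlebot fetches most often; pages missing from this report
    # are candidates for "not crawled at all"
    for url, count in hits.most_common(20):
        print(f"{count:6d}  {url}")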

What badly affects the crawl budget?

Several factors can negatively affect the crawl budget including:

Duplicate content

Duplicate content refers to pages that are highly similar or identical. It can hurt your crawl budget because search engines spend time fetching near-identical URLs that add little value, leaving less capacity for your unique, important pages.

Some common causes of duplicate pages can be: 

Faceted Search and Session IDs

Faceted search and session IDs can create duplicate content issues that waste the crawl budget.

Faceted search allows users to filter e-commerce results by attributes like price and colour. This generates multiple URLs displaying the same content but with different filters.

Session IDs uniquely track users as they browse a site. But they can also produce multiple URLs for the same page.

If faceted search and session IDs create too many duplicate URLs, it wastes the crawl budget on duplicate pages.

Google looks at how efficiently sites use their budget when deciding on more allocation. This includes metrics like:

  • Number of unique URLs

  • Crawl rate

  • Server response times

So to optimize the crawl budget, site owners should ensure faceted search and session IDs don't produce excessive duplicate content.

Multiple versions of the same page

If a website has multiple versions of the same page with different URLs, search engines may view these as separate pages, even though the content is the same. 

This can lead to search engines allocating their crawl budget to crawl these duplicate pages, instead of focusing on other important pages on the website.

Here are some more examples of multiple versions of the same page with different URLs:

  • Different subdomains: For example, if you have the same content on www.example.com/page and example.com/page, search engines may see these as different pages and may allocate resources to crawl both URLs.

  • URL parameters: If you have the same content on http://example.com/page and http://example.com/page?utm_source=google&utm_medium=cpc&utm_campaign=summer_sale, search engines may view these as different pages and crawl both URLs, even though they contain the same content.

  • Case sensitivity: If your server is case sensitive, search engines may view URLs like http://example.com/page and http://example.com/Page as different pages and crawl both URLs.

  • Trailing slash: If you have the same content on http://example.com/page and http://example.com/page/, search engines may view these as different pages and crawl both URLs.
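
To see how such variants can be collapsed to a single form before they waste crawls, here is a minimal, illustrative normalization sketch. The specific rules used below (force https and www, lowercase the path, strip tracking parameters, drop trailing slashes) are assumptions; tailor them to how your site is actually configured.

    # Illustrative sketch: map URL variants onto one normalized form.
    # All normalization rules here are assumptions -- adapt to your site.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

    def normalize(url: str) -> str:
        scheme, netloc, path, query, _fragment = urlsplit(url)
        netloc = netloc.lower()
        if not netloc.startswith("www."):
            netloc = "www." + netloc             # assume www is canonical
        path = path.lower().rstrip("/") or "/"   # assume case-insensitive paths
        kept = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
        return urlunsplit(("https", netloc, path, urlencode(kept), ""))

    variants = [
        "http://example.com/page",
        "http://example.com/page/",
        "http://example.com/Page",
        "http://example.com/page?utm_source=google&utm_medium=cpc",
    ]
    print({normalize(u) for u in variants})      # all collapse to one URL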

Never-ending URLs

Some websites have "never-ending" URLs that keep getting longer as users scroll down a page. More content loads dynamically without refreshing. Social media, news, and e-commerce sites often use these for long pages.

The problem is that Google may see slightly different URL versions as unique pages, even though it's the same page with more content loaded. This creates duplicate content crawl issues.

Google continuously crawls and indexes essentially the same page over and over, wasting the crawl budget. It also bloats Google's index with duplicate versions of pages.

This duplicate content and endless indexing can hurt a site's overall search performance.

It's better to use pagination or other techniques. This avoids infinitely expanding URLs for the same page content.

High numbers of non-indexable pages

If a significant portion of your pages is non-indexable, it can lead to a waste of resources and time for the crawlers.

Here are some ways in which different types of non-indexable pages can negatively impact your website's crawl budget:

Redirects (3xx)

When search engine crawlers encounter redirects, they will follow the redirect and crawl the new URL. 

However, if there are too many redirects, it can lead to a waste of resources as crawlers may not be able to crawl all the redirected pages in a single visit.

This can also impact the time it takes for search engines to discover new content on your site.
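
One practical check is to follow each hop of a redirect yourself and see how long the chain is. The sketch below does this with the third-party requests package; the URL is a placeholder. Chains longer than one hop are good candidates for pointing the original link straight at the final URL.

    # Sketch: count how many hops a URL takes before it stops redirecting.
    # Requires the "requests" package; the URL below is a placeholder.
    import requests

    def redirect_chain(url: str, max_hops: int = 10) -> list[str]:
        chain = [url]
        for _ in range(max_hops):
            resp = requests.head(url, allow_redirects=False, timeout=10)
            if resp.status_code not in (301, 302, 303, 307, 308):
                break
            url = requests.compat.urljoin(url, resp.headers.get("Location", ""))
            chain.append(url)
        return chain

    hops = redirect_chain("http://example.com/old-page")
    print(f"{len(hops) - 1} redirect hop(s):", " -> ".join(hops))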

Pages that can't be found (4xx)

If your website contains a high number of pages that return a 4xx status code (e.g. 404 Page Not Found), crawler resources are wasted, because crawlers keep requesting these dead URLs whenever they encounter links pointing to them and recheck them from time to time.

Pages with server errors (5xx)

If your website contains a high number of pages that return a 5xx status code (e.g. 500 Internal Server Error), it can signal to search engines that your website is experiencing technical issues. 

This can negatively impact your website's crawl budget as search engines may reduce their crawl rate or stop crawling your site altogether until the issues are resolved.

Pages containing noindex directive

Pages that carry a robots noindex directive are excluded from the index, but Googlebot still has to crawl them in order to see the directive at all.

A few such pages are harmless, but if there are a large number of them, crawler resources are repeatedly spent on URLs that will never appear in search results instead of on discovering and indexing new content on your site.
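
If you want a rough idea of how many of your URLs carry the directive, a small check like the hypothetical sketch below looks for it in both the X-Robots-Tag response header and the robots meta tag; the URL and the simple regex are illustrative assumptions, and a real audit would use a proper crawler or HTML parser.

    # Sketch: does a URL carry a noindex directive?
    # Requires "requests"; the URL is a placeholder.
    import re
    import requests

    def has_noindex(url: str) -> bool:
        resp = requests.get(url, timeout=10)
        if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
            return True
        # crude scan for <meta name="robots" content="...">
        meta = re.search(
            r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
            resp.text, re.IGNORECASE)
        return bool(meta and "noindex" in meta.group(1).lower())

    print(has_noindex("https://example.com/some-page"))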

Hacking

Hacking can have a significant impact on the crawl budget by creating a large number of non-indexable pages, redirects, and malicious content on the website.

This can result in reduced crawl frequency and lower search engine rankings.

Some ways hacking can harm crawl budget include:

Injecting code or content

Hackers may inject malicious code or content into a website to manipulate search engine rankings or to install malware on users' machines.

This can lead to search engines crawling irrelevant pages, reducing the crawl budget, and negatively affecting search rankings.

Creating new pages

Hackers can create new pages on a website with spammy or malicious content, which can harm the site's visitors and its performance in search results.

Search engines may crawl these pages and reduce the crawl budget of the website.

Manipulating existing pages

Hackers can secretly change parts of current website pages. They add hidden spam links or text using CSS or HTML code.

Users may not notice these additions. But search engines see everything in the page code. If Google finds these subtle spam hacks, they may lower the number of pages they crawl on that site.

Redirects

Hackers may inject code that redirects users to harmful or spammy pages, which can harm the site's visitors and lead to a decrease in the crawl budget.

Low-quality content

Low-quality content refers to pages that have very little content or provide little value to users. 

This type of content can negatively impact a website's crawl budget because search engines may view these pages as unimportant or not worth crawling.

Low-quality content can include thin or duplicate content, keyword-stuffed content, and content that does not meet the user's intent.

One example of low-quality content is a FAQ section where each question and answer is served over a separate URL.

This type of content not only creates a large number of low-quality pages but also generates multiple URLs for essentially the same content. 

This can lead to duplicate content issues and confusion for search engines as they try to determine which URL to crawl and index.

Pages with high load time

Page load time is how long it takes for all parts of a page to show up when someone visits a website. This includes loading all the HTML, CSS, JavaScript, images and other things that make the page work.

If pages take very long to fully load or don't load at all, search engines see that as a sign of bad website performance.

So when Google finds lots of slow or broken pages that won't display properly, they may decide to crawl fewer pages on that site. This reduces the website's crawl budget.

Bad internal link structure

Good website structure includes organizing pages in a sensible way and linking related content.

How pages link to each other impacts the crawl budget. If the internal links are chaotic or incomplete, some pages get less attention from Google.

Pages with few internal links coming to them may be seen as less valuable by Google. So they often crawl them less, reducing their page-level crawl budget.

Pages deep in complex site structures also end up with fewer total links. Google is less likely to find and frequently crawl pages that take a lot of clicks to reach.

On the flip side, pages that many other pages link to will be discovered and crawled more by Google.

Incorrect URLs in XML sitemap

An XML sitemap is a file that lists all the pages on a website that a webmaster wants search engines to index.

It helps search engines crawl a website more efficiently by providing a roadmap of all the pages. 

Incorrect URLs in XML sitemaps can significantly impact crawl budget optimization.

If a website's XML sitemap contains URLs that no longer exist or have been redirected, search engines may spend their crawl budget trying to access these pages.

On the other hand, if a website's XML sitemap excludes important pages, search engines may not crawl them.

This can also affect the website's crawl budget negatively as search engines may not crawl all the pages that they could have crawled if the XML sitemap was correctly structured.

Regularly checking the XML sitemap for errors and making sure it only contains URLs of indexable pages is crucial for efficient crawl budget management.
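
A simple way to catch such entries is to fetch the sitemap and request every URL it lists, flagging anything that does not return a clean 200. The sketch below assumes the requests package, a placeholder sitemap location, and a plain sitemap rather than a sitemap index.

    # Sketch: flag sitemap URLs that redirect, 404, or error.
    # Requires "requests"; the sitemap URL is a placeholder.
    import xml.etree.ElementTree as ET
    import requests

    SITEMAP_URL = "https://www.example.com/sitemap.xml"
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

    for url in urls:
        status = requests.head(url, allow_redirects=False, timeout=10).status_code
        if status != 200:
            print(status, url)       # entry to fix or remove from the sitemap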

Crawl budget optimization

Crawl budget optimization goes hand in hand with search engine optimization: sound SEO practices already optimize much of the crawl budget automatically.

You can improve it further, however, by keeping in mind a few technical details that are easy to overlook.

Build an organized and updated XML sitemap

An XML sitemap is particularly useful for large or new websites with many pages. It typically lives at the root of the site, so you can usually access it by adding "/sitemap.xml" after the website's main URL.

It's important to note that the XML sitemap is not intended for users, as it is written in a specific format that search engine bots can understand, rather than HTML, which is more user-friendly.

Therefore, it's important to include only indexable, updated, and essential pages in the sitemap, and to update it regularly. 

After creating the sitemap, upload it to the root directory of the website and submit it to Google through Google Search Console.
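
For illustration, here is a minimal sketch of generating such a sitemap with Python's standard library; the page list and output location are placeholders, and in practice the list of indexable URLs would come from your CMS or crawl data.

    # Minimal sketch: write a sitemap.xml for a list of indexable URLs.
    # The page list below is a placeholder.
    import xml.etree.ElementTree as ET

    pages = [
        "https://www.example.com/",
        "https://www.example.com/products/",
        "https://www.example.com/blog/latest-post/",
    ]

    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in pages:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page

    ET.ElementTree(urlset).write("sitemap.xml",
                                 encoding="utf-8", xml_declaration=True)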

Fix broken links

Broken links are hyperlinks on a web page that no longer work, typically because the destination page has been deleted or moved or its URL has changed. 

When search engines encounter broken links or long chains of redirects, they are unable to reach the intended page and essentially hit a "dead end." 

This can waste the crawl budget as search engines spend time attempting to access pages that are no longer available or are redirected multiple times, reducing the number of pages they can crawl on your site. 

By fixing broken links and reducing the number of redirects, you can quickly recover the wasted crawl budget and improve the user experience for visitors to your site. 

To fix broken links and optimize your crawl budget, you can follow these steps:

Identify broken links

Use a broken link checker tool or crawl your website regularly to identify any broken links.

Prioritize fixing high-priority broken links

Fix links that are important for user experience or lead to important pages on your website first.

Update external links

After fixing internal broken links, update any external links pointing to the broken link.

Redirect broken external links

If a broken link is pointing to an external website, consider redirecting it to a working page on a similar topic or replace it completely with a new URL.

Use 301 redirects

For broken links that cannot be fixed, use a 301 redirect to point the link to a working page. This ensures that any traffic or search engine value from the broken link is redirected to a working page on your website.

Monitor and recheck regularly

Continuously monitor your website for broken links and recheck previously fixed links to ensure they remain working.
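
As a starting point for the first step above, the sketch below scans the links on a single page and reports any that return a 4xx status. It assumes the requests and beautifulsoup4 packages, and the start URL is a placeholder; a real audit would crawl the whole site and respect robots.txt.

    # Sketch: report broken (4xx) links found on one page.
    # Requires "requests" and "beautifulsoup4"; the URL is a placeholder.
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    START_URL = "https://www.example.com/"

    html = requests.get(START_URL, timeout=10).text
    links = {urljoin(START_URL, a["href"]) for a in
             BeautifulSoup(html, "html.parser").find_all("a", href=True)}

    for link in sorted(links):
        if not link.startswith("http"):
            continue                   # skip mailto:, tel:, javascript:, etc.
        status = requests.head(link, allow_redirects=True, timeout=10).status_code
        if 400 <= status < 500:
            print(status, link)        # broken link to fix or redirect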

Carefully build an internal link structure

Having a solid internal link structure is crucial for optimizing the crawl budget and improving the user experience on your website. 

One important aspect of internal linking is ensuring that your most important pages have plenty of links pointing to them.

This helps search engines understand the hierarchy and importance of your content.

When it comes to older content that still drives traffic, it's important to make sure that it's still easily accessible to both users and crawlers. 

One way to do this is by linking to the older content from newer articles, blog posts, or other relevant pages on your website. 

This helps to keep the older content at the forefront of your internal link structure and ensures that it remains visible and accessible to both users and crawlers.
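
One practical way to apply this is to count how many internal links point at each page and flag the under-linked ones. The sketch below runs over placeholder crawl data; in practice the page-to-links map would come from a crawl of your own site.

    # Illustrative sketch: find pages with few or no internal links.
    # The link map below is placeholder data standing in for crawl output.
    from collections import Counter

    outlinks = {
        "/": ["/blog/", "/products/", "/about/"],
        "/blog/": ["/blog/new-post/", "/products/"],
        "/blog/new-post/": ["/products/"],
        "/products/": ["/"],
        "/old-guide/": [],           # nothing links here: an orphan candidate
    }

    inlinks = Counter()
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target] += 1

    for page in outlinks:
        print(f"{inlinks[page]:3d} internal link(s) -> {page}")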

Fast load time

One of the ways to optimize the crawl budget is by ensuring that your website pages have fast load times.

Pages that take more than five seconds to fully load are problematic and can hurt the user experience, so ideally your pages should load in well under that.

To improve and monitor your pages' load times, consider the following points:

Improving page load times by optimizing your JavaScript

One of the most important factors that affect user experience and search engine rankings is page speed.

JavaScript can often be a source of slow page load times, so optimizing it can have a significant impact.

This can involve reducing the size of JavaScript files, combining them to cut down the number of requests, and deferring non-critical JavaScript so it doesn't block rendering.

Optimize images

Images can significantly impact the loading time of a webpage. You can optimize images by compressing them, reducing their file size, and choosing the right file format.

Minimize HTTP requests

Each request made to the server takes time to complete, so minimizing the number of HTTP requests can improve load time.

You can do this by combining multiple files into one or using CSS sprites.

Reduce server response time

How quickly the server responds to visitors can affect how fast pages load.

To speed up the server response time, you can use a content delivery network.

Also, reduce the number of plugins and scripts running on the server and use caching.

Minify code

Minifying code means removing extra spaces, line breaks and other unneeded characters from HTML, CSS and JavaScript files. This makes the file sizes smaller so pages load faster.

Implement lazy loading

Lazy loading means only loading the content users can see first like above-the-fold content. The rest of the content loads later, when users scroll down.

Use tools to check the page load time

You can use various online tools such as Pingdom, WebPagetest, or GTmetrix to check page load time. 
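
For a quick local check alongside those tools, a short script can time how long pages take to download, as in the sketch below. Note that this only measures the server response and transfer, not full rendering in a browser; the requests package is assumed and the URLs are placeholders.

    # Rough sketch: time how long a few pages take to download.
    # Requires "requests"; URLs are placeholders.
    import time
    import requests

    for url in ["https://www.example.com/", "https://www.example.com/blog/"]:
        start = time.perf_counter()
        resp = requests.get(url, timeout=30)
        elapsed = time.perf_counter() - start
        print(f"{elapsed:5.2f}s  {resp.status_code}  {url}")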

Google Analytics can also surface page timing data, and Google Search Console's Crawl stats report (under Settings) shows how quickly your pages respond to Googlebot along with a breakdown of crawl requests by response type, including errors and unreachable pages.

Bing Webmaster Tools offers comparable crawl and error reporting for Bing's crawler.

Avoid duplicate content

One of the best ways to optimize the crawl budget is to prevent duplicate content. Here are some ways to do that:

Setting up website redirects for all domain variants (HTTP, HTTPS, non-WWW, and WWW)

A website can be reached through different domain versions - like http, https, non-www, and www. This can cause duplicate content problems.

So it's important to redirect all the domain versions to one main one. For example, if a site works with both http and the more secure https, there should be a redirect from http to https.

This way, there is only one canonical site that search engines see.
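
To verify the redirects are in place, you can request each variant and confirm it answers with a 301 pointing at the canonical host, as in the hypothetical sketch below (the requests package is assumed and the domain is a placeholder). A variant that redirects somewhere else first shows up as a chain worth fixing too.

    # Sketch: check that every domain variant 301s straight to the canonical.
    # Requires "requests"; domain and preferred variant are placeholders.
    import requests

    CANONICAL = "https://www.example.com/"
    variants = [
        "http://example.com/",
        "http://www.example.com/",
        "https://example.com/",
    ]

    for variant in variants:
        resp = requests.head(variant, allow_redirects=False, timeout=10)
        target = resp.headers.get("Location", "")
        ok = (resp.status_code == 301
              and target.rstrip("/") == CANONICAL.rstrip("/"))
        print("OK " if ok else "FIX", variant, "->", resp.status_code, target)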

Making internal search result pages inaccessible to search engines

Internal search result pages can generate a lot of duplicate content.

Website owners can block access to these pages by editing robots.txt. That way search engines avoid duplicates and focus on key site content.
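
A rule such as "Disallow: /search" (the exact path depends on how your site builds search URLs) usually covers this, and you can verify the behaviour with Python's standard-library robots.txt parser, as in the sketch below; the domain and paths are placeholders.

    # Sketch: confirm internal search URLs are blocked by robots.txt
    # while normal pages stay crawlable. Domain and paths are placeholders.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser("https://www.example.com/robots.txt")
    parser.read()

    test_urls = [
        "https://www.example.com/search?q=shoes",          # should be blocked
        "https://www.example.com/products/blue-shoes",      # should stay crawlable
    ]
    for url in test_urls:
        allowed = parser.can_fetch("Googlebot", url)
        print("crawlable" if allowed else "blocked  ", url)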

Disabling dedicated pages for images

Some content systems like WordPress create separate attachment pages just for images. These pages contain little real content and add thin, low-value URLs for crawlers to process.

To avoid problems, website owners should disable these unnecessary image pages. This helps focus on key website content.

Being careful about categories and tags

Websites sometimes organize content using categories and tags. But problems can happen if these get out of hand.

Too many categories or tags clutter things up. It also repeats the same content unnecessarily.

So website owners should be careful when using categories and tags. Keep it simple with relevant groupings that truly aid navigation.

Work on URL parameters

Sometimes websites have pages with extra stuff in the URLs - called parameters. Lots of different URLs can confuse search engines and waste the crawl budget.

There are a few things site owners can do:

  • Use canonical tags - These tags tell search engines which URL is the main, preferred one. This avoids crawling duplicate content from the other URLs.

  • Block URLs in robots.txt - Owners can block search bots from crawling unimportant URLs with minor differences. This frees up the crawl budget.

  • Note that Google's URL Parameters tool has been retired, so canonical tags and robots.txt rules are now the main ways to tell Google which parameterized URLs matter.

Preventing Google from crawling your non-canonical URLs

Sometimes a website has the same content available at different URLs.

To fix it, site owners should pick one main URL and mark it with a rel=canonical tag. This tells search engines which URL is the preferred one to show in results.

Owners can also update their robots.txt file. This blocks search bots from crawling the duplicate, non-main URLs.
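
To confirm the canonical tags are actually in place, a small audit script can report the rel=canonical target of each duplicate URL, as in the sketch below; it assumes the requests and beautifulsoup4 packages, and the URLs are placeholders.

    # Sketch: print the rel=canonical target of each URL.
    # Requires "requests" and "beautifulsoup4"; URLs are placeholders.
    import requests
    from bs4 import BeautifulSoup

    urls = [
        "https://www.example.com/page",
        "https://www.example.com/page?utm_source=newsletter",
    ]

    for url in urls:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        canonical = next(
            (link.get("href") for link in soup.find_all("link")
             if "canonical" in (link.get("rel") or [])),
            None)
        print(url, "->", canonical or "no canonical tag found")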

Conclusion

In a nutshell, the crawl budget depends on criteria set by Google that you can work to meet. Appropriate optimization can earn your website a generous crawl budget.

Never opt for bad practices to rank your site, since at best they benefit you only temporarily.

If you own a small website, you need not worry much about the crawl budget, but large websites do require this consideration.
