What is Crawl Budget?
Crawl budget refers to the number of pages that Googlebot is willing to crawl on a website within a given timeframe.
The actual number of pages Google crawls on a given day varies with factors such as the size and complexity of the website, its authority and relevance, and the availability of server resources.
Why is the crawl budget important?
The crawl budget is crucial for SEO because it determines how easily search engine crawlers can find, crawl, and index your website's pages.
This, in turn, affects the website's visibility in search results and its potential traffic and revenue.
If a website has a high crawl budget, search engines like Google can crawl and index more pages quickly.
This means that new content is more likely to be discovered and indexed, leading to increased visibility in search results.
On the other hand, if a website has a low crawl budget, search engines may struggle to crawl and index all of its pages.
This can lead to some pages not being indexed or updated, which can hurt the website's visibility and rankings.
Ultimately, a healthy crawl budget helps ensure that new content is discovered and indexed quickly, while a constrained one can lead to missed opportunities for visibility and traffic.
How does the crawl budget work?
Google determines a website's crawl budget by taking into account two main factors: the crawl rate limit and the crawl demand.
The crawl rate limit is the maximum number of requests per second that a website can handle from Google's crawlers, while the crawl demand is the frequency and urgency with which Google wants to crawl a website's pages.
Crawl demand
Crawl demand is a key factor in determining how frequently and urgently Googlebot crawls a website's pages.
It can have a significant impact on a site's crawl budget, which is the number of URLs that Googlebot can and wants to crawl on a site.
Here are some important considerations related to the crawl demand:
Popularity
Popular URLs, meaning those that receive significant traffic and engagement, need to be crawled more often so they stay fresh in Google's index.
This is where crawl demand comes into play: websites with high traffic and engagement are likely to have a higher crawl demand than their less popular counterparts.
Staleness
The freshness of URLs is also a key factor in Google's indexing process.
Google's systems work to prevent URLs from going stale in the index, which means pages containing time-sensitive information, or pages that are updated frequently, may have higher crawl demand so the index stays up to date.
Site-wide events
Another factor that can impact crawl demand is site-wide events. Significant changes to a website, such as a redesign or a migration to new URLs, may trigger an increase in crawl demand as Google reindexes the content under the new URLs.
It's important for website owners to keep these factors in mind as they work to optimize their sites for search engines and maintain a strong online presence.
By understanding the importance of crawl demand, freshness, and site-wide events, site owners can ensure that their content remains visible and relevant to users.
Crawl rate limit
The crawl rate limit is a crucial setting that plays a significant role in determining the rate at which Googlebot crawls a website.
This setting is responsible for regulating the number of parallel connections that Googlebot can use to crawl the site, as well as the time it needs to wait between fetches.
It's worth noting that there are a few factors that can impact the crawl rate limit.
Crawl Health
If a website is responsive and performs well over an extended period, the crawl rate limit may increase, thereby allowing Googlebot to crawl the site more extensively.
Conversely, if a website slows down or responds with server errors, the crawl rate limit may decrease, causing Googlebot to crawl less frequently.
The limit set in Search Console
Another factor that can impact the crawl rate limit is the limit website owners set in Search Console, which lets them restrict how much Googlebot crawls their site.
However, it's essential to note that setting higher limits doesn't always lead to increased crawling.
By understanding the crawl rate limit and how it impacts Googlebot's crawling of a website, website owners can optimize their site for efficient crawling.
This optimization will help ensure that their pages are being adequately indexed by Google, which is crucial for driving organic traffic to their website.
History of crawl budget
The concept of crawl budget has been around since at least 2009 when Google acknowledged the limitations of its resources and encouraged webmasters to optimize their websites for crawl budget.
At the time, Googlebot was only able to find and crawl a small percentage of the content that was available online, and it could only index a portion of that content.
Over time, SEOs and webmasters began to pay more attention to the crawl budget, recognizing its importance for ensuring that their websites were properly indexed by search engines.
In response to this growing interest, Google published a post in 2017 titled "What crawl budget means for Googlebot," which clarified how Google calculates crawl budget and how it thinks about the concept.
What factors does Google consider for crawl budget allocation?
Google considers several factors to determine the crawl budget allocated to a website.
Some of the main factors include:
Site size: Bigger sites require more crawl budget.
Server setup: A site's performance and load times may affect the crawl budget.
Update frequency: Google prioritizes content that gets updated regularly.
Links: Internal linking structure and dead links.
Faceted navigation and infinite filtering combinations: Faceted navigation can generate a new URL for every combination of selected parameters, wasting crawl budget (see the sketch after this list).
Session identifiers and tracking IDs: Parameters used for analytics or user preferences through the URL may create duplicate pages.
On-site duplicate content, soft error pages, hacked pages, infinite spaces and proxies, low-quality, and spam content: Having many low-value-add URLs can negatively affect crawling and indexing.
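To illustrate how quickly faceted navigation can inflate the number of crawlable URLs, here is a minimal Python sketch that counts the filter combinations a single category page could expose. The facet names and values are hypothetical.

```python
from itertools import product

# Hypothetical facets on a category page; every combination can become its own URL.
facets = {
    "color": ["red", "blue", "green", "black"],
    "size": ["s", "m", "l", "xl"],
    "sort": ["price_asc", "price_desc", "popularity"],
    "in_stock": ["true", "false"],
}

# Each facet can also be left unselected, so add a "not set" option (None).
options = [[None] + values for values in facets.values()]

combinations = [
    combo for combo in product(*options)
    if any(value is not None for value in combo)  # skip the bare, unfiltered URL
]

print(f"One category page can expand into {len(combinations)} parameterized URLs")
# 5 * 5 * 4 * 3 - 1 = 299 extra URLs for a single category
```

Even this small, made-up facet set turns one category page into hundreds of parameterized URLs, each of which Googlebot may try to crawl.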
What crawl budget encompasses other than pages?
Crawl budget refers to the number of pages or documents that search engines crawl on a website within a given timeframe.
Although the term "crawl budget" is often used in the context of HTML pages, it actually encompasses any type of document that search engines crawl.
This includes JavaScript and CSS files, mobile page variants, hreflang variants, PDF files, and other types of documents.
JavaScript and CSS files are important because they can impact the functionality and loading speed of a webpage.
Mobile page variants and hreflang variants are important for websites with multiple language or device-specific versions, as they allow search engines to identify and crawl the correct version of a webpage.
PDF files can also be crawled by search engines and can contain valuable information that needs to be indexed.
When optimizing for a crawl budget, it's important to consider all of these types of documents, not just HTML pages.
By ensuring that all important documents are crawlable and easily accessible to search engines, website owners can help improve their site's overall search engine visibility and organic traffic.
Who should care about the crawl budget?
As Google states:
“...if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”
That means smaller websites with fewer pages are generally crawled and indexed without trouble, and Googlebot is unlikely to run into resource constraints when crawling them.
However, the crawl budget is primarily a concern for larger websites with many pages or complex architectures.
Larger websites with numerous pages may need to optimize their crawl budget to ensure that their content is being crawled and indexed as efficiently as possible.
For owners of large sites, Google has published a comprehensive guide on managing crawl budget, which can help them manage it more efficiently.
How to check the crawl activity?
Website owners can check crawl activity for their websites in two main ways.
Use Google Search Console
Google Search Console can be used to view your website's crawl activity. Here are the steps to check it:
Log in to your Google Search Console account and select the website you want to check.
Click on the "Crawl" tab located on the left side of the screen.
Select "Crawl Stats" under the "Crawl" tab.
In this section, you can view the number of pages Google crawls daily, as well as other useful metrics like page download time and response codes.
Here, you can view the average number of pages that Google crawls on your website per day.
You can estimate the crawl budget by looking at the stats.
For example, let's say the average crawl budget for your website is 100 pages per day.
In theory, Google would crawl 3,000 pages on your website per month (100 pages x 30 days).
However, if your website has a large number of low-quality or duplicate pages, Google may prioritize crawling higher-quality pages, leading to a lower actual crawl rate.
On the other hand, if your website has grown significantly over time and now offers a wealth of high-quality content, Google may increase your crawl budget to keep up with the demand.
For instance, if your average crawl rate two years ago was 100 pages per day and it is now 500 pages per day, that represents a five-fold increase in your crawl budget.
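The back-of-the-envelope arithmetic above can be expressed as a tiny helper; the figures are the hypothetical ones from the example.

```python
def monthly_crawl_estimate(avg_pages_per_day: float, days: int = 30) -> float:
    """Rough projection of how many pages Google might crawl over a period."""
    return avg_pages_per_day * days

print(monthly_crawl_estimate(100))  # 3000 pages per month at 100 pages/day
print(500 / 100)                    # 5.0, i.e. a five-fold increase over the older average
```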
Check your server logs
Server logs are files that record every request made to your website, including the IP address of the requesting user and the pages or resources that were requested.
In addition to checking your crawl budget in Google Search Console, monitoring your website's server logs is recommended to get a more accurate view of how often Google's crawlers are visiting your website.
To check the crawl budget using server log files, you need to access the log files from your web hosting account or server.
These log files contain a record of all the requests that your server has received, including requests from search engine crawlers like Googlebot.
Once you have access to the log files, you can extract valuable insights.
For example, you can identify which pages on your site are being crawled most frequently, which pages are being crawled less often, and which pages are not being crawled at all.
Also, by analyzing your server log files, you can get a better understanding of how search engine crawlers are interacting with your site and make adjustments to optimize your crawl budget.
For instance, if you find that certain pages are being crawled too frequently, you can consider implementing caching or other optimization techniques to reduce the load on your server.
Conversely, if you find that certain pages are not being crawled at all, you may need to improve their internal linking or make other changes to make them more accessible to search engine crawlers.
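As one concrete way to do this, the sketch below parses a typical combined-format access log and counts Googlebot requests per URL. The log path and exact log format are assumptions; adjust the regular expression to your server's configuration, and note that a production analysis should also verify Googlebot's published IP ranges rather than trusting the user-agent string alone.

```python
import re
from collections import Counter

# Assumed combined log format:
# 66.249.66.1 - - [10/May/2024:06:25:13 +0000] "GET /some-page HTTP/1.1" 200 5123 "-" "Mozilla/5.0 ... Googlebot/2.1 ..."
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_path: str) -> Counter:
    """Count requests per URL made by clients identifying as Googlebot."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if match and "Googlebot" in match.group("agent"):
                hits[match.group("path")] += 1
    return hits

if __name__ == "__main__":
    counts = googlebot_hits("/var/log/nginx/access.log")  # hypothetical path
    for path, count in counts.most_common(20):
        print(f"{count:6d}  {path}")
```

The most-crawled and never-crawled URLs that fall out of a report like this are exactly the signals described above.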
What badly affects the crawl budget?
There are several factors that can negatively affect the crawl budget, including:
Duplicate content
Duplicate content refers to pages that are highly similar or identical. Crawling many near-identical pages wastes crawl budget that search engines could otherwise spend on your unique, valuable pages.
Some common causes of duplicate pages can be:
Faceted Search and Session IDs
Faceted search is a technique used in e-commerce and other types of websites to allow users to filter and refine their search results based on various attributes or facets such as price, color, size, and more.
However, faceted search can create multiple URLs that display the same content with different filter parameters, leading to duplicate content issues and potentially wasting Google's crawl budget.
Session IDs are unique identifiers used to track a user's session on a website; when appended to URLs, they create many URLs that all serve the same content. If session IDs are not managed correctly, they can likewise cause crawl budget issues.
When it comes to crawl budget allocation, Google considers various factors, including the website's authority, page speed, content quality, and crawl demand.
However, Google also looks at how efficiently the website uses its crawl budget by examining factors such as the number of unique URLs on the site, the crawl rate of the site, and the server response time.
Therefore, to optimize crawl budget allocation, website owners should pay attention to how faceted search and session IDs are implemented on their sites.
Multiple versions of the same page
If a website has multiple versions of the same page with different URLs, search engines may view these as separate pages, even though the content is the same.
This can lead to search engines allocating their crawl budget to crawl these duplicate pages, instead of focusing on other important pages on the website.
Here are some more examples of multiple versions of the same page with different URLs:
Different subdomains (www vs. non-www): For example, if you have the same content on www.example.com/page and example.com/page, search engines may see these as different pages and may allocate resources to crawl both URLs.
URL parameters: If you have the same content on http://example.com/page and http://example.com/page?utm_source=google&utm_medium=cpc&utm_campaign=summer_sale, search engines may view these as different pages and crawl both URLs, even though they contain the same content.
Case sensitivity: If your server is case-sensitive, search engines may view URLs like http://example.com/page and http://example.com/Page as different pages and crawl both URLs.
Trailing slash: If you have the same content on http://example.com/page and http://example.com/page/, search engines may view these as different pages and crawl both URLs.
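A common defensive measure is to normalize URLs before they are published or linked internally. The sketch below shows one possible normalization routine covering the variants listed above (scheme, host, casing, tracking parameters, trailing slashes); the set of parameters to strip and the preference for non-www are assumptions to tailor to your site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Tracking parameters assumed to be irrelevant to page content; adjust for your site.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "gclid"}

def normalize(url: str) -> str:
    """Map the common duplicate variants of a URL onto a single canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)

    scheme = "https"                      # prefer https over http
    netloc = netloc.lower()
    if netloc.startswith("www."):         # pick one host variant; here: non-www
        netloc = netloc[4:]

    path = path.rstrip("/") or "/"        # unify trailing-slash variants
    path = path.lower()                   # only safe if your server is case-insensitive

    params = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
              if k not in TRACKING_PARAMS]
    query = urlencode(sorted(params))     # stable parameter order

    return urlunsplit((scheme, netloc, path, query, ""))

print(normalize("http://www.Example.com/Page/?utm_source=google&utm_medium=cpc"))
# -> https://example.com/page
```

Applying one routine like this consistently in templates, sitemaps, and redirects keeps all internal signals pointing at a single version of each page.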
URL parameters and sorting/filtering options
When e-commerce websites have sorting and filtering options that create unique URLs for each option, it can lead to duplicate content issues.
This means that search engines may crawl and index multiple versions of the same page, which can waste their resources and negatively impact the crawl budget of the site.
For example, if an e-commerce site has multiple URLs for a single product page with different sorting options, such as by price or by popularity, search engines may have trouble determining which version to index.
Never-ending URLs
"Never-ending URLs" refer to URLs that continue to expand as users keep scrolling down a page, loading more content dynamically without refreshing the page.
These types of URLs are commonly used on social media platforms, news websites, and e-commerce sites that display a large amount of content on a single page.
Never-ending URLs can also lead to duplicate content issues, as search engines may interpret different versions of the same page as separate URLs, further wasting the crawl budget on duplicate content.
In addition, the endless crawling and indexing of pages can lead to a bloated index, which can negatively impact the site's overall search engine performance.
High numbers of non-indexable pages
Having a high number of non-indexable pages can significantly affect your website's crawl budget.
When search engine crawlers visit your website, they have a limited amount of resources and time to crawl and index all the pages on your site.
If a significant portion of your pages is non-indexable, it can lead to a waste of resources and time for the crawlers.
Here are some ways in which different types of non-indexable pages can negatively impact your website's crawl budget (a small status-code audit sketch follows these examples):
Redirects (3xx)
When search engine crawlers encounter redirects, they will follow the redirect and crawl the new URL.
However, if there are too many redirects, it can lead to a waste of resources as crawlers may not be able to crawl all the redirected pages in a single visit.
This can also impact the time it takes for search engines to discover new content on your site.
Pages that can't be found (4xx)
If your website contains a high number of pages that return a 4xx status code (e.g. 404 Not Found), crawlers waste requests on URLs that return nothing useful, and they may periodically recheck those URLs instead of crawling your live pages.
Pages with server errors (5xx)
If your website contains a high number of pages that return a 5xx status code (e.g. 500 Internal Server Error), it can signal to search engines that your website is experiencing technical issues.
This can negatively impact your website's crawl budget as search engines may reduce their crawl rate or stop crawling your site altogether until the issues are resolved.
Pages containing noindex directive
Pages that contain a robots noindex directive are still crawled, since the crawler has to fetch a page to see the directive, but they are excluded from the index.
If there are a large number of them, they can therefore still eat into your website's crawl budget.
This is because search engines spend time crawling and re-crawling these pages instead of discovering and indexing new content on your site.
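To spot these issues in bulk, a simple status-code audit like the following can help. It uses the third-party requests library, and the URL list is hypothetical; in practice, pull the list from your sitemap or crawl data.

```python
import re
import requests  # third-party; pip install requests

# Hypothetical URLs to audit.
URLS = [
    "https://example.com/",
    "https://example.com/old-page",
    "https://example.com/search?q=shoes",
]

# Crude check for <meta name="robots" content="...noindex...">; a real audit
# should parse the HTML properly and handle name/content in either order.
NOINDEX_META = re.compile(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', re.IGNORECASE)

def audit(url: str) -> str:
    try:
        response = requests.get(url, timeout=10, allow_redirects=False)
    except requests.RequestException as error:
        return f"unreachable ({type(error).__name__})"

    status = response.status_code
    if 300 <= status < 400:
        return f"{status} redirect -> {response.headers.get('Location', '?')}"
    if 400 <= status < 500:
        return f"{status} client error (wasted crawl)"
    if status >= 500:
        return f"{status} server error (may slow crawling)"
    if "noindex" in response.headers.get("X-Robots-Tag", "").lower() or NOINDEX_META.search(response.text):
        return f"{status} OK but noindex (crawled, not indexed)"
    return f"{status} OK, indexable"

for url in URLS:
    print(f"{url}\n  -> {audit(url)}")
```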
Hacking
Hacking can have a significant impact on the crawl budget by creating a large number of non-indexable pages, redirects, and malicious content on the website.
This can result in a decrease in crawl frequency and a decrease in search engine rankings.
Some ways hacking can harm crawl budget include:
Injecting code or content
Hackers may inject malicious code or content into a website to manipulate search engine rankings or to install malware on users' machines.
This can lead to search engines crawling irrelevant pages, reducing the crawl budget, and negatively affecting search rankings.
Creating new pages
Hackers can create new pages on a website with spammy or malicious content, which can harm the site's visitors and its performance in search results.
Search engines may crawl these pages and reduce the crawl budget of the website.
Manipulating existing pages
Hackers can also subtly manipulate existing pages by adding hidden links or text to a page using CSS or HTML. This can be harder for users to spot, but search engines can still see it and may reduce the crawl budget of the website.
Redirects
Hackers may inject code that redirects users to harmful or spammy pages, which can harm the site's visitors and lead to a decrease in the crawl budget.
Low-quality content
Low-quality content refers to pages that have very little content or provide little value to users.
This type of content can negatively impact a website's crawl budget because search engines may view these pages as unimportant or not worth crawling.
Low-quality content can include thin or duplicate content, keyword-stuffed content, and content that does not meet the user's intent.
One example of low-quality content is a FAQ section where each question and answer is served over a separate URL.
This type of content not only creates a large number of low-quality pages but also generates multiple URLs for essentially the same content.
This can lead to duplicate content issues and confusion for search engines as they try to determine which URL to crawl and index.
Pages with high load time
Page load time refers to the amount of time it takes for a web page to fully load and display all of its content in a user's web browser.
It includes the time it takes to retrieve all the necessary files, such as HTML, CSS, JavaScript, images, and other multimedia content.
When a page takes a long time to load or doesn't load at all, search engines may view it as an indication of poor website performance.
As a result, the search engine may reduce the number of pages it crawls on the website, which can negatively impact the website's crawl budget.
Bad internal link structure
Internal link structure refers to the way in which pages on a website are linked to one another. It includes the organization of pages within the site's hierarchy and the links that connect them.
The internal link structure of a website is an important factor that affects the crawl budget.
If the internal link structure is poorly set up, it can result in certain pages not getting enough attention from search engines, which can negatively impact the crawl budget.
Pages that have few internal links may not be frequently crawled by search engines, as they may be viewed as less important.
Pages buried deep in a very hierarchical link structure may also be neglected by search engines because few links point to them.
This means that search engines may not crawl these pages as often, resulting in a lower crawl budget for those pages.
On the other hand, pages with a lot of internal links are more likely to be crawled frequently by search engines.
This means that these pages will have a higher crawl budget compared to pages with few internal links.
Incorrect URLs in XML sitemap
An XML sitemap is a file that lists all the pages on a website that a webmaster wants search engines to index.
It helps search engines crawl a website more efficiently by providing a roadmap of all the pages.
Incorrect URLs in XML sitemaps can significantly impact crawl budget optimization.
If a website's XML sitemap contains URLs that no longer exist or have been redirected, search engines may spend their crawl budget trying to access these pages.
This can result in a wasted crawl budget and slow down the indexing of the website's important pages.
On the other hand, if a website's XML sitemap excludes important pages, search engines may not crawl them, limiting the website's visibility in search results.
This can also affect the website's crawl budget negatively as search engines may not crawl all the pages that they could have crawled if the XML sitemap was correctly structured.
Regularly checking the XML sitemap for errors and making sure it only contains URLs of indexable pages is crucial for efficient crawl budget management.
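As a quick sanity check along those lines, the sketch below downloads a sitemap and flags any listed URL that does not return a plain 200 response. It assumes a standard <urlset> sitemap at a hypothetical address and uses the third-party requests library.

```python
import xml.etree.ElementTree as ET
import requests  # third-party; pip install requests

SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url: str) -> list[str]:
    """Return the <loc> entries of a standard <urlset> sitemap."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NAMESPACE)]

def check(url: str) -> tuple[int, str]:
    response = requests.head(url, timeout=10, allow_redirects=False)
    return response.status_code, response.headers.get("Location", "")

for url in sitemap_urls(SITEMAP_URL):
    status, location = check(url)
    if status != 200:
        # Anything other than a direct 200 either wastes crawl budget or hides the page.
        print(f"{status} {url}" + (f" -> {location}" if location else ""))
```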
Crawl budget optimization
You can optimize your crawl budget more effectively by keeping in mind a few technical details that are otherwise easy to overlook.
Build an organized and updated XML sitemap
An XML sitemap is particularly useful for large or new websites with many pages. It can typically be accessed by adding "/sitemap.xml" after the main URL of the website.
It's important to note that the XML sitemap is not intended for users, as it is written in a specific format that search engine bots can understand, rather than HTML, which is more user-friendly.
Therefore, it's important to include only indexable, updated, and essential pages in the sitemap, and to update it regularly.
After creating the sitemap, it should be uploaded to the website's root directory and submitted to Google through Google Search Console.
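Most content management systems and frameworks generate the sitemap for you, but for a hand-rolled site the file itself can be produced along these lines; the URLs and dates below are placeholders.

```python
from xml.sax.saxutils import escape

# Placeholder entries: (URL, last modification date). Only include indexable, canonical pages.
PAGES = [
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/blog/crawl-budget-guide", "2024-04-18"),
]

def build_sitemap(pages) -> str:
    entries = "\n".join(
        f"  <url>\n    <loc>{escape(url)}</loc>\n    <lastmod>{lastmod}</lastmod>\n  </url>"
        for url, lastmod in pages
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )

with open("sitemap.xml", "w", encoding="utf-8") as handle:
    handle.write(build_sitemap(PAGES))
```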
Fix broken links
Broken links are hyperlinks on a web page that no longer work, typically because the destination page has been deleted or moved or its URL has changed.
When search engines encounter broken links or long chains of redirects, they are unable to reach the intended page and essentially hit a "dead end."
This can waste the crawl budget as search engines spend time attempting to access pages that are no longer available or are redirected multiple times, reducing the number of pages they can crawl on your site.
By fixing broken links and reducing the number of redirects, you can quickly recover the wasted crawl budget and improve the user experience for visitors to your site.
To fix broken links and optimize your crawl budget, you can follow these steps (a minimal link-checker sketch follows the list):
Identify broken links
Use a broken link checker tool or crawl your website regularly to identify any broken links.
Prioritize fixing high-priority broken links
Fix links that are important for user experience or lead to important pages on your website first.
Update external links
After fixing internal broken links, update any external links pointing to the broken link.
Redirect broken external links
If a broken link is pointing to an external website, consider redirecting it to a working page on a similar topic or replacing it completely with a new URL.
Use 301 redirects
For broken links that cannot be fixed, use a 301 redirect to point the link to a working page. This ensures that any traffic or search engine value from the broken link is redirected to a working page on your website.
Monitor and recheck regularly
Continuously monitor your website for broken links and recheck previously fixed links to ensure they remain working.
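As a starting point for the first step, here is a minimal, hypothetical link-checker sketch: it fetches one page, extracts its internal links, and flags any that are broken or redirected. It uses the third-party requests library and only audits a single page; a real checker would crawl the whole site and respect robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

import requests  # third-party; pip install requests

START_URL = "https://example.com/"  # hypothetical starting page

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def find_broken_links(page_url: str) -> None:
    page = requests.get(page_url, timeout=10)
    parser = LinkExtractor()
    parser.feed(page.text)

    site_host = urlsplit(page_url).netloc
    for href in parser.links:
        url = urljoin(page_url, href)
        if urlsplit(url).netloc != site_host:
            continue  # only audit internal links in this sketch
        try:
            status = requests.head(url, timeout=10, allow_redirects=False).status_code
        except requests.RequestException:
            print(f"UNREACHABLE: {url} (linked from {page_url})")
            continue
        if status >= 400:
            print(f"BROKEN {status}: {url} (linked from {page_url})")
        elif 300 <= status < 400:
            print(f"REDIRECT {status}: {url} (update the link to the final target)")

find_broken_links(START_URL)
```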
Carefully build an internal link structure
Having a solid internal link structure is crucial for optimizing the crawl budget and improving the user experience on your website.
One important aspect of internal linking is ensuring that your most important pages have plenty of links pointing to them.
This helps search engines understand the hierarchy and importance of your content.
When it comes to older content that still drives traffic, make sure it remains easily reachable by linking to it from newer articles, blog posts, or other relevant pages on your website.
This keeps the older content within your internal link structure and visible to both users and crawlers.
Fast load time
One of the ways to optimize the crawl budget is by ensuring that your website pages have fast load times.
Pages that take more than 5 seconds to load are problematic and hurt the user experience, so aim to keep load times well under that threshold.
To improve and monitor your website's page load times, consider the following points:
Optimize your JavaScript
One of the most important factors that affect user experience and search engine rankings is page speed.
JavaScript can often be a source of slow page load times, so optimizing it can have a significant impact.
This can involve reducing the size of JavaScript files, minimizing the number of requests by combining files, and deferring non-critical JavaScript.
Optimize images
Images can significantly impact the loading time of a webpage. You can optimize images by compressing them, reducing their file size, and choosing the right file format.
Minimize HTTP requests
Each request made to the server takes time to complete, so minimizing the number of HTTP requests can improve load time.
You can do this by combining multiple files into one or using CSS sprites.
Reduce server response time
The time it takes for the server to respond to a request can also impact page load time. To reduce server response time, you can use a content delivery network (CDN), reduce the number of plugins or scripts running on the server, and use caching.
Minify code
Minifying code involves removing unnecessary characters, such as white space and comments, from HTML, CSS, and JavaScript files. This reduces file size and improves page load time.
Implement lazy loading
Lazy loading involves loading only the content that is initially visible to the user (above-the-fold content), while delaying the loading of other content until the user scrolls down the page.
Use tools to check the page load time
You can use various online tools such as Pingdom, WebPagetest, or GTmetrix to check page load time.
Google Analytics and Google Search Console provide insights into your website's page load times, and you can access this information under Behavior > Site Speed and Crawl > Crawl Stats, respectively.
Google Search Console and Bing Webmaster Tools also report on page timeouts, which occur when a page takes too long to load.
You can find this information under Crawl > Crawl Errors in Google Search Console and Reports & Data > Crawl Information in Bing Webmaster Tools.
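Dedicated tools give the fullest picture, but as a rough first pass you can time the server response for a few key URLs yourself. The sketch below measures the time to download the HTML (not full in-browser rendering, which also includes images, CSS, and JavaScript); the URLs are placeholders.

```python
import time
import requests  # third-party; pip install requests

URLS = [  # hypothetical pages worth keeping fast
    "https://example.com/",
    "https://example.com/blog/crawl-budget-guide",
]

for url in URLS:
    start = time.perf_counter()
    response = requests.get(url, timeout=30)
    elapsed = time.perf_counter() - start          # time to fetch the HTML document
    size_kb = len(response.content) / 1024
    flag = "SLOW" if elapsed > 5 else "ok  "       # 5-second threshold from the guidance above
    print(f"{flag} {elapsed:5.2f}s  {size_kb:8.1f} KB  {url}")
```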
Avoid duplicate content
One of the best ways to optimize the crawl budget is to prevent duplicate content. Here are some ways to do that:
Setting up website redirects for all domain variants (HTTP, HTTPS, non-WWW, and WWW)
It's important to redirect all versions of your domain to a single canonical version to avoid duplicate content issues. For example, if your website is accessible through both http and https, set up a redirect from the http version to the https version.
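These redirects are normally configured at the web server or CDN level, but to illustrate the logic, here is a minimal sketch using Flask (an assumption; your stack will differ) that 301-redirects HTTP and non-canonical-host requests to a single canonical origin.

```python
from urllib.parse import urlsplit, urlunsplit

from flask import Flask, redirect, request  # third-party; pip install flask

app = Flask(__name__)
CANONICAL_HOST = "www.example.com"  # hypothetical canonical host

@app.before_request
def enforce_canonical_origin():
    # Note: behind a reverse proxy you may need werkzeug's ProxyFix so
    # request.url reflects the original scheme and host.
    url = urlsplit(request.url)
    if url.scheme != "https" or url.netloc != CANONICAL_HOST:
        canonical = urlunsplit(("https", CANONICAL_HOST, url.path, url.query, ""))
        return redirect(canonical, code=301)  # permanent redirect consolidates signals

@app.route("/")
def home():
    return "Hello from the canonical origin"
```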
Making internal search result pages inaccessible to search engines
Internal search result pages can generate a lot of duplicate content. To prevent search engines from crawling and indexing them, add the relevant directives to your robots.txt file.
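For example, if your internal search lives under a hypothetical /search path, the robots.txt rule and a quick verification with Python's built-in robotparser might look like this. Note that the standard-library parser matches literal path prefixes, while Googlebot additionally supports wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# The rule you would add to https://example.com/robots.txt, shown here as a string.
# The /search prefix is an assumption about how your internal search URLs look.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in [
    "https://example.com/search?q=shoes",           # internal search result page
    "https://example.com/blog/crawl-budget-guide",  # normal content page
]:
    verdict = "blocked" if not parser.can_fetch("Googlebot", url) else "allowed"
    print(f"{verdict}: {url}")
```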
Disabling dedicated pages for images
Many content management systems like WordPress create dedicated pages for images.
These pages often have little or no content and can dilute the value of your website. To prevent duplicate content issues, disable these pages.
Being careful about categories and tags
Taxonomies like categories and tags can create duplicate content issues if they're not used correctly.
Avoid creating too many tags or categories and try to keep them organized and relevant to your content.
Work on URL parameters
To fix crawl budget issues related to URL parameters, you can take several steps:
Use canonical tags: If you have multiple versions of a page with different URL parameters, use canonical tags to indicate the preferred version. This will help search engines identify the main version of the page and avoid crawling duplicate content.
Use the robots.txt file: You can use the robots.txt file to block search engines from crawling pages with certain URL parameters. For example, if you have pages with sorting or filtering parameters that don't change the content significantly, you can disallow the crawling of those pages using the robots.txt file.
Use the URL parameter tool in Google Search Console: Google Search Console used to offer a URL Parameters tool for specifying how different parameters should be handled; Google has since retired it, so rely on canonical tags, robots.txt rules, and consistent internal linking to control parameterized URLs.
By implementing these steps, you can help search engines crawl and index your site more efficiently, and improve your site's visibility in search results.
Preventing Google from crawling your non-canonical URLs
When you have multiple versions of the same content on different URLs, it's important to choose one URL as the canonical version and use the canonical tag to indicate to search engines that it is the preferred version.
This can help prevent duplicate content issues and improve the overall ranking of your website.
Additionally, you can use the robots.txt file to block search engine crawlers from accessing non-canonical URLs, which can help prevent them from appearing in search results and potentially diluting your website's authority.
Conclusion
Crawl budget is an essential factor in search engine optimization that determines how efficiently search engines crawl and index your website.
By optimizing your crawl budget, you can ensure that search engines focus on crawling and indexing your most important pages, which can lead to higher search engine rankings, improved user experience, and increased website traffic.