A robots.txt file is one of the simplest files on your website, yet it is also one of the easiest to get wrong. Here’s a fun fact: many people don’t know that a robots.txt file even exists.
A small mistake in this file can wreak havoc by making your website inaccessible to search engines. If search engines can’t crawl your website, your pages are unlikely to appear in search results, which means most users will never find them.
Using a robots.txt file correctly can help boost the search engine rankings of a website. However, before you can optimize a robots.txt file, you need to learn what it is and why it is so important.
What is a robots.txt file?
A robots.txt file tells search engines which parts of your website they can and can’t access. It is primarily used to keep crawlers away from content you don’t want them to crawl. It does not hide anything from human visitors, because the file itself and the pages it lists remain publicly reachable.
In more technical terms, a robots.txt file is a plain text file that guides web robots on how to crawl the pages of a website. It is part of the Robots Exclusion Protocol (REP), a group of standards that regulate how robots crawl websites.
The main downside of a robots.txt file is that its rules are advisory: well-behaved crawlers follow them, but some crawlers ignore them.
Why should you learn about robots.txt?
When a search engine is about to crawl a website, its bots visit the robots.txt file before moving on to any target page. This file provides instructions to the web bots.
Web bots follow these instructions when crawling a website. If the instructions in your robots.txt file are not correct, search engines will not be able to crawl and index your website properly.
To identify any mistakes in your robots.txt file, you first need to learn how it works.
A robots.txt file is not equally important for the success of every website. In fact, many SEOs suggest that some websites don’t even need one, because search engines like Google can usually find and index all the crucial web pages on a website.
Some search engines also automatically skip indexing web pages with duplicate content or different versions of the same content.
Having said that, we believe every website should take advantage of the robots.txt file for the following purposes.
- Blocking web pages that you want to keep out of search results
Every website has some web pages that should not be indexed, such as staging versions of new pages and the admin login page. These pages are important because you can’t run your website without them, but you don’t want them showing up in search results.
You can use a robots.txt file to keep random visitors from landing on these pages through search. Keep in mind that blocking crawlers does not make a page inaccessible: anyone with the URL can still open it, so pages that need real protection should also sit behind a login.
- Maximizing the crawl budget
Crawl budget is the number of web pages that crawl bots will crawl on your website within a given time frame. You might have a crawl budget problem if you have trouble getting all your web pages indexed. This problem usually occurs when a website has a large number of web pages.
The easiest way to solve this problem is to block unimportant web pages from being crawled. When you block these pages, search engines like Google will spend your crawl budget on the web pages that matter the most.
- Preventing indexing of multimedia resources
Meta directives work just as well as robots.txt for web pages, which is why many people prefer using them to keep certain pages out of the index. However, a meta tag can only be placed in an HTML page, so it can’t be used for multimedia resources.
For multimedia files such as PDFs and images, the robots.txt file is the better choice.
Pro tip: Use Google Search Console to look at the number of indexed web pages your website has. If this number matches the number of web pages you want search engines to index, you don’t need to change anything in your robots.txt file.
What does a robots.txt file look like?
Here is a very basic version of a robots.txt file.
In this robots.txt file, the ‘*’ after User-agent means that the rules apply to all robots visiting the website.
The Disallow directive prevents robots from crawling the listed paths. In this file, we have told robots to skip the “wp-admin” directory and the plugins directory under “wp-content”.
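Because the example is described only in words here, the following is a minimal sketch of what such a file might contain, wrapped in Python so it can be checked with the standard-library `urllib.robotparser` module. The exact paths (`/wp-admin/`, `/wp-content/plugins/`) and the `example.com` URLs are assumptions based on the description above, not the original file.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents, reconstructed from the description in the text.
BASIC_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
"""

parser = RobotFileParser()
parser.parse(BASIC_ROBOTS_TXT.splitlines())

# "*" applies to every crawler, so the user-agent name does not matter here.
admin_allowed = parser.can_fetch("Googlebot", "https://example.com/wp-admin/")
post_allowed = parser.can_fetch("Googlebot", "https://example.com/blog/my-post/")

print("wp-admin crawlable:", admin_allowed)
print("blog post crawlable:", post_allowed)
```

Paths that match no Disallow rule stay crawlable, which is why the blog post is still allowed.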
This file may seem a little daunting to those who have never seen a robots.txt file before, but you can easily learn its ins and outs if you dedicate a little time to it.
A robots.txt file has two main components: user-agents and directives.
Every search engine crawler identifies itself with a user-agent. Website owners can use these names in a robots.txt file to give different instructions to different search engines. There are hundreds of user-agents in use. Some common ones are Googlebot (Google), Bingbot (Bing), Slurp (Yahoo), Baiduspider (Baidu), and DuckDuckBot (DuckDuckGo).
You can also use an asterisk (*) to assign directives to all user-agents.
If you want to block all bots except Bingbot, here’s how you can do it.
In this file, the User-agent: * group addresses all bots and disallows them from crawling your website.
The User-agent: Bingbot group tells Bingbot that it can crawl your website.
One thing to keep in mind is that every new User-agent declaration acts like a clean slate. This means that directives declared for one user-agent do not carry over to the next, and a robot follows only the group that matches it most specifically, regardless of where that group appears in the file.
You can see this in the example above: Bingbot ignores the rules in the User-agent: * group because it has a group of its own.
There is one exception to the clean-slate rule, and it applies when you declare the same user-agent more than once. In that case, all the directives for that user-agent are combined and followed.
Important note: Bots such as Googlebot follow only the most specific group that applies to them. If your file contains a User-agent: Googlebot group, Googlebot ignores the User-agent: * group entirely; the * group is used only when no more specific group exists.
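As a rough illustration of the "block everyone except Bingbot" setup, here is a sketch that feeds such a file to Python's standard-library `urllib.robotparser` and asks what two crawlers may fetch. The file contents, the `SomeOtherBot` name, and the `example.com` URL are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed file: block every bot, then give Bingbot its own group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Bingbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/any-page/"

# Bingbot matches its own group, so the "*" group is not consulted for it.
bing_allowed = parser.can_fetch("Bingbot", url)

# A bot without a group of its own falls back to the "*" group.
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("Bingbot:", bing_allowed)
print("SomeOtherBot:", other_allowed)
```

The Bingbot group appears last in the file, but what matters is that it names Bingbot specifically, not its position.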
Directives are the rules that you want the bots to follow. Different search engines support different directives. Here are the ones that Google supports.
The Disallow directive instructs search engines not to crawl the listed web pages. You can also use it to prevent search engines from crawling specific files.
For example, a robots.txt file that disallows your blog directory for all user-agents tells all search engines not to crawl your blog.
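A minimal sketch of a blog-wide Disallow rule follows, checked with Python's standard-library `urllib.robotparser`. The `/blog/` path and the `example.com` URLs are assumed for illustration; the real path depends on how a site is organised.

```python
from urllib.robotparser import RobotFileParser

# Assumed file: keep all crawlers out of everything under /blog/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blog_post_allowed = parser.can_fetch("*", "https://example.com/blog/my-first-post/")
about_allowed = parser.can_fetch("*", "https://example.com/about/")

print("blog post crawlable:", blog_post_allowed)
print("about page crawlable:", about_allowed)
```

Disallow rules match by path prefix, so the trailing slash matters: `/blog/` covers the blog directory, while `/blog` would also match a path such as `/blogging-tips/`.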
The Allow directive lets search engines crawl a specific part of your website, such as a subdirectory or a single page, inside an otherwise disallowed directory.
If you want to stop search engines from crawling all your blog posts except for one, you can disallow the blog directory and add an Allow rule for that one post.
You can use a similar rule to let search engines crawl any part of your website that is otherwise disallowed.
Both Google and Bing support this directive.
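The sketch below shows one way an Allow exception inside a disallowed blog directory could look, again checked with Python's standard-library `urllib.robotparser`. The paths and URLs are hypothetical. One caveat: Google picks the most specific (longest) matching rule regardless of order, whereas Python's parser applies the first rule that matches, so the Allow line is placed first here to get the same result from both.

```python
from urllib.robotparser import RobotFileParser

# Assumed file: one blog post stays crawlable, the rest of /blog/ does not.
# Allow comes first because urllib.robotparser uses first-match semantics.
ROBOTS_TXT = """\
User-agent: *
Allow: /blog/allowed-post/
Disallow: /blog/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_post = parser.can_fetch("*", "https://example.com/blog/allowed-post/")
other_post = parser.can_fetch("*", "https://example.com/blog/other-post/")

print("allowed post crawlable:", allowed_post)
print("other post crawlable:", other_post)
```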
The Sitemap directive tells search engines where your website’s sitemap is located. Sitemaps generally list the web pages that you want search engines to crawl and index. However, many websites don’t have a sitemap. If yours is one of them, you can use a sitemap generator.
You include your sitemap in the robots.txt file by adding a Sitemap line followed by the full URL of the sitemap.
Adding a sitemap to your robots.txt file is good practice because it helps search engines locate it. If you have already submitted your sitemap through Google Search Console, adding it to the robots.txt file is redundant for Google, but it will help other search engines such as Bing.
You don’t need to repeat this directive for every user-agent. Instead, just add it once at the beginning or end of your robots.txt file.
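As a small sketch, a Sitemap line can be read back with Python's standard-library `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The sitemap URL and the Disallow path are placeholders, not values from a real site.

```python
from urllib.robotparser import RobotFileParser

# Assumed file: the Sitemap line stands on its own, outside any
# user-agent group, and uses a full URL.
ROBOTS_TXT = """\
Sitemap: https://example.com/sitemap.xml

User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

Because the Sitemap line is independent of user-agent groups, declaring it once is enough for every crawler that reads the file.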
- Finding the robots.txt file on your website
Follow this method to take a quick look at your robots.txt file. It works for almost all websites, so you can also use it to look at the robots.txt files of your competitors.
Start by opening your browser and typing in the URL of your website.
In this case, we are using the URL of Ahrefs.
Then add /robots.txt to the end of the address.
In the first scenario, you’ll find the robots.txt file.
Just like we did
Here, take a look at YouTube’s robots.txt file.
In the second scenario, you’ll find a blank page.
Just like we did by typing in
In the third scenario, this method returns a 404 error.
If you find a blank page or a 404 when looking for your robots.txt file, you’ll have to resolve this problem.
If you do find a robots.txt file, it is probably still set to the default settings that were created along with your website.
This method is especially useful for comparing your robots.txt file to your competitors’.
If you don’t find a valid robots.txt file, follow the instructions given below.
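The three scenarios above can be sketched as a small helper. This is only an illustration: `robots_url` and `classify_response` are hypothetical names, and the actual HTTP request is left out so the logic can be read and tested without network access.

```python
from urllib.parse import urljoin


def robots_url(site_url: str) -> str:
    """Build the robots.txt address, which always sits at the site root."""
    return urljoin(site_url, "/robots.txt")


def classify_response(status: int, body: str) -> str:
    """Map an HTTP status and response body to one of the scenarios."""
    if status == 404:
        return "missing"   # third scenario: no file at all
    if status == 200 and not body.strip():
        return "empty"     # second scenario: a blank page
    if status == 200:
        return "found"     # first scenario: a robots.txt file exists
    return "other"         # anything else needs a closer look


print(robots_url("https://example.com/blog/some-post/"))
print(classify_response(200, "User-agent: *\nDisallow:\n"))
print(classify_response(200, "   \n"))
print(classify_response(404, ""))
```

In practice the status and body would come from an HTTP client of your choice; both "empty" and "missing" mean a file still needs to be created.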
Creating your robots.txt file
Creating a robots.txt file is not difficult. Start by opening a plain text editor. Avoid using Microsoft Word for this purpose, because word processors can add formatting that breaks the file.
Most websites have a default robots.txt file, and you can find yours in your website’s root directory.
Once you locate your file, open it for editing and delete all the existing text.
To create a robots.txt file, you’ll first have to learn some basic syntax.
You can learn the syntax used in robots.txt with a few searches. Make sure you only rely on information from reliable sources.
We’ve already discussed most of these terms in this article, so we’ll move on to making your basic robots.txt file.
In this file, we’ve only used the terms that are described in detail above.
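As a minimal sketch of such a basic file, the snippet below assembles one from the pieces covered earlier (User-agent, Disallow, and Sitemap). The helper name `build_robots_txt`, the `/wp-admin/` path, and the sitemap URL are all illustrative assumptions, not the file from the original example.

```python
def build_robots_txt(user_agent, disallow_paths, sitemap_url=None):
    """Assemble the text of a simple robots.txt file with one group."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow_paths]
    if sitemap_url:
        # A blank line separates the group from the sitemap declaration.
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


content = build_robots_txt(
    user_agent="*",
    disallow_paths=["/wp-admin/"],
    sitemap_url="https://example.com/sitemap.xml",
)
print(content)
```

The resulting text would be saved as a plain text file named robots.txt in the website's root directory.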
Now for the fun part. Let’s optimize this file for search engines to boost the SEO of your website.
- Optimizing your robots.txt file
How you optimize your robots.txt file will depend on your website’s content. There are a couple of ways in which you can use the file to benefit your website.
Before we start, keep in mind that a robots.txt file should never be your only means of keeping web pages out of search engines. It controls crawling, not indexing, so a blocked page can still end up in search results if other pages link to it.
The best way to use your robots.txt file is to maximize the crawl budget. You can do this by telling search engines not to crawl the parts of your website that are not meant for the public.
Disallow all your admin login pages. Also, disallow any printer-friendly versions of your web pages. Finally, disallow all the thank-you pages if your website has them.
You should also disallow any other web pages that are not meant to be seen by the public. Then, once you are done, you can test everything out.
You can use Google Search Console (formerly Google Webmaster Tools) for this purpose, because Google provides free tools there for checking your robots.txt file.
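Before uploading, the rules can also be sanity-checked locally. The following sketch uses Python's standard-library `urllib.robotparser`; the three disallowed paths (`/wp-admin/`, `/print/`, `/thank-you/`) and the `example.com` URLs are hypothetical stand-ins for a site's admin, printer-friendly, and thank-you pages. It complements a search engine's own testing tools rather than replacing them.

```python
from urllib.robotparser import RobotFileParser

# Assumed file: block admin, printer-friendly, and thank-you pages.
OPTIMIZED_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /print/
Disallow: /thank-you/
"""

parser = RobotFileParser()
parser.parse(OPTIMIZED_ROBOTS_TXT.splitlines())

# URLs to check, paired with whether crawling is expected to be allowed.
expectations = {
    "https://example.com/": True,
    "https://example.com/blog/seo-tips/": True,
    "https://example.com/wp-admin/": False,
    "https://example.com/print/seo-tips/": False,
    "https://example.com/thank-you/": False,
}

results = {url: parser.can_fetch("*", url) for url in expectations}
mismatches = [url for url, ok in expectations.items() if results[url] != ok]

print("mismatches:", mismatches)
```

An empty list of mismatches means every URL behaves as intended under these rules.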
And that is all you need to do.
I know this topic may have seemed complicated at first glance, but making and testing a robots.txt file is really easy. You just need to get the hang of its syntax.
Lastly, we will look at the errors you can run into while using a robots.txt file and how you can fix them.
- How to solve errors related to your robots.txt file?
Many of us have faced the “indexed, though blocked by robots.txt” warning. It is one of the most common issues related to a robots.txt file, and it occurs when Google has indexed URLs that it was not allowed to crawl.
If you don’t want the page in search results, the easiest fix is to add a noindex meta robots tag to the head of that web page.
Google can index a disallowed URL without crawling it, for example when other pages link to it, but it will not index a page on which it finds a noindex tag. For Google to see the tag, it has to be able to crawl the page, so remove the Disallow rule for that page after adding the tag.
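To confirm that a page actually carries the tag, a quick check along these lines may help. It is a sketch using Python's standard-library `html.parser`; the class and function names are made up for the example, and it only looks at the meta robots tag, not at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a noindex meta robots tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


tagged = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
untagged = "<html><head><title>Hello</title></head></html>"

print(has_noindex(tagged))
print(has_noindex(untagged))
```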
This method only applies if you don’t want the web page to be indexed. If you do want it indexed, look into why Google is reporting this warning. Most likely, you’ll find some crawl blocks that you’ll have to remove.
Follow the steps below to find any crawl blocks.
- Check if there are any crawl blocks in your robots.txt file.
- Check if there are any intermittent blocks.
- Check if there are any user-agent blocks.
- Check if there are any IP blocks.
Once you have found the crawl block, remove it. If you are unable to do it yourself, you can always get help from a professional.
You don’t always have to bend over backward to improve the SEO of your website. There are a couple of tips and tricks that can help with little to no effort. One of these is optimizing your robots.txt file to get the most out of your crawl budget.
By optimizing your robots.txt file, you are improving the SEO of your website and helping search engines reach the pages you want your visitors to find.
An optimized robots.txt file helps search engines spend their crawl budget in the best way, which, in turn, can improve the search engine rankings of your website.
Setting up a robots.txt file is pretty easy, and you can do it within a few hours. Another great thing about the robots.txt file is that you’ll mostly only have to set it up once.
Once you have made your robots.txt file, you can occasionally revisit it to check that everything is still working well.
The most important thing to remember is that, no matter how old or new your website is, an optimized robots.txt file can make a significant difference in its search engine rankings.