The robots.txt file gives a website owner a great deal of power over search engines, placing sections of the website off-limits to ethical crawlers. If you don’t know what a robots.txt file is, proceed with caution: one mistake at this level could block your entire site from appearing in search engines. This file deserves extra due diligence, as most of the errors I deal with are formatting mistakes or typos. The good news is that the most visible parts of your SEO will be screaming that something is wrong with a broken robots.txt file – allowing you to delve deeper and identify the root cause. There is also a robots.txt testing facility within Google Search Console, which is well worth using as it can help you catch errors against live URLs before you deploy changes.
The first symptom of a problematic robots.txt file is a warning within the search engines; the error typically reads along the lines of “page cannot be displayed due to robots.txt”. Due to the critical nature of robots.txt, I suspect all website auditing tools will identify major issues within the file. I believe this for two reasons: firstly, most crawlers consult your robots.txt file for direction, and secondly, the emphasis placed on this text file is enormous. A blind spot many SEOs have is the ability to identify major issues against global visibility but not on a granular level. For example, if your full website were restricted to crawlers, this would flag a major error, but if you blocked only a significant section of your site, the crawler may not identify this as a major error – the diagnostics would more likely lie in the search engine results or Search Console.
A key diagnostic tool for this type of error sits within Google Search Console: if you go to Coverage and then toggle the Excluded tab, you can see pages blocked by robots.txt, allowing you to filter and assess each URL on its merits and whether it needs to be blocked.
The last place I always check is in the settings of Google Search Console, under the crawling information – this gives me an insight into where Googlebot is going and whether there are any areas of the site which need looking at.
The first step when creating a robots.txt file is to assess which areas of your website are restricted, sensitive and public-facing. An example could be a checkout process where 3D authentication takes place – something you may want to add to the robots.txt file as a disallow. For the other two categories, if you are struggling or are new to this, I’d let Google and the other search engines identify what to crawl and index; there are robust systems in place to validate this for you. The robots.txt file is good for narrowing down filter URLs if they are excessive on your site, for example. To deploy changes, copy your robots.txt file into a text editor and make the changes, then rename the old robots.txt file to something like _old_robots_file.txt so you have a backup, and deploy your new robots.txt file (sat in the root directory of your FTP).
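As a sketch of what such a disallow might look like – assuming a hypothetical /checkout/ path on your site – the rule is only a couple of lines in the robots.txt file sat in the root directory:

```text
# Hypothetical example – adjust the path to your own site structure
User-agent: *
Disallow: /checkout/
```

Anything not matched by a Disallow rule stays crawlable by default, which is why starting minimal and letting the search engines do the rest is usually the safer approach.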
When you operate within a CMS, a robust robots.txt file is usually generated for you; most SEO plugins, widgets and CMS modules that handle this file type do a great job of selecting what to allow. Some firewalls will also give you guidance on nuisance crawlers which you may want to block off completely. If you rely on CMS-specific robots.txt files, these will usually serve you well – most errors come from manual involvement or website structures which are problematic in nature.
My additional notes on robots.txt file errors:
- A robots.txt file is just that: a text file offering an indication – it is up to the crawler to honour it. Some crawlers will completely bypass it. There are no hard-coded rules when it comes to robots.txt files, so any crawler can theoretically go anywhere if it chooses to ignore the file.
- The robots.txt file should not be used to mask a badly structured website; if you are spinning up filter URLs longer than the Eiffel Tower, you need to address that problem. Setting rules around truncated URLs is often a symptom of a bad structure and one I look out for.
- Always include your XML sitemap in the robots.txt file; this is often surfaced in crawling software – I know SEMRush does, though I haven’t really tested the others.
- Proceed with caution with this file: if you are working on a live site that has traffic and rankings, get professional help. You can validate fixes with the Google robots.txt testing tool, but it only allows you to test one URL at a time and won’t guarantee that everything will be fine across the entire site.
- Be on the lookout for “This page cannot be indexed due to robots.txt” when looking at your site in the SERPs.
- Don’t forget you have the noindex and nofollow attributes at your fingertips on a page-by-page basis as an alternative to the robots.txt file; between them, they can cover individual pages and whole directories.
- If you go to Coverage in Search Console and look at valid pages, you can sometimes see pages stated as “Indexed, though blocked by robots.txt”. These pages often highlight that there is enough authority behind them for Google to index them anyway. This can happen when a URL blocked by robots.txt is linked throughout the site as a key part of the internal linking structure, or has a good external link profile.
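Since the Google tester only checks one URL at a time, you can batch-check the same rules locally before deploying. Here is a minimal sketch using Python’s standard urllib.robotparser – the rules and URLs are hypothetical examples, not taken from any real site:

```python
# Batch-test a set of URLs against robots.txt rules locally,
# rather than one at a time in the Google testing tool.
from urllib.robotparser import RobotFileParser

# Hypothetical rules – in practice, paste in your own robots.txt content
rules = """
User-agent: *
Disallow: /checkout/
Disallow: /filters/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URLs to check in one pass
urls = [
    "https://example.com/products/blue-widget",
    "https://example.com/checkout/payment",
    "https://example.com/filters/size/colour/brand",
]

for url in urls:
    status = "ALLOW" if parser.can_fetch("*", url) else "BLOCK"
    print(f"{status}  {url}")
```

Feeding it the list of URLs you pulled from the Coverage report is a quick way to sanity-check a rule change against a whole section of the site at once.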