Learn About Robots.txt



What is a Robots.txt file?

A Robots.txt file is a document that you place at the root of your website to tell robots (also known as crawlers or spiders) which parts of your site they can access. When a robot wants to visit a website, it first checks the Robots.txt file to see which pages the website owner does not want it to visit.

Robots.txt examples

There are only two important commands you need to be familiar with to work with Robots.txt files: User-agent and Disallow. Here’s an example of them in action:

User-agent: *
Disallow: /

Let’s take a look at what they mean. A User-agent is software acting on behalf of a user. For example, when you visit a website, your browser sends a User-agent string to the servers of the website you’re trying to access.

In the above example, User-agent: * selects all user agents, meaning any robot that visits the website should respect the rule.

The Disallow command tells the robot which files on the website it should not visit. In the example above, it’s blocking the entire website. A common use case for this file is the staging environment of a website.
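
For contrast, leaving the Disallow value empty tells robots they can access everything, so this illustrative file blocks nothing at all:

User-agent: *
# An empty Disallow value means no part of the site is blocked
Disallow: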

Let’s look at another example:

# Rule 1
User-agent: *
Disallow: /user-accounts/
# Rule 2
User-agent: Googlebot
Disallow: /search/

In the above example, you’re familiar with the first rule now, but the second is new. Here we’re telling Googlebot, which is Google’s crawler, not to visit the /search/ section of the website.

Robots.txt Limitations

It’s important to point out that you can’t enforce your Robots.txt instructions; they are basically suggestions to robots. Most reputable crawlers, like Google and Bing, will obey the instructions you provide, but others may not.

Another thing to consider is that different robots may interpret your syntax differently, so it’s important to read any relevant documentation to make sure they will understand your file.
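
For example, some major crawlers such as Googlebot and Bingbot understand wildcard patterns like the one below, but other robots may not, so check each crawler’s documentation (the .pdf pattern here is purely illustrative):

User-agent: *
# Blocks any URL ending in .pdf, for crawlers that support the * and $ wildcards
Disallow: /*.pdf$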

How to create a Robots.txt file

You might be wondering: “What’s the ideal Robots.txt file setup for SEO?” Glad you asked. Let’s run through the best practices.

It must be called robots.txt

The naming convention is important: your file must be called robots.txt, otherwise crawlers will not find it.

Avoid: misspellings such as robot.txt, or saving it as a different file type, for example robots.html.

Robots.txt must be stored on the root

To control robots’ behaviour across your entire website, the file must be stored at the root of your website. To control crawling on http://www.example.com/, the file needs to be placed at http://www.example.com/robots.txt.

Avoid: storing the file in a subfolder, for example http://www.example.com/folder/robots.txt. Crawlers only look for the file at the root, so this will not give you control of your website.

It must contain one or more rules

For your Robots.txt file to be valid, it must contain at least one rule. There’s no rule limit, and your file can contain as many rules as you need.

Avoid: assuming the rules aren’t case-sensitive, because they are! Follow the documentation carefully.
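
As a quick illustration, the path in a Disallow rule is case-sensitive, so the hypothetical rule below blocks /Photos/ but not /photos/:

User-agent: *
# Blocks /Photos/ only; /photos/ (lowercase) can still be crawled
Disallow: /Photos/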

Include the location of your sitemap

Sitemaps are a good way to indicate which content is important for robots to crawl. Sitemaps help robots prioritise crawling and find new URLs.

Avoid: using anything other than the absolute URL of your sitemap’s location, because robots won’t assume whether it’s http or https, or www vs non-www.
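
Using example.com as a placeholder, a correctly referenced sitemap spells out the full absolute URL, including the protocol and subdomain:

# Absolute URL: the protocol (https) and host (www) are explicit
Sitemap: https://www.example.com/sitemap.xml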

Here’s the robots.txt file I’m using on my site:

User-agent: *
Disallow: /wp-admin/
Disallow: /thanks/

Sitemap: https://www.tomdonohoe.com.au/sitemap_index.xml

It stops robots from visiting two sections of my site that would be a waste of time to crawl and provide no value to me. It also includes the location of my sitemap.
