The robots.txt file is a document placed at the root of your website that tells robots (also known as crawlers or spiders) which parts of your site they can access. When a robot wants to visit a website, it first checks the robots.txt file to see which pages the site owner doesn't want it to visit.
There are only two important commands you need to be familiar with to work with robots.txt: User-agent and Disallow. Here's an example of them in action:
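Going by the description that follows (one rule selecting all user agents and blocking the whole site), the example is the standard block-everything file:

```
User-agent: *
Disallow: /
```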
Let's take a look at what they mean. A User-agent is software acting on behalf of a user. For example, when you visit a website, your browser sends a User-agent header to the servers of the site you're trying to access.
In the above example, User-agent: * selects all user agents, meaning any robot that visits the website should respect the rule.
Disallow tells the robot which files on the website it should not visit. In the example above, it's blocking the entire website. A common use case for a file like this is on the staging environment of a website.
Let’s look at another example:
```
# Rule 1
User-agent: *
Disallow: /

# Rule 2
User-agent: Googlebot
Disallow: /search/
```
In the above example, you're familiar with the first rule now, but the second is new: it tells Googlebot, which is Google's crawler, not to visit the /search/ section of the website. Note that a crawler follows only the most specific group of rules matching its user agent, so Googlebot obeys Rule 2 and ignores Rule 1.
It's important to point out that you can't enforce your robots.txt instructions: they are essentially suggestions to robots. Most reputable crawlers, like Google's and Bing's, will obey the instructions you provide; others may not.
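You can see how a compliant crawler applies these rules using Python's standard library, which ships `urllib.robotparser` for exactly this purpose. A small sketch (the rules string and URLs here are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# robots.txt rules: every user agent is asked to stay out of /search/
rules = """User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler checks each URL against the rules before fetching it
print(parser.can_fetch("Googlebot", "https://www.example.com/search/results"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.com/about"))           # True
```

A crawler that ignores the file can still fetch `/search/` pages; nothing in the protocol prevents it.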
Another thing to consider is that different robots may interpret your syntax differently, so it's important to read any relevant documentation to make sure they will understand your file.
You might be wondering: "what's the ideal robots.txt setup for SEO?" Glad you asked. Let's run through the best practices.
The naming convention is important: your file must be called robots.txt, all lowercase, otherwise crawlers will not find it.
To control robot behaviour for your entire website, the file must be stored at the root of your website. To control crawling on http://www.example.com/, the file needs to be placed at http://www.example.com/robots.txt.
For your robots.txt file to be valid, it must contain at least one rule. There's no rule limit; your file can contain as many rules as you need.
Sitemaps are a good way to indicate which content is important for robots to crawl. Sitemaps help robots prioritise crawling and find new URLs.
Here's the robots.txt file I'm using on my site:
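The original file isn't reproduced here; the following is a sketch of a robots.txt file matching the description below, with placeholder paths and a placeholder sitemap URL:

```
# Placeholder paths — the actual disallowed sections aren't preserved
User-agent: *
Disallow: /tag/
Disallow: /category/

# Sitemap location (placeholder URL)
Sitemap: https://www.example.com/sitemap.xml
```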
It stops robots visiting two sections of my site that would be a waste of time to crawl and provide no value to me. It also gives the location of my sitemap.