How To Create And Configure Your Robots.txt File

The Robots Exclusion Standard was developed in 1994 so that website owners can advise search engines how to crawl their websites. It works in a similar way to the robots meta tag, which I discussed at great length recently. The main difference is that the robots.txt file stops search engines from seeing a page or directory, whereas the robots meta tag only controls whether it is indexed.

Placing a robots.txt file in the root of your domain lets you stop search engines indexing sensitive files and directories. For example, you could stop a search engine from crawling your images folder or from indexing a PDF file that is located in a secret folder.

Major search engines will follow the rules that you set. Be aware, however, that the rules you define in your robots.txt file cannot be enforced. Crawlers for malicious software and poor search engines might not comply with your rules and index whatever they want. Thankfully, major search engines follow the standard, including Google, Bing, Yandex, Ask, and Baidu.

In this article, I would like to show you how to create a robots.txt file and show you what files and directories you may want to hide from search engines for a WordPress website.

The Basic Rules of the Robots Exclusion Standard

A robots.txt file can be created in seconds. All you have to do is open up a text editor and save a blank file as robots.txt. Once you have added some rules to the file, save the file and upload it to the root of your domain, i.e. www.yourwebsite.com/robots.txt. Please ensure you upload robots.txt to the root of your domain, even if WordPress is installed in a subdirectory.

I recommend file permissions of 644 for the file. Most hosting setups will set up that file with those permissions after you upload the file. You should also check out the WordPress plugin WP Robots Txt, which allows you to modify the robots.txt file directly through the WordPress admin area. It will save you from having to re-upload your robots.txt file via FTP every time you modify it.

Search engines will look for a robots.txt file at the root of your domain whenever they crawl your website. Please note that a separate robots.txt file will need to be configured for each subdomain and for other protocols such as https://www.yourwebsite.com.
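
To make the point about subdomains and protocols concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt for a given page. Python is simply my choice for illustration, and the helper name robots_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt location for the protocol and host of page_url."""
    parts = urlsplit(page_url)
    # Only the scheme and host are kept; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Each protocol and subdomain combination points at its own file.
for url in (
    "http://www.yourwebsite.com/blog/my-post/",
    "https://www.yourwebsite.com/blog/my-post/",
    "http://shop.yourwebsite.com/cart?id=1",
):
    print(url, "->", robots_url(url))
```

Three different pages produce three different robots.txt locations, which is why each one needs to be configured separately.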

It does not take long to get a full understanding of the robots exclusion standard, as there are only a few rules to learn. These rules are usually referred to as directives.

The two main directives of the standard are:

  • User-agent – Defines the search engine that a rule applies to
  • Disallow – Advises a search engine not to crawl and index a file, page, or directory

An asterisk (*) can be used as a wildcard with User-agent to refer to all search engines. For example, you could add the following to your website robots.txt file to block search engines from crawling your whole website.

User-agent: *
Disallow: /

The above directive is useful if you are developing a new website and do not want search engines to index your incomplete website.
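
If you want to see how a well-behaved crawler might read the rule above, Python ships with a small robots.txt parser in its standard library (urllib.robotparser). The sketch below is only an illustration of that parser's behaviour, not of how any particular search engine is implemented; the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Feed the two lines straight to the parser instead of downloading a file.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every path on the site is off limits to every user agent.
for path in ("/", "/about/", "/images/logo.png"):
    allowed = parser.can_fetch("Googlebot", "http://www.yourwebsite.com" + path)
    print(path, "allowed" if allowed else "blocked")
```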

Some websites use the disallow directive without a forward slash to state that a website can be crawled. This allows search engines complete access to your website.

The following code states that all search engines can crawl your website. There is no reason to enter this code on its own in a robots.txt file, as search engines will crawl your website even if you do not add this code to your robots.txt file. However, it can be used at the end of a robots.txt file to refer to all other user agents.

User-agent: *
Disallow:

You can see in the example below that I have specified the images folder using /images/ and not www.yourwebsite.com/images/. This is because robots.txt uses relative paths, not absolute URL paths. The forward slash (/) refers to the root of a domain and therefore applies rules to your whole website. Paths are case sensitive, so be sure to use the correct case when defining files, pages, and directories.

User-agent: *
Disallow: /images/
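
The case sensitivity of paths is easy to check for yourself. The following sketch uses Python's built-in urllib.robotparser module as a stand-in for a crawler; it is an illustration under that assumption, and real search engines may differ in the details.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /images/",
])

site = "http://www.yourwebsite.com"  # placeholder domain
# The rule is a path relative to the root, and its case must match exactly.
for path in ("/images/logo.png", "/Images/logo.png", "/about/"):
    allowed = parser.can_fetch("Googlebot", site + path)
    print(path, "allowed" if allowed else "blocked")
```

Only /images/logo.png is blocked here. The capitalised /Images/ path slips through because it does not match the rule.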

In order to define directives for specific search engines, you need to know the name of the search engine spider (aka the user agent). Googlebot-Image, for example, will define rules for the Google Images spider.

User-agent: Googlebot-Image
Disallow: /images/

Please note that if you are defining specific user agents, it is important to list them at the start of your robots.txt file. You can then use User-agent: * at the end to match any user agents that were not defined explicitly.
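
Here is a rough sketch of how a named user agent and the * fallback interact, again using Python's urllib.robotparser as a stand-in crawler. The /private/ directory is a made-up example. Note that in this parser a crawler that has its own group follows only that group and ignores the * rules; as far as I am aware Google documents the same behaviour, but do test against the crawlers you care about.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

site = "http://www.yourwebsite.com"  # placeholder domain
checks = [
    ("Googlebot-Image", "/images/photo.jpg"),  # blocked by its own group
    ("Googlebot-Image", "/private/notes.html"),  # its own group says nothing about this path
    ("Bingbot", "/images/photo.jpg"),  # falls back to the * group
    ("Bingbot", "/private/notes.html"),  # blocked by the * group
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, site + path))
```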

It is not always search engines that crawl your website, which is why the term user agent, robot, or bot is frequently used instead of the term crawler. The number of internet bots that can potentially crawl your website is huge. The website Bots vs Browsers currently lists around 1.4 million user agents in its database and this number continues to grow every day. The list contains browsers, gaming devices, operating systems, bots, and more.

Bots vs Browsers is a useful reference for checking the details of a user agent that you have never heard of before. You can also reference User-Agents.org and User Agent String. Thankfully, you do not need to remember a long list of user agents and search engine crawlers. You just need to know the names of bots and crawlers that you want to apply specific rules to, and use the * wildcard to apply rules to all search engines for everything else.

Below are some common search engine spiders that you may want to use:

  • Bingbot – Bing
  • Googlebot – Google
  • Googlebot-Image – Google Images
  • Googlebot-News – Google News
  • Teoma – Ask

Please note that Google Analytics does not natively show search engine crawling traffic, as search engine robots do not activate JavaScript. However, Google Analytics can be configured to show information about the search engine robots that crawl your website. Log file analyzers that are provided by most hosting companies, such as Webalizer and AWStats, do show information about crawlers. I recommend reviewing these stats for your website to get a better idea of how search engines are interacting with your website content.

Non Standard Robots.txt Rules

User-agent and Disallow are supported by all crawlers, though a few more directives are available. These are known as non-standard as they are not supported by all crawlers. However, in practice, most major search engines support these directives too.

  • Allow – Advises a search engine that it can index a file or directory
  • Sitemap – Defines the location of your website sitemap
  • Crawl-delay – Defines the number of seconds between requests to your server
  • Host – Advises the search engine of your preferred domain if you are using mirrors

It is not necessary to use the allow directive to advise a search engine to crawl your website, as it will do that by default. However, the rule is useful in certain situations. For example, you can define a directive that blocks all search engines from crawling your website, but allow a specific search engine, such as Bing, to crawl. You could also use the directive to allow crawling of a particular file or directory, even if the rest of your website is blocked.

User-agent: Googlebot-Image
Disallow: /images/
Allow: /images/background-images/
Allow: /images/logo.png
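
When an Allow rule and a Disallow rule both match a URL, the crawler has to pick one. The Python sketch below assumes the "longest matching path wins, and Allow wins a tie" behaviour that Google documents and that RFC 9309 describes; other crawlers may simply use the first rule that matches, so treat this as one possible interpretation. The function name is_allowed is hypothetical, and wildcards are deliberately ignored.

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled.

    rules is a list of (directive, prefix) pairs from a single user agent
    group, for example ("Disallow", "/images/").
    """
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        # An empty value matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longest matching prefix wins; on a tie, Allow is preferred.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


googlebot_image_rules = [
    ("Disallow", "/images/"),
    ("Allow", "/images/background-images/"),
    ("Allow", "/images/logo.png"),
]

for path in (
    "/images/photo.jpg",
    "/images/background-images/sky.jpg",
    "/images/logo.png",
    "/about/",
):
    print(path, "allowed" if is_allowed(path, googlebot_image_rules) else "blocked")
```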

Please note that this code:

User-agent: *
Allow: /

Produces the same outcome as this code:

User-agent: *
Disallow:

As I mentioned previously, you would never use the allow directive to advise a search engine to crawl a website as it does that by default.

Interestingly, the allow directive was first mentioned in a draft of robots.txt in 1996, but was not adopted by most search engines until several years later.

Ask.com uses “Disallow:” to allow crawling of certain directories, while Google and Bing both take advantage of the allow directive to ensure that certain areas of their websites are still crawlable. If you view their robots.txt files, you can see that the allow directive is always used for subdirectories, files, and pages under directories that are hidden. As such, the allow directive should be used in conjunction with the disallow rule.

User-agent: Bingbot
Disallow: /files
Allow: /files/eBook-subscribe.pdf

Multiple directives can be defined for the same user agent. Therefore, you can expand your robots.txt file to specify a large number of directives. It just depends on how specific you want to be about what search engines can and cannot do (note that there is a limit to how many lines you can add, but I will speak about this later).

Defining your sitemap will help search engines locate your sitemaps quicker. This, in turn, helps them locate your website content and index it. You can use the Sitemap directive to define multiple sitemaps in your robots.txt file.

Note that it is not necessary to define a user agent when you specify where your sitemaps are located. Also bear in mind that your sitemap should support the rules you specify in your robots.txt file. That is, there is no point listing pages in your sitemap for crawling if your robots.txt file disallows crawling of those pages.

The Sitemap directive can be placed anywhere in your robots.txt file. Generally, website owners list their sitemaps at the beginning or near the end of the robots.txt file.

Sitemap: http://www.yourwebsite.com/sitemap_index.xml
Sitemap: http://www.yourwebsite.com/category-sitemap.xml
Sitemap: http://www.yourwebsite.com/page-sitemap.xml
Sitemap: http://www.yourwebsite.com/post-sitemap.xml
Sitemap: http://www.yourwebsite.com/forum-sitemap.xml
Sitemap: http://www.yourwebsite.com/topic-sitemap.xml
Sitemap: http://www.yourwebsite.com/post_tag-sitemap.xml
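
If you want to confirm that your Sitemap lines can be read regardless of where they sit in the file, the sketch below pulls them out with Python's urllib.robotparser. It assumes Python 3.8 or newer, which is when the site_maps() method was added; the file contents are a shortened, made-up example.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: http://www.yourwebsite.com/sitemap_index.xml
Sitemap: http://www.yourwebsite.com/post-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns None when no Sitemap lines were found.
sitemaps = parser.site_maps() or []
for sitemap in sitemaps:
    print(sitemap)
```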

Some search engines support the crawl delay directive. This allows you to dictate the number of seconds between requests on your server, for a specific user agent.

User-agent: teoma
Crawl-delay: 15
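
For what it is worth, this is roughly how a polite crawler written in Python could read that value. The sketch uses the crawl_delay() method of urllib.robotparser (available from Python 3.6); how a real search engine honours the delay is up to that search engine.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: teoma",
    "Crawl-delay: 15",
])

# The delay is only defined for the user agent named in the file.
teoma_delay = parser.crawl_delay("teoma")
other_delay = parser.crawl_delay("Googlebot")
print("teoma:", teoma_delay)
print("Googlebot:", other_delay)
```

A crawler would then wait that many seconds between requests; a value of None simply means that no delay was specified for that user agent.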

Note that Google does not support the crawl delay directive. To change the crawl rate of Google’s spiders, you need to log in to Google Webmaster Tools and click on Site Settings.

Webmaster Tools Site Settings can be selected via the cog icon.

You will then be able to change the crawl delay from 500 seconds to 0.5 seconds. There is no way to enter a value directly; you need to choose the crawl rate by sliding a selector. Additionally, there is no way to set different crawl rates for each Google spider. For example, you cannot define one crawl rate for Google Images and another for Google News. The rate you set is used for all Google crawlers.

Unfortunately, one crawl rate is applied to all Google crawlers.

A few search engines, most notably the Russian search engine Yandex, let you use the host directive. This allows a website with multiple mirrors to define the preferred domain. This is particularly useful for large websites that have set up mirrors to handle large bandwidth requirements due to downloads and media.

I have never used the host directive on a website myself, but apparently you need to place it at the bottom of your robots.txt file after the crawl delay directive. Remember to do this if you use the directive in your website robots.txt file.

Host: www.mypreferredwebsite.com

As you can see, the rules of the robots exclusion standard are straightforward. Be aware that if the rules you set out in your robots.txt file conflict with the rules you define using a robots meta tag, the more restrictive rule will be applied by the search engine. This is something I spoke about recently in my post “How To Stop Search Engines From Indexing Specific Posts And Pages In WordPress”.

Advanced Robots.txt Techniques

The larger search engines, such as Google and Bing, support the use of wildcards in robots.txt. These are very useful for denoting files of the same type.

An asterisk (*) can be used to match occurrences of a sequence. For example, the following code will block a range of images that have logo at the beginning.

User-agent: *
Disallow: /images/logo*.jpg

The code above would disallow images within the images folder such as logo.jpg, logo1.jpg, logo2.jpg, logonew.jpg, and logo-old.jpg.
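
One way to reason about a wildcard rule is to translate it into a regular expression. The Python sketch below does exactly that for the * wildcard. It is my own approximation of the matching that Google and Bing describe, not their actual code, and the function name rule_matches is hypothetical.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt pattern with * wildcards matches path."""
    # Escape regex metacharacters, then turn each escaped * back into ".*".
    regex = re.escape(pattern).replace(r"\*", ".*")
    # re.match only anchors at the start, which mirrors prefix matching.
    return re.match(regex, path) is not None


rule = "/images/logo*.jpg"
for path in (
    "/images/logo.jpg",
    "/images/logo1.jpg",
    "/images/logonew.jpg",
    "/images/logo-old.jpg",
    "/images/banner.jpg",
    "/images/logo.png",
):
    print(path, "blocked" if rule_matches(rule, path) else "not matched")
```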

Be aware that the asterisk will do nothing if it is placed at the end of a rule. For example, Disallow: /about.html* is the same as Disallow: /about.html. You could, however, use the code below to block content in any directory that starts with the word test. This would hide directories named test, testsite, test-123 etc.

User-agent: *
Disallow: /test*/

Wildcards are useful for stopping search engines from crawling files of a particular type or pages that have a specific prefix.

For example, to stop search engines from crawling all of your PDF documents within your downloads folder, you could use this code:

User-agent: *
Disallow: /downloads/*.pdf

And you could stop search engines from crawling your wp-admin, wp-includes, and wp-content directories, by using this code:

User-agent: *
Disallow: /wp-*/

Wildcards can be used in multiple locations in a directive. In the example below, you can see that I have used a wildcard to denote any image that begins with holiday. I have replaced the year and month directory names with wildcards so that any image is included; regardless of the month and year it was uploaded.

User-agent: *
Disallow: /wp-content/uploads/*/*/holiday*.jpg

You can also use wildcards to refer to part of the URL that contains a certain character or series of characters. For example, you can block any URL that contains a question mark (?) by using this code:

User-agent: *
Disallow: /*?*

The following command would stop search engines from crawling any URL that begins with a quote:

User-agent: *
Disallow: /"

One thing that I have not touched upon until now is that robots.txt uses prefix matching. What this means is that using Disallow: /dir/ would block search engines from a directory named /dir/ and from any path beneath it, such as /dir/directory2/ and /dir/test.html.

This also applies to file names. Consider the following command for robots.txt:

User-agent: *
Disallow: /page.php

As you know, the above code would stop search engines from crawling page.php. However, it would also stop search engines from crawling /page.php?id=25 and /page.php?id=2&ref=google. In short, robots.txt will block any extension to the URL you block. So blocking www.yourwebsite.com/123 will also block www.yourwebsite.com/123456 and www.yourwebsite.com/123abc.
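
Prefix matching boils down to a "starts with" test. The short Python sketch below applies that test to the examples above; the function name blocked_by_prefix is hypothetical, and wildcards are left out to keep the idea clear.

```python
def blocked_by_prefix(rule_path, url_path):
    """Return True if url_path begins with the path given in a Disallow rule."""
    return url_path.startswith(rule_path)


examples = [
    ("/page.php", "/page.php"),
    ("/page.php", "/page.php?id=25"),
    ("/page.php", "/page.php?id=2&ref=google"),
    ("/123", "/123456"),
    ("/123", "/123abc"),
    ("/dir/", "/dir/test.html"),
    ("/dir/", "/directory/"),  # does not start with /dir/ so it is untouched
]
for rule_path, url_path in examples:
    result = "blocked" if blocked_by_prefix(rule_path, url_path) else "not matched"
    print("Disallow:", rule_path, "|", url_path, "|", result)
```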

In many cases, this is the desired effect; however, it is sometimes better to specify the end of a path so that no other URLs are affected. To do this, you can use the dollar sign ($) wildcard. It is frequently used when a website owner wants to block a particular file type.

In my previous example of blocking page.php, we can ensure that only page.php is blocked by adding the $ wildcard at the end of the rule.

User-agent: *
Disallow: /page.php$

And we can use it to ensure that only the /dir/ directory is blocked, not /dir/directory2/ or /dir/test.html.

User-agent: *
Disallow: /dir/$
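
To see what the $ wildcard changes, here is a small Python sketch that treats a trailing $ as "the URL must end here" and treats * as "any run of characters". It is my own approximation of the behaviour described above rather than any search engine's real matcher, and the function name and the /downloads/guide.pdf URLs are made up for the demonstration.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt pattern (with * and a trailing $) matches path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]  # drop the $ before building the regex
    # Escape regex metacharacters, then turn each escaped * back into ".*".
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        # With $, the whole path must match, so nothing may follow the pattern.
        return re.fullmatch(regex, path) is not None
    # Without $, a match at the start of the path is enough (prefix matching).
    return re.match(regex, path) is not None


examples = [
    ("/page.php", "/page.php?id=25"),
    ("/page.php$", "/page.php?id=25"),
    ("/page.php$", "/page.php"),
    ("/dir/$", "/dir/"),
    ("/dir/$", "/dir/test.html"),
    ("/downloads/*.pdf", "/downloads/guide.pdf?ref=home"),
    ("/downloads/*.pdf$", "/downloads/guide.pdf?ref=home"),
    ("/downloads/*.pdf$", "/downloads/guide.pdf"),
]
for pattern, path in examples:
    print(pattern, "|", path, "|", "match" if rule_matches(pattern, path) else "no match")
```

The last three rows hint at one situation where the $ makes a difference for file types: a URL that carries a query string after the extension matches the unanchored rule but not the anchored one.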

A lot of website owners use the $ wildcard to specify what types of images Google Images can crawl:

User-agent: Googlebot-Image
Allow: /*.gif$
Allow: /*.png$
Allow: /*.jpeg$
Allow: /*.jpg$
Allow: /*.ico$
Allow: /images/

My previous examples of blocking PDF and JPG files did not use a $ wildcard. I have always been under the impression that it was not necessary to use it, as something like a PDF, Word document, or image file, is not going to have any suffix to the URL. That is, .pdf, .doc, or .png, would be the absolute end of the URL.

However, for many website owners, it is common practice to attach the $ wildcard. During my research for this article, I was unable to find any documentation that states why this is necessary. If any of you are aware of the technical reason for doing it, please let me know and I will update this article 🙂

Be aware that wildcards are not supported by all crawlers, so you may find that some search engines will not comply with the rules you define. Search engines that do not support wildcards will treat * as if it is a character you want to allow or disallow.

Google, Bing, and Ask do actively support wildcards. If you view the Google robots.txt file, you will see that Google uses wildcards itself.

Commenting Your Robots.txt Code

It is in your best interest to get into the habit of documenting the code in your robots.txt file. This will help you quickly understand the rules you have added when you refer to it later.

You can add comments to your robots.txt file using the hash symbol #:

# Block Google Images from crawling the images folder

User-agent: Googlebot-Image
Disallow: /images/

A comment can be placed at the start of a line or after a directive:

User-agent: Googlebot-Image # The Google Images crawler
Disallow: /images/ # Hide the images folder
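
If you are worried that comments might interfere with your rules, you can check how a parser treats them. The sketch below runs the commented example through Python's urllib.robotparser, which discards everything from the # to the end of the line. Other parsers should behave the same way, but that is an assumption worth testing.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "# Block Google Images from crawling the images folder",
    "User-agent: Googlebot-Image # The Google Images crawler",
    "Disallow: /images/ # Hide the images folder",
])

site = "http://www.yourwebsite.com"  # placeholder domain
# The comments are ignored, so the rule still applies as written.
print(parser.can_fetch("Googlebot-Image", site + "/images/photo.jpg"))
print(parser.can_fetch("Googlebot-Image", site + "/about/"))
```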

I encourage you to get into the habit of commenting your robots.txt file from the start as it will help you understand the rules you create when you review the file at a later date.

What to Place in a WordPress Robots.txt File

The great thing about the robots exclusion standard is that you can view the robots.txt file of any website on the internet (as long as they have uploaded one). All you have to do is visit www.websitename.com/robots.txt.

If you check out the robots.txt file of some WordPress websites, you will see that website owners define different rules for search engines.

Elegant Themes currently uses the following code in their robots.txt file:

User-agent: *
Disallow: /preview/
Disallow: /api/
Disallow: /hostgator

As you can see, Elegant Themes just blocks three directories from being crawled and indexed.

WordPress co-founder Matt Mullenweg uses the following code on his personal blog:

User-agent: *

User-agent: Mediapartners-Google*

User-agent: *
Disallow: /dropbox
Disallow: /contact
Disallow: /blog/wp-login.php
Disallow: /blog/wp-admin

Matt blocks a dropbox folder and a contact folder. He also blocks the WordPress login page and the WordPress admin area.

WordPress.org has the following in their robots.txt file:

User-agent: *
Disallow: /search
Disallow: /support/search.php
Disallow: /extend/plugins/search.php
Disallow: /plugins/search.php
Disallow: /extend/themes/search.php
Disallow: /themes/search.php
Disallow: /support/rss
Disallow: /archive/

Eight different rules are defined in WordPress.org’s robots.txt file and six of these rules refer to search pages. Their RSS page is also hidden, as is an archive page that does not even exist (which suggests it has not been updated in years).

The most interesting thing about the WordPress.org robots.txt file is that it does not follow the suggestions they advise for adding to a robots.txt file. They advise the following:

Sitemap: http://www.example.com/sitemap.xml

# Google Image
User-agent: Googlebot-Image
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google

# digg mirror
User-agent: duggmirror
Disallow: /

# global
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/cache/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?
Allow: /wp-content/uploads/

The above code has been reproduced on thousands of blogs as the best rules to add to your robots.txt file. The code was originally published on WordPress.org several years ago and has remained unchanged. The fact that the suggested code disallows the spider of Digg illustrates how old it is (it is, after all, several years since anyone worried about “The Digg Effect”).

However, the principles of the robots exclusion standard have not changed since the page was first published. It is still recommended that you stop search engines from crawling important directories such as wp-admin, wp-includes, and your plugin, themes, and cache directories. It is best to hide your cgi-bin and your RSS feed too.

Yoast noted in an article two years ago that it is better not to hide your website feed as it acts as a sitemap for Google.

“Blocking /feed/ is a bad idea because an RSS feed is actually a valid sitemap for Google. Blocking it would prevent Google from using that to find new content on your site.” – Yoast

As Jeff Starr correctly pointed out, you do not need to use the RSS feed as a sitemap if you have a functioning sitemap on your website already.

“Sure that makes sense if you don’t have a sitemap 😉 Otherwise, keeping your feed content out of search results keeps juice focused on your actual web pages.” – Jeff Starr

Yoast takes a minimal approach to robots.txt file. Two years ago, he suggested the following to WordPress users:

User-Agent: *
Disallow: /wp-content/plugins/

His current robots.txt file has a few additional lines, though by and large it remains the same as the one he previously suggested. Yoast’s minimal approach stems from his belief that many important pages should instead be hidden from search engine results by using a <meta name="robots" content="noindex, follow"> tag.

WordPress developer Jeff Starr, author of the amazing Digging Into WordPress, takes a different approach.

His current robots.txt file looks like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /comment-page-
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /blackhole/
Disallow: /mint/
Disallow: /feed/
Allow: /tag/mint/
Allow: /tag/feed/
Allow: /wp-content/images/
Allow: /wp-content/online/
Sitemap: http://perishablepress.com/sitemap.xml

In addition to blocking wp-admin, wp-content, and wp-includes, Jeff stops search engines from seeing trackbacks and the WordPress xmlrpc.php file (a file that lets you publish articles to your blog via a blog client).

Comment pages are also blocked. If you break your comments into pages, then you may want to consider blocking additional comment pages too.

The option to break comments into pages can be found in your WordPress discussion settings i.e. www.yourwebsite.com/wp-admin/options-discussion.php.

Jeff also stops crawlers from seeing his RSS feed, a blackhole directory he set up for bad bots, and a private directory named mint. Jeff makes a point of allowing tags for mint and feed to be seen, as well as his images and a directory named online that he uses for demos and scripts. Lastly, Jeff defines the location of his sitemap for search engines.

What to Place in Your Robots.txt File

I know that many of you reading this article simply want the code to place in your robots.txt file so that you can move on. However, it is important that you understand the rules that you specify for search engines. It is also important to recognise that there is no agreed-upon standard on what to place in the robots.txt file.

We have seen this above with the different approaches of WordPress developer Jeff Starr and Joost de Valk (AKA Yoast), two people who are rightfully recognised as WordPress experts. We have also seen that the advice given on WordPress.org has not been updated in several years and that their own robots.txt file does not follow their own suggestion, instead focusing on blocking search functionality.

I have changed the contents of my blog’s robots.txt files many times over the years. My current robots.txt file took inspiration from Jeff Starr’s robots.txt suggestions, AskApache’s suggestions, and advice from several other developers that I respect and trust.

At the moment, my robots.txt file looks like this:

# Disallowed and allowed directories and files

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /comment-page-
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-content/uploads/

# Define website sitemaps

Sitemap: http://www.kevinmuldoon.com/sitemap_index.xml
Sitemap: http://www.kevinmuldoon.com/post-sitemap.xml
Sitemap: http://www.kevinmuldoon.com/page-sitemap.xml
Sitemap: http://www.kevinmuldoon.com/category-sitemap.xml
Sitemap: http://www.kevinmuldoon.com/author-sitemap.xml

My robots.txt file stops search engines from crawling the important directories that I discussed earlier. I also make a point of allowing crawling of my uploads folder so that images can get indexed.

I have always considered the code in my robots.txt file flexible. If new information arises that shows that I should change the code I am using, I will happily modify the file. Likewise, if I add new directories to my website or find that a page or directory is being incorrectly indexed, I will modify the file. The key is to evolve the robots.txt file as and when needed.

I encourage you to choose one of the above examples of robots.txt as a starting point and then adapt it for your own website. Remember, it is important that you understand all the directives that you add to your robots.txt file. The Robots Exclusion Standard can be used to stop search engines crawling files and directories that you do not want indexed; however, if you enter the wrong code, you may end up blocking important pages from being crawled.

The Maximum Size of a Robots.txt File

According to an article on AskApache, you should never use more than 200 disallow lines in your robots.txt file. Unfortunately, they do not provide any evidence in the article that states why this is the case.

In 2006, some members of Webmaster World reported seeing a message from Google that the robots.txt file should be no more than 5,000 characters. This would probably work out to be around 200 lines if we assume an average of 25 characters per line, which is probably where AskApache got this figure of 200 disallow lines from.

Google’s John Mueller clarified the issue a few years later. On Google+, he said:

“If you have a giant robots.txt file, remember that Googlebot will only read the first 500kB. If your robots.txt is longer, it can result in a line being truncated in an unwanted way. The simple solution is to limit your robots.txt files to a reasonable size.”

Be sure to check the size of your robots.txt file if it has a couple of hundred lines of text. If the file is larger than 500kB, you will have to reduce the size of the file or you may end up with an incomplete rule being applied.
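
Checking the size takes only a few lines of Python. In the sketch below, the function name robots_size_report is hypothetical, and I have assumed that "500kB" means 500 × 1,024 bytes; adjust the limit if you read the figure differently.

```python
# Assumption: the 500kB figure quoted above is treated as 500 * 1024 bytes.
SIZE_LIMIT_BYTES = 500 * 1024


def robots_size_report(text, limit=SIZE_LIMIT_BYTES):
    """Summarise the size of a robots.txt file held in a string."""
    size = len(text.encode("utf-8"))  # size on disk, not character count
    return {
        "bytes": size,
        "lines": len(text.splitlines()),
        "within_limit": size <= limit,
    }


small = "User-agent: *\nDisallow: /\n"
huge = "Disallow: /x\n" * 50000  # an artificially bloated file

print(robots_size_report(small))
print(robots_size_report(huge))
```

To check a real file, read it into a string first, for example with open("robots.txt", encoding="utf-8").read(), and pass the result to the function.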

Testing Your Robots.txt File

There are a number of ways in which you can test your robots.txt file. One option is to use the Blocked URLs feature, which can be found under the Crawl section in Google Webmaster Tools.

Log in to Google Webmaster Tools.

The tool will display the contents of your website’s robots.txt file. The code that is displayed comes from the last copy of robots.txt that Google retrieved from your website. Therefore, if you updated your robots.txt file since then, the current version might not be displayed. Thankfully, you can enter any code you want into the box. This allows you to test new robots.txt rules, though remember that this is only for testing purposes i.e. you still need to update your actual website robots.txt file.

You can test your robots.txt code against any URL you wish. The Googlebot crawler is used to test your robots.txt file by default. However, you can also choose from four other user agents: Google-Mobile, Google-Image, Mediapartners-Google (Adsense), and Adsbot-Google (Adwords).

The Blocked URLs tool is useful for testing different robots.txt rules.

The results will highlight any errors in your robots.txt file; such as linking to a sitemap that does not exist. It is a great way of seeing any mistakes that need correcting.

Check the results of your robots.txt file to see if anything needs to be changed.

Another useful tool is the Frobee Robots.txt Checker. It will highlight any errors that are found and show if there are any restrictions on access.

Frobee’s Robots.txt Checker is quick and easy to use.

Another robots.txt analyzer I like can be found on Motoricerca. It will highlight any commands that you have entered that are not supported or not configured correctly.

A user friendly robots.txt checker that checks every line of your robots.txt file.

It is important to check the code in your robots.txt file using a robots.txt analyzer before you add the code to your website robots.txt file. This will ensure that you have not entered any lines incorrectly.

Final Thoughts

The Robots Exclusion Standard is a powerful tool for advising search engines what to crawl and what not to crawl. It does not take long to understand the basics of creating a robots.txt file; however, if you need to block a series of URLs using wildcards, it can get a little confusing. So be sure to use a robots.txt analyzer to ensure that the rules have been set up in the way that you want them.

Also remember to upload robots.txt to the root of your domain and be sure to adjust the code in your own robots.txt file accordingly if WordPress has been installed in a subdirectory. For example, if you installed WordPress at www.yourwebsite.com/blog/, you would disallow the path /blog/wp-admin/ instead of /wp-admin/.
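
If you are adapting one of the example files for a subdirectory install, the path changes can be made mechanically. The Python sketch below is one hypothetical way of doing it (the function name prefix_rules is my own invention); it only touches Allow and Disallow lines whose value starts with a forward slash and leaves everything else, including Sitemap lines, alone.

```python
def prefix_rules(lines, subdirectory):
    """Prefix the path of every Allow and Disallow line with subdirectory."""
    cleaned = subdirectory.strip("/")
    prefix = "/" + cleaned if cleaned else ""
    adjusted = []
    for line in lines:
        directive, separator, value = line.partition(":")
        name = directive.strip()
        path = value.strip()
        if separator and name.lower() in ("allow", "disallow") and path.startswith("/"):
            adjusted.append(f"{name}: {prefix}{path}")
        else:
            adjusted.append(line)  # user agents, sitemaps, comments, blank lines
    return adjusted


original = [
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Allow: /wp-content/uploads/",
    "Sitemap: http://www.yourwebsite.com/sitemap.xml",
]
for line in prefix_rules(original, "blog"):
    print(line)
```

Do review the output by hand afterwards, as rules that refer to paths outside the WordPress directory should not be prefixed.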

You may be surprised to hear that search engines can still list a blocked URL if other websites link to that page. Matt Cutts has explained how this can occur in a video.

I hope you have found this tutorial on creating a robots.txt file for your website useful. I recommend creating a robots.txt file for your own website and testing the results through an analyzer to help you get a feel for how things work. Practice makes perfect 🙂

Last but not least, be sure to subscribe to the Elegant Themes blog in order to get updates of our latest articles 🙂

Article thumbnail image by grop / shutterstock.com