In the last few weeks, I have been busy meeting with several prospects to discuss their online marketing efforts and how I could assist from an SEO and content marketing perspective. Some know nothing about SEO, and some know just enough to think they are experts. Personally, I like to do a quick SEO audit of the prospective client’s website before I jump into the meeting, both to equip myself and to leave them with some gold nuggets during the meeting.

4 out of the 5 websites that I briefly looked at had used the meta robots control tags incorrectly. The majority of the owners didn’t understand the value of controlling robots on their website. In this post, I will try to explain what these tags are and how to use them accurately on your website.

Why do you need to control robots?

By default, search engine crawlers are programmed to crawl and index everything in their path, whether they arrive through internal links or external links. I don’t know about you, but personally I would like to be able to control how the engines interact with my website. It allows for easy and controlled diagnosis if a problem arises.

Depending on the setup of a website, by controlling how the crawlers behave on the website, you can eliminate duplicate content problems and improve PageRank distribution (potentially getting a better crawl budget).

What are meta robots tags?

Similar to the robots.txt file, these meta tags are used to control how search engine crawlers behave on the website. Meta robots tags reside in the <head> section of a webpage. In the example below, it is the last meta tag:

<html>
<head>
<title>This is an example</title>
<meta name="description" content="This is an example of bla bla bla…" />
<meta name="robots" content="noindex,nofollow" />
</head>
….

It is important to note that these tags are directives for search engine crawlers, not hints. A directive basically means that a well-behaved crawler, such as those run by the major search engines, has no choice but to follow the direction of these tags.
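If you want to check what a page is actually telling the crawlers, a few lines of Python will do. This is only a minimal sketch using the standard library's html.parser; the function name find_meta_robots and the sample page are my own inventions for illustration, not part of any SEO tool.

```python
from html.parser import HTMLParser


class MetaRobotsFinder(HTMLParser):
    """Remembers the content of the first <meta name="robots"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.content is not None:
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name == "robots":
            self.content = attributes.get("content") or ""


def find_meta_robots(html):
    """Return the meta robots content string, or None if the tag is absent."""
    finder = MetaRobotsFinder()
    finder.feed(html)
    return finder.content


# Hypothetical page, modelled on the example above.
sample_page = """
<html>
<head>
<title>This is an example</title>
<meta name="description" content="This is an example" />
<meta name="robots" content="noindex,nofollow" />
</head>
</html>
"""

print(find_meta_robots(sample_page))
```

A return value of None simply means the page has no meta robots tag at all.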

Anatomy of the Tag

There are 2 attributes that make up the meta robots tag. Let’s decipher the meta tag:

“name=” attribute

This is where you specify the name for the meta data. Here are different values (bots) that you can use:

  • robots – most commonly used
  • googlebot – Google search engine crawler
  • googlebot-news – Google News search engine crawler
  • msnbot – Bing search engine crawler
  • More bots here (if you are interested)

“content=” attribute

This is where you provide specific commands to control how the bots behave. Below are the most commonly used values:

  • Index
    Search engines are free to index the content on the webpage. This has the same behaviour as not having the meta robots tag present on the page.
  • Noindex
    Search engines are not allowed to include the page and content in their index.
  • Follow
    Search engines can proceed with their crawl through the links that are on the webpage.
  • Nofollow
    Search engines should not crawl through any links on the webpage. This is not the same as the rel=nofollow attribute, which applies to a single link rather than the whole page.
  • Noarchive
    Search engines should not show the cached copy of the webpage. I cannot think of any situation where anyone would want to use this.
  • Noodp
    Search engines are blocked from using descriptions from the Open Directory Project in the SERPs.

Note: Values of these attributes are not case sensitive. If you have different cApiTaLisatiOns, it’s OK…don’t panic.

You can use these as standalone values or use them in combination separated by commas. For example:

<meta name="robots" content="noodp" />
<meta name="robots" content="index,follow" />
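To make the "comma separated, not case sensitive" rule concrete, here is a rough sketch of how a content value could be broken into individual commands. The function name parse_robots_content is hypothetical, and this is not how any particular search engine is known to implement it.

```python
def parse_robots_content(content):
    """Split a content value into a set of lower-case commands.

    Whitespace around commas is ignored and capitalisation does not matter.
    """
    return {token.strip().lower() for token in content.split(",") if token.strip()}


print(parse_robots_content("noodp"))
print(parse_robots_content("Index, FOLLOW"))
```

Using a set means the order of the values does not matter, so "follow,index" and "index,follow" are treated as the same thing.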

Meta robots tags in action

For simplicity’s sake, I am only going to cover the index, noindex, follow and nofollow values, which are probably the most commonly used. These values can be combined, separated by commas, to achieve different behaviours. Here are the different combinations:

<meta name="robots" content="index, follow" />

This behaves exactly the same as not having this tag on the page at all. So, if you want search engine crawlers to crawl and index the entire website, you can ignore the use of this tag.

[Illustration: example of “index, follow”]

<meta name="robots" content="noindex, follow" />

With this tag, we are basically telling the crawlers NOT to index the page but to follow the links that are on the page. There are a few scenarios where this combination is applicable, for example on blog archives or internal search result pages (yes, internal search result pages do get indexed). Here’s an example of a search result page that has used this tag. Very handy post-Panda!

[Illustration: example of “noindex, follow”]

<meta name="robots" content="index, nofollow" />

In this instance, we are telling the search engines to index the page but NOT to follow the links that are on the page. This is generally used when you want the contents of the page to get indexed, but you do not want to pass on any link value through the links on the page and you want to stop crawlers from continuing their crawl. I rarely use this combination, but when I asked the SEO community, there seemed to be a few uses for it.

[Illustration: example of “index, nofollow”]

<meta name="robots" content="noindex, nofollow" />

Here, we are basically telling search engines NOT to index the page and NOT to follow the links on the page. This is most commonly used to block crawlers from indexing a test or staging website, coupled with the robots.txt file (of course).

[Illustration: example of “noindex, nofollow”]
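The four combinations above boil down to two yes/no questions: may the page be indexed, and may its links be followed? Here is a small sketch of that logic. The function name crawler_behaviour is made up for this post, and I am assuming that when a tag contains conflicting values the restrictive one wins, which is my simplification rather than a documented rule.

```python
def crawler_behaviour(content=None):
    """Return (may_index, may_follow) for a meta robots content value.

    Pass None when the page has no meta robots tag at all.
    Assumption: a restrictive value (noindex, nofollow) beats a permissive one.
    """
    tokens = set()
    if content:
        tokens = {token.strip().lower() for token in content.split(",")}
    may_index = "noindex" not in tokens
    may_follow = "nofollow" not in tokens
    return may_index, may_follow


for value in (None, "index, follow", "noindex, follow",
              "index, nofollow", "noindex, nofollow"):
    print(value, crawler_behaviour(value))
```

Notice that a missing tag and "index, follow" give the same answer, which is why you can leave the tag out when you want everything crawled and indexed.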

Common myths that you should know about

Myth #1: If I just use robots.txt to disallow certain pages, search engines will not find these pages at all.

WRONG! Please take external links into consideration. Search engines can still find these pages if they are linked to from other sources. In this case, only the URL will be indexed.
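It helps to see exactly what question robots.txt answers. The sketch below uses Python's standard urllib.robotparser with a made-up robots.txt and made-up example.com URLs. All it can tell a crawler is whether a URL may be fetched; it has nothing to say about whether that URL may appear in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows one folder for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# robots.txt only answers "may I fetch this URL?"
blocked = parser.can_fetch("*", "https://www.example.com/private/report.html")
allowed = parser.can_fetch("*", "https://www.example.com/blog/")

print("private page may be fetched:", blocked)
print("blog page may be fetched:", allowed)
```

Because the crawler never fetches the disallowed page, it never gets to read any meta robots tag on it either, so an external link can still put the bare URL into the index.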

Myth #2: This is a killer, and I hear it far too often. “Google cannot crawl your website properly because you did not use the ‘index, follow’ meta robots tag”.

BULLSHIT! If the meta robots tags are missing from the website, it is business as usual for the search engines: they will attempt to crawl and index everything.

Myth #3: If you are redeveloping a website on a staging URL, using robots.txt to disallow the entire staging website means that search engines will 100% not be able to access it.

Unless it is password protected or has the “noindex, nofollow” meta robots tag on all pages, there is still a risk that search engines might be able to find the staging website. The staging site could be linked to by accident from external sources.

I hope you found this post useful and liked the illustrations that I used to try and get the message across. Please feel free to drop me a line in the comments box below if you have any feedback or questions.