Tuesday, 7 October 2014

Understanding the Importance of the robots.txt file

You’ve created a whole heap of relevant content for your website.  You’ve got some good inbound links from high-ranking websites, and your website is fully optimised for all the keywords and key phrases your customers are searching for – great.
But how is your robots.txt file doing?
This little file can make a world of difference to whether or not your site gets the page ranking it deserves.

What is the robots.txt file?

When search engine crawlers (robots) visit a website, the first file they look at is not your index.html or index.php page – it is your robots.txt file.
This little file that sits in the root “/” of your website contains instructions on what files the robot can and cannot look at within the website.
Here’s a typical robots.txt file example (line numbers are for illustration purposes only):
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.mydomain.com/sitemap.xml.gz
OK, so what does the above example mean?  Let’s go through it line by line.
Line 1: The “User-agent: *” means that this section applies to all robots.
Line 2: The “Disallow: /cgi-bin/” means that you don’t want any robots to index any files in the “/cgi-bin/” directory or any of its sub folders.
Line 3: Left blank intentionally for readability.
Line 4: The “Sitemap: http://www.mydomain.com/sitemap.xml.gz” tells the robot where to find a ready-made sitemap listing the structure of mydomain.com, so it doesn’t have to discover every page by itself.
So, as you can see from the example above, the robots.txt file contains instructions for the robot on how to index your website.
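You can check how a compliant crawler would interpret rules like these using Python’s standard-library robots.txt parser. This is a minimal sketch – the rules and the mydomain.com URLs are just the hypothetical example from above:

```python
from urllib.robotparser import RobotFileParser

# The example rules from above (mydomain.com is a placeholder domain).
rules = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved crawler asks this question before fetching each URL.
print(parser.can_fetch("*", "http://www.mydomain.com/index.html"))       # allowed
print(parser.can_fetch("*", "http://www.mydomain.com/cgi-bin/mail.cgi")) # blocked
```

The first check succeeds because nothing disallows the page; the second fails because the path falls under “/cgi-bin/”.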

Do I need one?

No.  You don’t need a robots.txt file, and most search engine crawlers will simply index your entire website if you don’t have one.  In fact, there is no requirement for any crawler to read your robots.txt file, and some malware robots – those that scan websites for security vulnerabilities or harvest email addresses for spammers – will pay no attention to the file or what is contained within.

So what’s all the fuss about?

Well, there are two issues to address here: do you know whether you have a robots.txt file and what it contains?  And is there anything on your website you don’t want a robot to see?
Let’s look at them both in turn.

Do you have a robots.txt file and what’s inside it?

By far the easiest way of finding out if your website has a robots.txt file is to type in your website address with “/robots.txt” appended to the end such as: www.mydomain.com/robots.txt where mydomain.com is the name of your domain.
If you receive an “Error 404 Not Found” page then there is no file.  It’s still worth reading the rest of this section, though, as we’ll see just how much damage a malformed file can do!
OK – if you haven’t got an error page displayed then there’s a pretty good chance you’re looking at your website’s robots.txt file right now, and that it is similar to the example a few sections ago.
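If you’d rather check from a script than a browser, a small helper can fetch the file and treat a 404 as “no robots.txt”. This is a sketch using Python’s standard library; www.mydomain.com remains a placeholder:

```python
import urllib.request
import urllib.error

def robots_url(domain):
    """Build the conventional robots.txt URL for a domain."""
    return "http://" + domain + "/robots.txt"

def fetch_robots_txt(domain):
    """Return the robots.txt text for a domain, or None on a 404."""
    try:
        with urllib.request.urlopen(robots_url(domain), timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # the site has no robots.txt file
        raise

# Usage (placeholder domain):
# text = fetch_robots_txt("www.mydomain.com")
```

`None` means the 404 case described above; anything else is the file’s contents, ready for inspection.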
Let’s just jump ahead a little and see how useful the file can be in protecting the sensitive parts of your website before we tackle the problems it can cause.

Got anything to hide?

If your website interacts with customers through forums, blogs, databases, or newsletter subscriptions, then all that sensitive and private data is being stored in a file somewhere on your website – whether it’s a database or a configuration file doesn’t matter.
Search engine crawlers are a lot like simple insects.  They have a purpose in life to index website content and index they will – everything, unless instructed otherwise.
Private and sensitive data should always be encrypted when stored but in reality, for small business websites, it largely isn’t.  This may be because the particular software components your website uses don’t have encryption abilities, or because it was a speed-versus-security trade-off.
Regardless, a robot crawler will index all the plain text content in all the files on your website.  It has no morals.  So let’s give it some.
Just say, for example, you have a “/newsletters” folder that contains all the regular newsletter emails you send out to your website subscribers, whose email addresses and subscription passwords are stored in a “/newsletters/admin/subscribers.txt” file.
To get lots of good, relevant content, you want the robot crawlers to index all your email newsletters, but you certainly don’t want them to pick up your subscribers’ email addresses or passwords.  Just imagine one of your subscribers searching Google for their email address and up comes your website, www.mydomain.com, at #1 on the search page with their email address and password!  Yikes – that’s not good PR.
Thankfully you can use the robots.txt file to exclude parts of your website that shouldn’t be indexed.  In our example above, you would create a line such as “Disallow: /newsletters/admin/”
This means that anything within the folder “/newsletters/admin/” shouldn’t get indexed by robot crawlers that adhere to the standards.
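Putting that together, the robots.txt for this hypothetical newsletter setup would look something like this – the newsletters themselves stay indexable while the admin folder is excluded:

```text
User-agent: *
Disallow: /newsletters/admin/
```

Note that the rule excludes the whole “/newsletters/admin/” folder, not just the subscribers.txt file, so anything else added there later is covered too.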

The dangers of a robots.txt file

OK – as we’ve seen from the examples above, robot crawlers assume that everything on your website is fair game for indexing unless specified otherwise in the robots.txt file.
One of the biggest mistakes that people make is to disallow the root “/” of the website.  This is the starting folder for the entire website.  If you disallow this folder then you are effectively telling all the robots not to index any part of your website and this will be catastrophic for your marketing campaigns.  Check your file to make sure that robots are not being turned away at the front door.
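You can see just how total this mistake is with the same standard-library parser used earlier – under a root disallow, every URL on the (placeholder) domain is blocked:

```python
from urllib.robotparser import RobotFileParser

# The catastrophic mistake: disallowing the root "/" for all robots.
bad_rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(bad_rules.splitlines())

# Compliant crawlers may now fetch nothing at all.
print(parser.can_fetch("*", "http://www.mydomain.com/"))           # blocked
print(parser.can_fetch("*", "http://www.mydomain.com/about.html")) # blocked
```

Every path begins with “/”, so a root disallow matches the entire site – exactly the front-door turn-away described above.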
Have a look through your website structure paying close attention to the folder names.  Sometimes you can pinpoint folders that could potentially contain sensitive and private data.  These are the ones you should be preventing robot crawlers from indexing.
Other types of folders that you don’t want a search engine robot crawler poking around in are those containing executables – for example, your /cgi-bin/ or equivalent.  This folder can contain web programs that would normally be run by users of your website after, say, entering information into a web form, but if they are accessed by a robot crawler (sometimes the same as being run), they can produce unwanted results.
An example of this would be the program your website uses to issue email newsletters.  If the program has been developed and tested correctly, then running it unexpectedly with no form input should not be a problem – but what if the program was developed in a rush and not tested thoroughly?  A robot crawler activating such a program could cause it to behave in all sorts of strange ways.  The last thing you need is your 10,000 newsletter subscribers receiving hundreds of unwanted duplicate newsletters every day or week.
Also, highlighting the areas of your website that you don’t want robots to look in does raise a flag of interest that malicious robot crawlers could exploit.  Where better to look for sensitive data than the places you’re not meant to be?  It’s a risk that you may have to take.

Best Practice

Dangers aside, pretty much all websites will have a robots.txt file to help control the indexing of content.
To get the most out of using a robots.txt file, try to adhere to these simple rules.
  1. If your website is static with no customer information – don’t use one.
  2. Check that you are not disallowing the root folder “/”.
  3. Make sure you disallow any folders that may contain private and sensitive data.
  4. Disallow any folders containing executable web programs.
  5. If your website already has a sitemap generated, add it to the file to help indexing.
  6. Don’t use comments in the file.
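Applying those rules to the hypothetical mydomain.com example used throughout this post, a complete robots.txt might look like this:

```text
User-agent: *
Disallow: /cgi-bin/
Disallow: /newsletters/admin/

Sitemap: http://www.mydomain.com/sitemap.xml.gz
```

Executables and the sensitive admin folder are excluded, the root stays open, and the sitemap points crawlers at the content you do want indexed.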
Visit our Website - www.3kits.com