The robots.txt File
What a robots.txt File Does
A robots.txt file tells robots, also called "spiders" (same thing, different name), where they may go on your web site. A spider is simply a program that accesses your web site and follows your links looking for information. The most common spiders are search engine robots. There may be parts of your site that you don't want displayed in search results. To keep well-behaved spiders away from those parts, you need a robots.txt file.
When a spider visits your web site, the first file it looks for is a file called robots.txt in your main directory. Here is the URL of the robots.txt file for this site:
http://www.daves-web-help.com/robots.txt
Notice that it is in the base directory. Anyone can view a robots.txt file, so anyone can see which parts of your web site you are trying to keep robots out of. The section "How to Keep Out Prying Eyes" below explains how to keep people out of those directories.
After the spider digests the spider food in the robots.txt file, it crawls through your site based on the rules you have put in place.
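Because the file always lives in the base directory, a spider can work out its address from any page on the site. Here is a minimal sketch of that step in Python; the function name robots_url_for and the page path some-dir/page.htm are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is hypothetical; only the host matters.
print(robots_url_for("http://www.daves-web-help.com/some-dir/page.htm"))
```

Whatever page the spider starts from, the directory and file name are dropped and only the host is kept.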
Note: Not all spiders adhere to the rules in the file, or even fetch the file and look at it. These are called rogue spiders. Rogue spiders crawl anywhere and everywhere they feel like, gathering links, email addresses, and anything else they find interesting, and generally wasting your bandwidth. A robots.txt file can't stop them; blocking them requires other methods, such as server-side coding or .htaccess files. Check the additional resources at the bottom of this page for more information on banning rogue spiders.
Do You Need a robots.txt File?
If you read the section above and are still not sure whether you need a robots.txt file, ask yourself this question: do you have content that is updated often? If you do, you may want to keep search engines from indexing that content, because the indexed copy can be out of date by the time someone finds it. You can also keep spiders from accessing and indexing your image files or JavaScript files to save bandwidth when the spider is crawling your site.
At the very least, you should have a robots.txt file with this code in it if you want all spiders to access your entire web site.
User-agent: *
Disallow:
By doing this, you won't have a bunch of unnecessary 404 (Not Found) error codes when spiders ask for a robots.txt file.
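If you want to convince yourself that an empty Disallow line really lets everything through, one option is to feed the two lines to a parser. The sketch below uses Python's standard urllib.robotparser module; the robot name "anybot" and the three paths are invented for the test.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# An empty Disallow value excludes nothing, so every path is allowed.
# "anybot" and the paths are hypothetical names used only for this check.
for path in ("/", "/images/logo.gif", "/private/notes.htm"):
    print(path, parser.can_fetch("anybot", path))
```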
If you want to keep all spiders out of certain directories, it is best to password protect those directories.
The Exclusion Standard
By default, a spider (robot) is allowed to crawl and index your entire site. What you do is exclude directories and files from the spider's crawl. Think of it like cutting out a section of a real spider's web that you don't want the spider crawling on (I guess you'll just have to ignore the fact that the spider built its web and you have no business messing with it).
First off, to create a robots.txt file, open Notepad and save an empty file with the filename robots.txt. If you don't have Notepad, use any text editor and save the file as robots.txt with the type "Text Only" or whatever option you have that doesn't save any formatting. OK, easy part done.
Here is the syntax you will use to block spiders from your files and directories.
A hash (pound) sign says everything else on this line is a comment
# This is a comment
This line indicates all robots are to follow these rules
User-agent: *
This line indicates that Googlebot is to follow these rules
User-agent: googlebot
This line would exclude your entire site from the robot
Disallow: /
This line would exclude a file in your main directory called myfile.htm
Disallow: /myfile.htm
This line excludes a directory called "example" and all its subdirectories
Disallow: /example/
This line would exclude the spider from every URL whose path begins with /this. That covers a file called this.htm as well as a directory called "this" and everything under it.
Disallow: /this
This line excludes a file called test.htm in directory "other-example"
Disallow: /other-example/test.htm
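The common thread in these rules is that a Disallow value is matched against the start of the URL path. Here is a small Python sketch of that idea only; it is not how any particular robot is implemented, the helper name is_blocked is made up, and real robots may add extras such as wildcards.

```python
def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """Return True if path starts with any non-empty Disallow value."""
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# The Disallow values from the syntax examples above.
RULES = ["/myfile.htm", "/example/", "/this", "/other-example/test.htm"]

# Hypothetical paths chosen to show where the prefix does and doesn't match.
for path in (
    "/myfile.htm",
    "/example/sub/page.htm",
    "/example.htm",
    "/this.htm",
    "/this/sub/page.htm",
    "/other-example/index.htm",
):
    print(f"{path}: {'blocked' if is_blocked(path, RULES) else 'allowed'}")
```

The trailing slash is what matters: /example/ leaves a file called /example.htm alone, while /this catches /this.htm along with the whole "this" directory.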
So let's say you want Google's robot to go anywhere, the Araneo robot to go away and not crawl or index any part of your site, and any other robot to stay out of directory "no-unknown-bots". Let's say no other robots are allowed to access a page in directory "logs" called "only-google.htm". Here would be your robots.txt file.
User-agent: googlebot
Disallow:

User-agent: araneo
Disallow: /

User-agent: *
Disallow: /no-unknown-bots/
Disallow: /logs/only-google.htm
Look in the additional resources below to find a link to a robots.txt validator to make sure your syntax is correct. Note that the asterisk (*) means "any other robot", not "all robots": a robot that has its own block follows that block and ignores the (*) block.
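If you have Python handy, you can also try the file out yourself before uploading it. This is just a rough check using the standard urllib.robotparser module, not a replacement for a validator; "otherbot", /index.htm and /no-unknown-bots/page.htm are invented names standing in for an unknown robot and ordinary pages.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
Disallow:

User-agent: araneo
Disallow: /

User-agent: *
Disallow: /no-unknown-bots/
Disallow: /logs/only-google.htm
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# (robot name, path) pairs to test; "otherbot" is a hypothetical robot
# that has no block of its own and so falls under the (*) block.
checks = [
    ("googlebot", "/logs/only-google.htm"),
    ("araneo", "/index.htm"),
    ("otherbot", "/no-unknown-bots/page.htm"),
    ("otherbot", "/logs/only-google.htm"),
    ("otherbot", "/index.htm"),
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, path))
```

Only the first and last checks should come back True: Google's robot may fetch anything, and an unknown robot may fetch anything outside the two excluded paths.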
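Put only one robot name on each User-agent line. A comma-separated list is NOT OK:
User-agent: googlebot, araneo
Disallow: /
The exclusion standard does let several User-agent lines share one block of rules, as shown next, but if you want to be sure every robot reads the file the way you intend, give each robot its own block:
User-agent: googlebot
User-agent: araneo
Disallow: /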
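To see how one particular parser treats these two forms, you can feed both to it. The sketch below uses Python's standard urllib.robotparser module; the helper name blocked_agents is made up, and other robots may well read the same lines differently.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt: str, agents: list[str]) -> list[str]:
    """Return the agents that may not fetch the site root under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


COMMA_FORM = "User-agent: googlebot, araneo\nDisallow: /\n"
STACKED_FORM = "User-agent: googlebot\nUser-agent: araneo\nDisallow: /\n"

# Compare which of the two robots each form actually keeps out.
print("comma:", blocked_agents(COMMA_FORM, ["googlebot", "araneo"]))
print("stacked:", blocked_agents(STACKED_FORM, ["googlebot", "araneo"]))
```

With this parser the comma-separated line matches neither robot, so nothing is blocked, while the two stacked User-agent lines block both.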
How to Keep Out Prying Eyes
If you are going to have a robots.txt file, chances are good that some people will want to peruse the directories it lists and view your files. Here is what I have done to keep this from happening. If it is a directory I just don't want indexed, then I don't care if anyone sees it. If I don't really want people in there, I create a file in the directory called index.htm and make it redirect to my home page and presto, no one can browse my directory. This isn't going to work if you name a specific file in your robots.txt file and you don't want anyone viewing it. The best thing to do is to put the file in a directory of its own and bar the directory from being accessed.
Spiders can't see what isn't linked to, so if you have a directory that you don't want them to go in and there are no links to it, it is best to just leave it out of the robots.txt file. The reason is that some rogue spiders will look at your robots.txt file just to find out about directories that aren't linked to.
Another way to keep people from viewing the contents of your directories is to use .htaccess if you have it available to you. Here is the syntax you need to add various options to your .htaccess file.
To turn indexes off in your directories, use this line of code:
Options -Indexes
If your server has indexes disabled and you want to turn indexes on, or if you have a .htaccess file in a directory that you want indexes on, use this line of code:
Options +Indexes
If you want to have your directory contents listed, but want certain file types to be excluded from the listings, use this code:
IndexIgnore *.extension *.extension
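Replace extension with the extension of the files you don't want listed (e.g. *.gif *.jpg *.jpeg). If you were to put IndexIgnore * in the file, every file would be hidden from the listing, which has much the same effect as Options -Indexes, except that visitors see an empty listing instead of an error page.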
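To get a feel for which files a set of IndexIgnore patterns would hide, here is a rough Python approximation using shell-style wildcards. It only imitates the idea; the server's own matching may differ in details, and the helper name visible_entries and the file names are made up.

```python
from fnmatch import fnmatchcase


def visible_entries(names: list[str], ignore_patterns: list[str]) -> list[str]:
    """Return the names that no IndexIgnore-style pattern would hide."""
    return [
        name
        for name in names
        if not any(fnmatchcase(name, pattern) for pattern in ignore_patterns)
    ]


# Hypothetical directory contents.
files = ["index.htm", "logo.gif", "photo.jpg", "notes.txt"]

# Hide images only, then hide everything with a bare asterisk.
print(visible_entries(files, ["*.gif", "*.jpg", "*.jpeg"]))
print(visible_entries(files, ["*"]))
```

The first call leaves index.htm and notes.txt in the listing; the second leaves nothing, which is why a bare asterisk gives you an empty listing.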
