How and Why to Build a Robots.txt
Some of you have asked, 'How do I keep search engine A from indexing pages designed for search engine B?' The answer is to use a robots.txt file. There are also other reasons for wanting to keep search engines from indexing some or all of the pages on a site. Therefore, I've put together this detailed article to show you how to do that, and how to avoid the common mistakes that are made all too often.
If you create different versions of essentially the same doorway page and every search engine indexes every copy of the page, then you could, in theory, get in trouble for spamming. AltaVista in particular is known to dislike duplicate or near duplicate content. Therefore, if you create pages that are too similar, you run the risk of being red-flagged.
In practice, many people don't worry about having too many duplicate pages indexed by a single search engine because they are not creating huge numbers of similar pages. In fact, I spoke to the CEOs of three search engine positioning companies. Each said they did not use robots.txt files, although their reasons varied.
If your pages vary enough in size and wording, then you should not need to worry about being red-flagged. If you focus primarily on optimizing existing pages on your site that have unique content, rather than building a lot of new pages that are similar, you'll also avoid any potential problems.
Other people simply submit pages designed for a particular search engine only to the engine(s) to which the page applies. This is the simplest way to avoid spamming the search engines, and it works IF there are no other links to that page from the rest of your site. However, if another search engine's spider manages to find a link to that page, it could index the page even though you never submitted it to that engine.
Despite this, two of the consulting companies I spoke to said submitting hallway pages that pointed to just the pages built for that engine worked well for them and they’d done it successfully for years.
However, if you want to create a lot of doorway pages, targeted at a large number of engines, where many will be very similar to each other, you should consider using a robots.txt file. This file tells the search engine spiders which pages they are not allowed to index. That way you can build pages for search engine A and tell search engine B to ignore them. The search engines like this because it keeps them from indexing pages that don't apply to them. It benefits the search engines and their users, and it keeps you from being labeled a spammer.
I have seen some people debate whether the search engines even honor robots.txt files, since this is a purely voluntary feature of the Web. However, the search engines have historically been challenged by companies - and in the courts - about indexing copyrighted materials without the permission of the copyright holder. The search engines' most prominent argument for being able to index copyrighted material without permission is that the Web site owner always has the option to exclude them by creating a robots.txt file.
Therefore, it's unlikely the search engines would intentionally ignore a robots.txt file, or they could get themselves into unnecessary legal problems. They might, in theory, spider a page and then drop it after checking the robots.txt file. This may explain reports I've heard from a couple of people who claim a spider ignored their robots.txt file because they saw it opening the page in their log files. Another explanation is that the Webmaster used the wrong syntax when creating the robots.txt. Therefore, always double-check your work.
I'll try to point out common errors in this article and give plenty of examples. Please don't be intimidated. It really is not as hard as it looks. There is also a method where you can set everything up once and then never have to mess with it again. I'll explain this method near the end of this article.
To create a robots.txt file, open Windows Notepad or any other editor that can save plain ASCII .txt files. Use the following syntax to exclude a file from a particular search engine spider:
User-agent: {SpiderNameHere}
Disallow: {FilenameHere}
Note: For the purpose of this article, the term spider and search engine may be used interchangeably.
For example, to tell Excite's spider, called ArchitextSpider, not to index the files orderform.html, product1.html, and product2.html, create a robots.txt file as follows:
User-agent: ArchitextSpider
Disallow: /orderform.html
Disallow: /product1.html
Disallow: /product2.html
The convention is to spell it 'User-agent:' rather than 'User-Agent:'. Whether a difference in case causes a problem in practice, I cannot say for certain, so to be safe, keep the names in the correct case. In addition, make sure you include a forward slash before the file name if the file is in the root directory.
The User-agent line is the identifier for the search engine you wish to target. It is like a 'code name' for the search engine's spider that goes around and indexes pages on the Web. It may be similar to the name of the search engine or it may be completely different. (I'll list the official User-agent names for the major engines later in this article).
Once you create your robots.txt, you would then upload this text file to the root directory of your Web site. Although robots.txt is a voluntary protocol, most major search engines will honor it. If you do not have your own domain name but instead use a subdirectory off of your host's domain, then your robots.txt may not be recognized in theory since standard practice is to look only at the root directory of the domain. This is just one more reason to invest in your own domain name!
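For example, if your site lives at www.yourdomain.com (a placeholder name used here only for illustration), the spiders will look for the file at exactly one address:

http://www.yourdomain.com/robots.txt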
You can add additional records to exclude pages from other engines by specifying the User-agent line again in the same file, followed by more Disallow lines. Each Disallow statement is applied to the most recent User-agent line(s) specified above it.
If you want to exclude an entire directory, use this syntax:
User-agent: ArchitextSpider
Disallow: /mydirectory/
A common mistake is to include the asterisk after the directory name to indicate that you want to exclude all files in that directory. However, the proper syntax is to NOT include any asterisks in the Disallow statement. According to the robots.txt specifications, it is implied that the above statement will disallow all files in 'mydirectory.'
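To illustrate, using the same example directory, this is the common mistake:

Disallow: /mydirectory/*

and this is the proper form, which excludes every file in 'mydirectory':

Disallow: /mydirectory/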
To disallow a file named product.htm in the 'mydirectory' subdirectory, do this:
User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
You can exclude pages from ALL spiders with this User-agent:
User-agent: *
In the case of the User-agent line, you CAN use the asterisk as a wildcard.
To disallow all pages on your Web site for the specified spider use:
Disallow: /
To re-iterate, you use only a forward slash to indicate you want to disallow your entire site. Do NOT use an asterisk here. It's important that you use the proper syntax. If you misspell something, it may not work and you won't know it until it's too late! It is possible that certain search engines may handle common syntax variations without problems. However, this doesn't guarantee that they will all tolerate variances in the syntax. Therefore, play it safe. If at some point you do find that your syntax was wrong, don't panic. Correct the problem and then re-submit. The search engine will then re-spider the site and drop the pages that you excluded.
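Putting those two pieces together, a complete record that asks every compliant spider to stay out of the entire site looks like this:

User-agent: *
Disallow: /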
If you wish to include comments in your robots.txt file, you should precede them with a # sign like this:
# Here are my comments about this entry.
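For example, a commented entry (reusing the order form page from the earlier example) might look like this:

# Keep Excite's spider away from the order form
User-agent: ArchitextSpider
Disallow: /orderform.html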
Each set of disallow statements should be separated by a blank line. For example, you might have something like the following to exclude different files from different spiders:
User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm

User-agent: Infoseek
Disallow: /mydirectory/product3.htm
Disallow: /mydirectory/product4.htm
The blank line between the two groups is important to group things into 'records.'
If, on the other hand you wanted to exclude the same set of files for more than one spider, you could do something like this:
User-agent: ArchitextSpider
User-agent: Infoseek
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm

Side note about subdirectories: Some Webmasters like to organize their doorway pages into different subdirectories according to which search engine they are optimized for. However, some engines are suspected of assigning lower rankings to pages appearing in subdirectories versus the root directory of a Web site. If they perceive that those pages belong to a Web site that shares a domain with its host, they could discriminate against those pages as being potentially of lesser quality. I asked three search engine consultants their opinion of subdirectories. The general feeling was that pages in the root directory were probably better, but they had not seen evidence that subdirectories caused problems.
If you were still concerned about being penalized for keeping pages in subdirectories and wished to use them, you could ask your hosting service to give you 'machine names' like myproduct.mydomain.com that you could submit. The myproduct.mydomain.com URL could then be configured by your hosting service to point to your 'myproduct' subdirectory or whatever directory you desired. That way the search engines could not discriminate, since they would not see the subdirectory in the URL. In addition, you could include keywords in that machine name, which may also improve your rankings. (Note: A machine name is normally just 'www.' prefixed to the start of your domain name. However, rather than 'www.' it could be any name you desire, and it could point to any location on any physical machine.)
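As a rough sketch, with a hypothetical page called page1.html stored in the 'myproduct' subdirectory, the machine name is simply another door into the same location:

myproduct.mydomain.com/page1.html  -->  www.mydomain.com/myproduct/page1.html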
We are often asked about the proper names for the User-agent. The name of the agent does not always correspond to the name of the search engine. Therefore, you can't just put in 'AltaVista' in the User-agent and expect AltaVista to exclude your designated pages. Don't ask me why it can't be that simple. Perhaps it's a job security plan for professional Webmasters :-)
In any case, there's a lot of confusion in newsgroup forums and on the Web about what the proper agent names should be. The confusion derives from Webmasters reading their server log files and noticing all kinds of complicated agent names being logged such as Scooter/2.0 G.R.A.B. X2.0, InfoSeek Sidewinder/0.9, or Slurp/2.0. However, the agent names listed in your log are not necessarily what you are expected to use in your robots.txt file.
The reason is very logical when you think about it. A name like InfoSeek Sidewinder/0.9 in a robots.txt file is not very useful if the search engine updates its agent software and decides to start using InfoSeek Sidewinder/2.0 as its new name next month. Would it make sense to expect millions of Webmasters to know this and all update their robots.txt files to the new name? Would they expect people to update the file EVERY TIME any search engine updated its agent version number, and to do it precisely when the name change occurred? It's not likely.
In reality, the name that needs to appear in the robots.txt file is whatever name the search engine spider is programmed to look for. Therefore, the best source of information for this name is not your log files but the help files on the search engine itself. In theory, a search engine could look for a wide variety of name variations. However, in general they will simply look for the least common denominator, such as 'Scooter' rather than 'Scooter/2.0'. If the search engine is smart, it will allow you to use Scooter/2.0 too, but that is not guaranteed. Therefore, if you've already set up a robots.txt file on your site, double-check the syntax and the agent names against the list below. All names are case sensitive.
Here are the User-Agent names that we have compiled. Most of these came directly from the search engines' own help files or, when those were not available, from other respected sources:
Search Engine: User-Agent

AltaVista: Scooter
Infoseek: Infoseek
HotBot: Slurp
AOL: Slurp
Excite: ArchitextSpider
Google: Googlebot
GoTo: Slurp
Lycos: Lycos
MSN: Slurp
Netscape: Googlebot
Northern Light: Gulliver
WebCrawler: ArchitextSpider
iWon: Slurp
Fast: Fast
DirectHit: Grabber
Yahoo Web Pages: Googlebot
LookSmart Web Pages: Slurp

(This is a very old list, included for demonstration purposes only.)
You'll notice that many of the engines use the 'Slurp' agent, which is the Inktomi spider used on HotBot and other Inktomi-related sites. Unfortunately, I'm not aware of a way to exclude pages from the HotBot spider without also excluding them from all other Inktomi sites. As far as I can tell, they use the same spider to index the pages and thereby recognize only one User-agent string in the robots.txt file. (If I am wrong, please reply to this e-mail and let me know how this is done!)
The individual Inktomi sites tend to rank the pages differently, although they will often be rather similar. Normally you can create a handful of pages that will rank well on most of the Inktomi powered sites, so the duplicated content issue does not normally become a big problem with Inktomi.
If you're now scratching your head about how this all comes together in relation to the doorway pages you've created, here are two examples of how a robots.txt file can be put together.
The first one shows how you can disallow INDIVIDUAL files:
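A minimal sketch of this approach, using hypothetical doorway file names and the User-Agent names from the list above, might look like the following. Each record keeps the other engines' spiders away from the pages built for one particular engine:

# Doorway pages built for AltaVista (hypothetical file names)
User-agent: ArchitextSpider
User-agent: Slurp
User-agent: Googlebot
Disallow: /alta1.htm
Disallow: /alta2.htm

# Doorway pages built for Excite (hypothetical file names)
User-agent: Scooter
User-agent: Slurp
User-agent: Googlebot
Disallow: /excite1.htm
Disallow: /excite2.htm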
The second example shows how you can group your doorway pages into DIRECTORIES and disallow the entire directory.
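Again as a sketch, assuming hypothetical subdirectories named 'altavista' and 'excite' that hold the corresponding doorway pages, the whole file could be as short as this:

# Doorway pages built for AltaVista (hypothetical directory name)
User-agent: ArchitextSpider
User-agent: Slurp
User-agent: Googlebot
Disallow: /altavista/

# Doorway pages built for Excite (hypothetical directory name)
User-agent: Scooter
User-agent: Slurp
User-agent: Googlebot
Disallow: /excite/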
The advantage to method #1 is that it can be more flexible for working with a small number of files already on your site, and it 'might' be a little safer. Some people believe that locating your doorways in the root directory rather than a subdirectory can give you a ranking advantage. The theory is that the search engines might discriminate against sites that don't have their own domain name, so pages submitted in a subdirectory could be perceived as sharing a domain with their host.
The disadvantage to method #1 is that if you have very many doorway pages then the size of your robots.txt file could be enormous. This runs the risk that a search engine might have problems with a robots.txt file that exceeds a certain reasonable size. It might also slow down the spider from accessing your site if it must read in an extremely large robots.txt file. Lastly, a robots.txt with a lot of entries in it could be a red-flag in itself to a search engine. This is all speculation, but it’s enough that I would avoid excluding a lot of files individually if you don’t have to.
Example method #2 is to organize your doorway pages into subdirectories for each search engine. The advantage to method #2 is that it is much easier to track your doorway pages if they are organized in separate subdirectories. In addition, the size of your robots.txt will be relatively small. You'll also not need to update the file every time you upload new doorway pages. Once the robots.txt is set up with method #2, all you have to do is upload each new page to the appropriate directory, submit it, and you're done!
So do the engines discriminate against files in subdirectories? The consultants I talked to did not think so. Based on these conversations, if you properly design a hallway page in your ROOT directory that links to the doorway pages in your subdirectory, and submit that hallway page, then you’ll be fine. This demonstrates to the engine that the pages are most likely sub-pages of the main site. In addition, it would be dangerous for the search engines to penalize pages in subdirectories since most large Web sites must organize their pages into subdirectories to avoid complete chaos. As an added precaution, you could assign machine names to subdirectories as I mentioned earlier in this article. If you have any experience, comments, or observations on this issue, please let me know by replying to this e-mail.
My conclusion: If all your pages have good content and are fairly unique, don’t worry about robots.txt files. If you focus only on optimizing existing pages on your site, don’t worry about a robots.txt. If, however, you decide you need to experiment with more than a handful of pages that are rather similar, consider making use of the robots.txt file, particularly with AltaVista. Use example method #1 if you’re only dealing with a small number of pages or special scenarios. Otherwise, organize your files into directories and use example method #2.
There are online services that will check the syntax of your robots.txt file. However, I tested one on a couple of files and it sometimes complained about things that were perfectly valid. Therefore, in my opinion, such a service may be too buggy to be of great use. If it points out errors in your file, refer to this article or the article below to verify that the errors it catches are real. If you know of a better syntax checker, let me know and I'll pass the information along and give you and your Web site credit for the tip. I will also try to give you credit for any other search engine marketing tips I end up using that I was not aware of!
Note: The information presented here was adapted, under license agreement, from FirstPlace Software.