Background
The robots.txt file provides guidance to search agents (from Google, Yahoo, MSN, etc.) on how to crawl your web site. Well-behaved search agents (aka “bots”, “crawlers”, or “spiders”) look for a robots.txt file in the root directory of the web site to determine permissions, restrictions, or hints for properly crawling and indexing the site. Observance of the controls expressed in a robots.txt file is strictly voluntary on the part of the search agents and is not a substitute for appropriate security measures on any sensitive information.
The robots.txt file is intended to be read and processed only by automated search agents. The rules expressed in the file do not apply to users visiting the site with a typical web browser (e.g. Internet Explorer, Firefox, Safari, etc.).
File Format
A robots.txt file consists of at least one record. Each record is a series of directives: one or more User-agent directives followed by one or more Disallow directives. Only one directive can be specified per line, and it should be in the format <directive>: <value>. Comments are specified by a leading “#” character.
While there are a number of other possible directives that can be specified within a record, the minimum required are a User-agent and a Disallow directive.
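The record layout described above can be sketched as a small parser. This is only an illustration, not how any particular search agent reads the file. It assumes that records are separated by blank lines and that a comment may also trail a directive on the same line; the function name parse_records is made up for this example.

```python
def parse_records(text):
    """Split robots.txt text into records, each a list of (directive, value) pairs."""
    records = []
    current = []
    for raw in text.splitlines():
        if not raw.strip():
            # Assumption: a blank line closes the current record.
            if current:
                records.append(current)
                current = []
            continue
        # Drop anything after the comment delimiter.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue  # comment-only or malformed line
        directive, value = line.split(":", 1)
        current.append((directive.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records


sample = """\
# crawl policy
User-agent: *
Disallow: /cgi-bin  # scripts

User-agent: ExampleBot
Disallow:
"""

for record in parse_records(sample):
    print(record)
```

Directive names are lower-cased so that "User-agent" and "user-agent" compare equal; values are left untouched.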
Location
The search agent looks for the robots.txt file at the root of the web site. If the server is running more than one site, then one file should be created for each site. The exact location may vary depending on the site configuration, but this can be checked in the Server Admin application. (Select the “Web” service, click on the “Sites” icon, and note the “Web Folder” value for each Domain Name.)
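Because the file always sits at the root of the site, a crawler can work out its address from any page URL. A minimal sketch, using a hypothetical helper named robots_url and the placeholder host www.example.com:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/doc/private/index.html?page=2"))
```

Each host name maps to its own file, which is why a server running several sites needs one robots.txt per site.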
Basic Directives
User-agent
The User-agent directive consists of either the name of a search agent or the wildcard character (“*”). The wildcard indicates that the rule set applies to all search agents; it can be specified only once in the robots.txt file.
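To see how a named record and the wildcard record interact, the file can be fed to Python's standard urllib.robotparser module. This is a sketch of one parser's behavior, not a statement about every search agent. The agent names ExampleBot and OtherBot are invented for the example. In this parser a record that names the agent is used in preference to the wildcard record.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# ExampleBot matches its own record and is blocked everywhere.
named_ok = parser.can_fetch("ExampleBot", "/index.html")
# Any other agent falls back to the wildcard record, which has no restrictions.
other_ok = parser.can_fetch("OtherBot", "/index.html")

print(named_ok, other_ok)
```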
Disallow
The Disallow directive follows the User-agent directive and specifies a path (either partial or full) that should not be accessed by the search agent. Multiple paths may be specified by using multiple Disallow directives within a record. Multiple paths should not be specified within a single Disallow directive. The value of the directive may be left blank to indicate there are no agent restrictions.
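The matching rule can be sketched as a simple prefix test. This is a simplified illustration under the assumption that a Disallow value matches any path that begins with it; the helper name is_disallowed is hypothetical.

```python
def is_disallowed(path, disallow_values):
    """Return True if path begins with any non-blank Disallow value."""
    for value in disallow_values:
        # A blank value means "no restrictions", so it never blocks anything.
        if value and path.startswith(value):
            return True
    return False


rules = ["/doc/private", "/cgi-bin"]
print(is_disallowed("/doc/private/report.html", rules))
print(is_disallowed("/doc/public/index.html", rules))
```

Because the value is treated as a partial path, "/doc/private" in this sketch also covers a file such as /doc/private-notes.html. Adding a trailing slash ("/doc/private/") limits the rule to the directory itself.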
Additional Directives
There are a number of additional directives that may be specified in a record, but not all of them are recognized by the various search agents. The only directives that seem to be universally parsed are the basic directives — User-agent and Disallow.
Crawl-delay
The Crawl-delay directive specifies a delay (in seconds) between requests by a search agent. The delay can be used to help prevent an aggressive search agent from overloading the site by issuing requests too rapidly.
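A crawler written in Python could read the delay with the standard urllib.robotparser module, as sketched below. The agent name ExampleBot and the 10-second value are placeholders chosen for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns None when no Crawl-delay applies to the agent.
delay = parser.crawl_delay("ExampleBot")
pause = delay if delay is not None else 0
print(f"wait {pause} seconds between requests")
```

A polite crawler would then sleep for that many seconds (for instance with time.sleep) before issuing its next request to the same site.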
Visit-time
The Visit-time directive specifies the acceptable time range for a search agent to crawl the site. This can be used to defer any search traffic that might otherwise occur during peak traffic times. Times are specified in 24-hour (military) format and are based on GMT (e.g. “2200-0530” would advise search agents to crawl the site from 10:00pm to 5:30am GMT).
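One detail worth illustrating is that a range such as 2200-0530 wraps past midnight. The sketch below shows one way an agent might test whether a given GMT time falls inside the window. The function names are hypothetical, and the sketch assumes the value is always two four-digit times joined by a hyphen.

```python
from datetime import time


def parse_visit_time(value):
    """Turn 'HHMM-HHMM' into a (start, end) pair of datetime.time objects."""
    def to_time(hhmm):
        return time(int(hhmm[:2]), int(hhmm[2:]))

    start, end = value.strip().split("-")
    return to_time(start), to_time(end)


def in_visit_window(now, start, end):
    """Return True if now lies within the window, allowing for midnight wrap."""
    if start <= end:
        return start <= now <= end
    # The window wraps past midnight, e.g. 2200-0530.
    return now >= start or now <= end


start, end = parse_visit_time("2200-0530")
print(in_visit_window(time(23, 15), start, end))  # True
print(in_visit_window(time(12, 0), start, end))   # False
```

In a real crawler the first argument would be the current GMT time rather than a fixed value.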
Request-rate
The Request-rate directive is similar in effect to the Crawl-delay directive, but instead of specifying an absolute time between requests it specifies a rate. For example, a rate of 1/30 indicates that at most 1 request should be issued every 30 seconds. Time-of-day restrictions can also be specified, which allows a more flexible policy with different rates for different times of day.
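The 1/30 example can be read back with Python's standard urllib.robotparser module and converted into a minimum interval between requests. This sketch covers only the plain requests/seconds form; it does not attempt the time-of-day variant. ExampleBot is a placeholder agent name.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Request-rate: 1/30",
    "Disallow: /cgi-bin",
])

# request_rate() returns a named tuple (requests, seconds), or None if unset.
rate = parser.request_rate("ExampleBot")
if rate is not None:
    min_interval = rate.seconds / rate.requests
    print(f"{rate.requests} request(s) per {rate.seconds}s, one every {min_interval:g}s")
```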
Comment
The Comment directive can be used instead of the comment delimiter (“#”). The Comment directive is intended to be returned to the administrator of the search agent and should contain a statement of the site’s crawl policy.
Allow
The Allow directive provides an exception within an otherwise disallowed area of the site. This gives a search agent a way past a broader restriction so that it can reach specific, deeply nested content.
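The effect of an Allow exception can be checked with Python's standard urllib.robotparser module. The file name faq.html and the agent name ExampleBot are invented for this sketch. Parsers differ in how they resolve overlapping Allow and Disallow rules, so the example lists the more specific Allow line first, which should give the same answer whether a parser takes the first match or the most specific one.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Allow: /doc/private/faq.html",  # hypothetical exception
    "Disallow: /doc/private",
])

# The single allowed file inside the otherwise disallowed directory.
faq_ok = parser.can_fetch("ExampleBot", "/doc/private/faq.html")
# Everything else under /doc/private stays off limits.
report_ok = parser.can_fetch("ExampleBot", "/doc/private/report.html")

print(faq_ok, report_ok)
```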
Robot-version
The Robot-version directive specifies which version of the Robot Exclusion Standard the rule set observes. It is intended primarily for site documentation purposes.
Examples
Open Access
The wildcard User-agent indicates the record applies to all search agents and the blank Disallow value means there are no restrictions.
User-agent: *
Disallow:
Blocked Access
The wildcard User-agent indicates the record applies to all search agents and the specification of the root directory (“/”) restricts the search agent from all content.
User-agent: *
Disallow: /
Multiple Restrictions
The wildcard User-agent indicates the record applies to all search agents and the Disallow directives restrict the search agents from content in the /doc/private and /cgi-bin directories.
User-agent: *
Disallow: /doc/private
Disallow: /cgi-bin
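Before publishing a file like this one, it can be useful to confirm that it blocks what is intended. A minimal check using Python's standard urllib.robotparser module is sketched below; the agent name ExampleBot and the test paths are placeholders.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /doc/private",
    "Disallow: /cgi-bin",
])

paths = [
    "/index.html",
    "/doc/public/guide.html",
    "/doc/private/report.html",
    "/cgi-bin/search",
]

# Map each path to True (crawlable) or False (restricted).
results = {path: parser.can_fetch("ExampleBot", path) for path in paths}
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```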
Timed Access
The wildcard User-agent indicates the record applies to all search agents and the Disallow directive restricts the search agents from content in the /doc/private directory. The Visit-time directive advises the search agent to attempt to crawl the site only between 11:00pm and 4:30am (GMT).
User-agent: *
Disallow: /doc/private
Visit-time: 2300-0430