Refining Search Traffic with Robots.txt

Background

The robots.txt file provides guidance to search agents (from Google, Yahoo, MSN, etc.) on how to crawl your web site.  Well-behaved search agents (aka “bots”, “crawlers”, or “spiders”) look for a robots.txt file in the root directory of the web site to determine permissions, restrictions, or hints for properly crawling and indexing the site.  Observance of the controls expressed in a robots.txt file is strictly voluntary on the part of the search agents and is not a substitute for appropriate security measures on any sensitive information.

The robots.txt file is intended to be read and processed only by automated search agents.  The rules expressed in the file do not apply to users visiting the site with a typical web browser (e.g. Internet Explorer, Firefox, Safari, etc.).

File Format

A robots.txt file consists of one or more records.  Each record is a series of directives: one or more User-agent directives followed by one or more Disallow directives.  Only one directive can be specified per line, and it should be in the format <directive>: <value>.  Comments are introduced by a leading “#” character.

While there are a number of other possible directives that can be specified within a record, the minimum required are a User-agent and a Disallow directive.
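The record layout described above can be sketched as a small parser.  This is a minimal illustration rather than a complete implementation: the function name parse_records and the sample file are hypothetical, and the sketch assumes that records are separated by blank lines and that anything after a “#” is a comment.

```python
def parse_records(text):
    """Split robots.txt text into records of user agents and rules."""
    records = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            current = None  # assumption: a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # comment-only or malformed line
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name == "user-agent":
            # A User-agent line at the start, or after rules, opens a new record.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                records.append(current)
            current["agents"].append(value)
        elif current is not None:
            current["rules"].append((name, value))
    return records


# Hypothetical sample file used only for illustration.
SAMPLE = """\
# Sample file
User-agent: *
Disallow: /cgi-bin   # scripts

User-agent: ExampleBot
Disallow:
"""

records = parse_records(SAMPLE)
for record in records:
    print(record["agents"], record["rules"])
```

Each record comes back as a dictionary holding the agent names and a list of (directive, value) pairs, which mirrors the one-directive-per-line format.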

Location

The search agent looks for the robots.txt file at the root of the web site.  If the server is running more than one site, then one file should be created for each site.  The exact location may vary depending on the site configuration, but this can be checked in the Server Admin application.  (Select the “Web” service, click on the “Sites” icon, and note the “Web Folder” value for each Domain Name.)
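Because the file always lives at the root of the site, a search agent can derive its address from any page URL.  The sketch below shows one way to do that with Python's standard urllib.parse module; the function name robots_url and the example.com address are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/docs/index.html?x=1"))
```

Since the host name is part of the result, each site hosted on a server yields a different address, which is why each site needs its own file.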

Basic Directives

User-agent

The User-agent directive consists of either the name of a search agent or the wildcard character (“*”).  The wildcard character indicates that the rule set applies to all search agents; a wildcard record can be specified only once in the robots.txt file.

Disallow

The Disallow directive follows the User-agent directive and specifies a path (either partial or full) that should not be accessed by the search agent.  Multiple paths may be specified by using multiple Disallow directives within a record.  Multiple paths should not be specified within a single Disallow directive.  The value of the directive may be left blank to indicate there are no agent restrictions.

Additional Directives

There are a number of additional directives that may be specified in a record, but not all of them are recognized by the various search agents.  The only directives that appear to be recognized by every search agent are the basic directives: User-agent and Disallow.

Crawl-delay

The Crawl-delay directive specifies a delay (in seconds) between requests by a search agent.  The delay can be used to help prevent an aggressive search agent from overloading the site by issuing requests too rapidly.

Visit-time

The Visit-time directive specifies the acceptable time range for a search agent to crawl the site.  This can be used to defer any search traffic that might otherwise occur during peak traffic times.  The times specified are in 24-hour (military) format and are based on GMT.  (e.g. “2200-0530” would advise search agents to crawl the site from 10:00pm to 5:30am GMT.)
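A time range such as “2200-0530” wraps past midnight, which is easy to get wrong when checking it.  The following is a hand-rolled sketch of how a crawler might interpret the value; the function names parse_visit_time and in_window are hypothetical, and the sketch assumes the HHMM-HHMM form shown above.

```python
from datetime import time


def parse_visit_time(value):
    """Turn 'HHMM-HHMM' into a (start, end) pair of time objects."""
    start, end = (part.strip() for part in value.split("-"))
    return (time(int(start[:2]), int(start[2:])),
            time(int(end[:2]), int(end[2:])))


def in_window(now, start, end):
    """Return True if now (a GMT time of day) falls inside the window."""
    if start <= end:
        return start <= now <= end
    # The window wraps past midnight, e.g. 2200-0530.
    return now >= start or now <= end


start, end = parse_visit_time("2200-0530")
print(in_window(time(23, 0), start, end))  # late evening
print(in_window(time(12, 0), start, end))  # midday
```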

Request-rate

The Request-rate directive is similar in effect to the Crawl-delay directive.  However, instead of specifying an absolute time between requests, it specifies a rate.  For example, a rate of 1/30 indicates that a maximum of 1 request should be issued every 30 seconds.  Time-of-day restrictions can also be specified, which allows a more flexible policy with different rates for different times of day.
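To see how a crawler might read these two throttling directives, the sketch below uses the RobotFileParser class from Python's standard urllib.robotparser module (its crawl_delay and request_rate methods require Python 3.6 or later).  The rule values are made up for illustration, and this parser handles only the plain requests/seconds form of Request-rate, not the time-of-day variant.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set used only for illustration.
rules = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Request-rate: 1/30",
    "Disallow: /cgi-bin",
]

parser = RobotFileParser()
parser.parse(rules)

delay = parser.crawl_delay("*")   # seconds to wait between requests
rate = parser.request_rate("*")   # named tuple with requests and seconds

print(delay)
print(rate.requests, rate.seconds)
```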

Comment

The Comment directive can be used instead of the comment delimiter (“#”).  The Comment directive is intended to be returned to the administrator of the search agent and should contain a statement of the site’s crawl policy.

Allow

The Allow directive provides an exception into an otherwise disallowed area of the site.  This may provide a means whereby a search agent can bypass restrictions in order to access deeply-nested content.
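The effect of an Allow exception can be checked with Python's standard urllib.robotparser module.  The paths and the agent name ExampleBot below are hypothetical.  Note that this particular parser applies the first matching rule, so the Allow line is placed before the broader Disallow line; other search agents may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: one page is exempted from a blocked directory.
rules = [
    "User-agent: *",
    "Allow: /doc/private/summary.html",
    "Disallow: /doc/private",
]

parser = RobotFileParser()
parser.parse(rules)

base = "http://www.example.com"
allowed = parser.can_fetch("ExampleBot", base + "/doc/private/summary.html")
blocked = parser.can_fetch("ExampleBot", base + "/doc/private/notes.html")
print(allowed, blocked)
```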

Robot-version

The Robot-version directive specifies which version of the Robot Exclusion Standard the rule set observes.  It is intended primarily for site documentation purposes.

Examples

Open Access

The wildcard User-agent indicates the record applies to all search agents and the blank Disallow value means there are no restrictions.

User-agent: *
Disallow:

Blocked Access

The wildcard User-agent indicates the record applies to all search agents and the specification of the root directory (“/”) restricts the search agent from all content.

User-agent: *
Disallow: /

Multiple Restrictions

The wildcard User-agent indicates the record applies to all search agents and the Disallow directives restrict the search agents from content in the /doc/private and /cgi-bin directories.

User-agent: *
Disallow: /doc/private
Disallow: /cgi-bin
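As a quick check of how these rules behave, the same record can be fed to Python's standard urllib.robotparser module.  The agent name ExampleBot and the page paths are hypothetical.  The sketch also shows that this parser treats a Disallow value as a path prefix, so a partial path blocks everything that begins with it.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /doc/private",
    "Disallow: /cgi-bin",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical paths to test against the record.
paths = [
    "/index.html",
    "/doc/private/report.html",
    "/doc/private-notes.html",  # also blocked: it starts with /doc/private
    "/cgi-bin/search",
]
results = {path: parser.can_fetch("ExampleBot", path) for path in paths}
for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```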

Timed Access

The wildcard User-agent indicates the record applies to all search agents and the Disallow directive restricts the search agents from content in the /doc/private directory.  The Visit-time directive advises the search agent to attempt to crawl the site only between 11:00pm and 4:30am (GMT).

User-agent: *
Disallow: /doc/private
Visit-time: 2300-0430

Working Around ISP Mail Server Blocks

Background

It is common practice for many ISPs to place restrictions on the sending and receiving of mail in order to reduce the potential for abuse by spammers.  For the typical home user, these restrictions are non-intrusive and are effective in preventing gratuitous abuse of the mail systems.  However, for users running their own mail server, these restrictions interfere with the proper operation of the service.

The restrictions originated when spammers began directly injecting their email into the ISP’s mail relay servers for distribution.  This activity caused massive headaches for the ISP: server utilization soared, bandwidth costs increased, and the ISP received a torrent of angry calls from other ISPs for relaying so much spam.  The solution was to impose access restrictions on the mail relay servers so that only authenticated users from within the ISP’s network would be able to access them.  While this was effective in protecting the ISP’s mail relay servers, the spammers shifted their tactics to work around the obstacle.

With direct access to the mail relay servers of the ISP blocked, spammers turned to large botnets of zombie computers.  These compromised systems were used to run the spammers’ own mail relay servers, and so the spam continued to flood the mail systems.  In order to reduce the effectiveness of this strategy, the ISPs blocked all traffic for port 25 (the standard communication port for mail relay service) to their customers’ systems.  This prevented these systems from proxying spam throughout the email delivery systems.

In order to run a mail server, the restrictions imposed by the ISP must be overcome.  In some cases, simply contacting the customer support of the ISP can result in a lifting of the restrictions on the connection.  However, some ISPs do not have a flexible policy in place for customers wishing to run their own mail servers and so extra effort must be made in order to work around the situation.  Typically, there are two main blocks that must be overcome for mail service to be fully functional:  inbound blocking and outbound blocking.  The majority of ISPs implement both types of blocks, but occasionally only one or the other form of blocking is implemented.

Service Requirements

In order to work around the inbound or outbound blocks imposed by the ISP, an additional service provider must be used.  DynDNS.com offers several services that can be used to work around the ISP restrictions.  There are several other service providers that may have similar offerings, so you may wish to shop around.

The MailHop Relay service offers the ability to redirect incoming mail from the standard mail relay port (port 25) to an alternate port which is not blocked.  In addition, it provides a number of other benefits including:  spam filtering, white-listing, black-listing, virus scanning, and back-up queueing (which safely stores any incoming mail if your mail server should be temporarily unavailable).

The MailHop Outbound service offers the ability to bypass any outbound restrictions your ISP may have in place.  It is able to accept mail destined for other domains on an alternate port and so avoid a block placed on the standard mail relay port.  In addition, it offers a secure, authenticated mail relay connection (via SSL), the ability to send mail from any network, outgoing virus scanning, and detailed usage graphs.

Inbound Blocking

Inbound blocking prevents other mail servers from relaying mail to your mail server.  This means that mail which originates from another domain cannot be delivered to your email system.  Mail which originates from within your domain is not affected by this type of restriction.  The MailHop Relay service offered by DynDNS works around this restriction by accepting mail on behalf of your domain and then relaying it to your domain on an alternate port.

  1. Subscribe to and set up the MailHop Relay service on the DynDNS site.
    1. Specify the destination mail server (e.g. example.com)
    2. Specify the alternate relay port (e.g. 10025)
    3. Enable any additional services desired (spam checking, virus scanning, etc.)
    4. Ensure the Mail Exchange (MX) Records are set up correctly for the domain
  2. Create a port-forwarding rule on your gateway device to forward incoming connections from the alternate relay port specified (e.g. 10025) to the local mail server and standard mail relay port (e.g. mail.example.com, port 25).
    Protocol   External Port    Destination Address    Destination Port
       TCP         10025          mail.example.com            25
  3. Verify inbound mail service is working properly by using Yahoo Mail to send a message to a test account in your domain.  It may take a few minutes for the DNS changes to propagate fully, so be patient if the mail does not appear quickly.
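Before sending a test message, it can help to confirm that the alternate relay port is actually reachable from outside your network.  The sketch below is one simple way to do that with Python's standard socket module; the function name port_open is hypothetical, and the host and port in the usage comment are the example values from the steps above.  Run it from a machine outside your own network, since a check from inside does not pass through the gateway's port-forwarding rule.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unresolvable: treat all as not reachable.
        return False


# Example usage with the values from the steps above:
#   port_open("example.com", 10025)
```

A result of False suggests checking the port-forwarding rule on the gateway or the firewall on the mail server before looking at the mail configuration itself.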

Outbound Blocking

Outbound blocking prevents mail originating from your server from being relayed to any other mail server.  Mail sent from within the domain to other members within the domain will not be affected.  The MailHop Outbound service offered by DynDNS is able to work around this restriction by accepting mail originating from your domain on an alternate port and then forwarding it on to the destination mail servers.

  1. Subscribe to the MailHop Outbound service on the DynDNS site.
  2. Open the Server Admin application and connect to the server hosting the mail service.
  3. Select the Mail service from the list of available services.
  4. Click on the “Settings” icon and select the “General” tab.
  5. Enable the “Relay outgoing mail through host” option.
  6. Enter the MailHop Outbound server address and port.
    outbound.mailhop.org:10025
  7. Enable the “Authenticate to relay with user name” option.
  8. Enter the account name and password for your DynDNS account in the appropriate fields.
  9. Restart the Mail service.
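The relay account can also be exercised independently of the mail server, which helps separate account problems from server configuration problems.  The following is a minimal sketch using Python's standard smtplib and email modules.  The host and port are taken from the steps above; the function names, the message text, and the addresses in the usage comment are hypothetical, and the sketch assumes the relay accepts a plain SMTP connection that is upgraded to an encrypted one only if the server offers STARTTLS.

```python
import smtplib
from email.message import EmailMessage

RELAY_HOST = "outbound.mailhop.org"  # from the steps above
RELAY_PORT = 10025                   # alternate relay port from the steps above


def build_test_message(sender, recipient):
    """Create a short plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Outbound relay test"
    msg.set_content("Sent through the alternate relay port.")
    return msg


def send_via_relay(msg, username, password,
                   host=RELAY_HOST, port=RELAY_PORT):
    """Authenticate to the relay and hand it the message."""
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.ehlo()
        if smtp.has_extn("starttls"):  # assumption: encrypt only if offered
            smtp.starttls()
            smtp.ehlo()
        smtp.login(username, password)
        smtp.send_message(msg)


# Example usage (hypothetical addresses and credentials):
#   msg = build_test_message("test@example.com", "someone@example.org")
#   send_via_relay(msg, "account-name", "password")
```

If this succeeds but mail sent through the server still fails, the relay settings entered in Server Admin are the more likely cause.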