An Introduction to Robots.txt

November 22nd, 2009

The robots.txt file is a robot (search engine crawler) exclusion file. Placed at the root of a website, this file tells search engines how the website is meant to be crawled.

To create a robots.txt file, you can use a plain text editor such as Notepad. Do not use a word processor such as Microsoft Word, because it adds formatting that crawlers cannot read.

Let’s look at some examples of robots.txt content:

User-agent: *
Disallow: /

In the example above, no search engine is allowed to crawl the website (assuming it respects the Robots Exclusion Protocol, of course). This is useful, for instance, when your website is still under construction and you want to prevent it from being crawled before completion.
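
To see how a well-behaved crawler would read these two lines, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name "AnyBot" and the www.example.com URL are placeholders chosen for illustration, not part of any real setup.

```python
from urllib.robotparser import RobotFileParser

# The "block everything" rules from the example above.
block_all_rules = """\
User-agent: *
Disallow: /
"""

block_all_parser = RobotFileParser()
# parse() takes the file content as a list of lines,
# so no network request is made here.
block_all_parser.parse(block_all_rules.splitlines())

# "AnyBot" and the URL are hypothetical placeholders.
home_allowed = block_all_parser.can_fetch("AnyBot", "http://www.example.com/")
page_allowed = block_all_parser.can_fetch("AnyBot", "http://www.example.com/about.html")

print(home_allowed)  # False
print(page_allowed)  # False
```

Keep in mind that this only shows what a compliant crawler would decide. The file itself does not enforce anything.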

Here’s another example. In this case, Yahoo’s web crawler, identified as Yahoo! Slurp, would be restricted from crawling the content of the /search folder of a website.

User-agent: Yahoo! Slurp
Disallow: /search

Multiple instructions, each group addressed to a different crawler, can be included in a single robots.txt file:

User-agent: googlebot
Disallow: /admin
Disallow: /support
Allow: /products

User-agent: msnbot
Disallow: /forum

User-agent: Yahoo! Slurp
Allow: /
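
A quick way to check that each group applies only to the crawler it names is to feed the file above to Python's standard urllib.robotparser module. This is only a sketch: the www.example.com URLs are placeholders, and real search engines may resolve overlapping Allow and Disallow rules differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The multi-crawler file from the example above.
multi_rules = """\
User-agent: googlebot
Disallow: /admin
Disallow: /support
Allow: /products

User-agent: msnbot
Disallow: /forum

User-agent: Yahoo! Slurp
Allow: /
"""

multi_parser = RobotFileParser()
multi_parser.parse(multi_rules.splitlines())

site = "http://www.example.com"  # placeholder domain

# Each tuple is (crawler name, path); the paths are illustrative.
checks = [
    ("googlebot", "/admin/login"),
    ("googlebot", "/products/widget"),
    ("msnbot", "/forum/thread"),
    ("msnbot", "/admin/login"),
    ("Yahoo! Slurp", "/forum/thread"),
]

results = {
    (agent, path): multi_parser.can_fetch(agent, site + path)
    for agent, path in checks
}

for (agent, path), allowed in results.items():
    print(agent, path, "allowed" if allowed else "blocked")
```

Note that msnbot may still fetch /admin here, because the Disallow: /admin line belongs to the googlebot group only.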

The robots.txt file can also be used to specify the location of the sitemap.xml file, although this is not supported by every search engine:

Sitemap: http://www.thewebhostinghero.com/sitemap.xml.gz
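
As a rough illustration of how a crawler might pick up this line, the sketch below reads it with Python's standard urllib.robotparser module. It assumes Python 3.8 or later, where the site_maps() method is available.

```python
from urllib.robotparser import RobotFileParser

# The Sitemap line from the example above.
sitemap_rules = """\
Sitemap: http://www.thewebhostinghero.com/sitemap.xml.gz
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(sitemap_rules.splitlines())

# site_maps() returns a list of sitemap URLs,
# or None when the file declares no sitemap.
sitemap_urls = sitemap_parser.site_maps()
print(sitemap_urls)
```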

Extensions to the robots.txt standard have been proposed to limit the crawling rate and schedule of search engines, but they have not been made official yet:

Request-rate: 2/5
Visit-time: 0000-0500

In this case, crawlers would be instructed to fetch a maximum of 2 pages every 5 seconds and to access the website only between midnight (00:00) and 5 A.M. (05:00).
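
To make the arithmetic concrete, here is a small sketch of how a crawler could interpret these two values. The helper names (parse_request_rate, parse_visit_time, may_visit) are made up for this example, and the code assumes the crawler compares the window against its own clock in the time zone the site owner intended.

```python
from datetime import time

def parse_request_rate(value):
    """Turn '2/5' into (pages, seconds)."""
    pages, seconds = value.split("/")
    return int(pages), int(seconds)

def parse_visit_time(value):
    """Turn '0000-0500' into a (start, end) pair of time objects."""
    start, end = value.split("-")
    return (
        time(int(start[:2]), int(start[2:])),
        time(int(end[:2]), int(end[2:])),
    )

def may_visit(now, window):
    """Return True if 'now' falls inside the visit window."""
    start, end = window
    if start <= end:
        return start <= now <= end
    # The window wraps past midnight, e.g. 2200-0300.
    return now >= start or now <= end

pages, seconds = parse_request_rate("2/5")
window = parse_visit_time("0000-0500")

# Average pause between requests needed to stay within the rate.
average_delay = seconds / pages
print(average_delay)                     # 2.5
print(may_visit(time(3, 30), window))    # True
print(may_visit(time(12, 0), window))    # False
```

In other words, 2/5 works out to one page every 2.5 seconds on average, and a visit at 3:30 A.M. falls inside the allowed window while one at noon does not.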

If you need to find a particular user-agent, visit http://www.user-agents.org
