Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
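As a rough illustration of the check described above, the sketch below uses Python's standard urllib.robotparser module to parse the two-line file and ask whether a page may be fetched. The robot name "ExampleBot" and the helper may_fetch are made up for this example, and the file contents are supplied as a string rather than downloaded from a server.

```python
from urllib.robotparser import RobotFileParser

# Contents of the /robots.txt file from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical robot name used only for illustration.
allowed = may_fetch(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html")
print(allowed)
```

A well-behaved robot would run a check of this kind before every request and skip any URL for which the answer is negative.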
There are two important considerations when using /robots.txt:
- robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
- the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
How To Use It
The "/robots.txt" file is a text file with one or more records. It usually contains a single record that looks like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
In this example, three directories are excluded.
Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
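To make the record structure concrete, here is a minimal sketch of how a file could be split into records, assuming only the rules stated above: blank lines separate records, and each remaining line holds one "Field: value" pair. The function name parse_records and the sample text are invented for this example; it is not how any particular robot parses the file.

```python
def parse_records(text):
    """Split robots.txt text into records, each a list of (field, value) pairs."""
    records = []
    current = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line ends the current record, if one is open.
            if current:
                records.append(current)
                current = []
            continue
        field, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line; ignored in this sketch
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records


# Hypothetical sample file with two records separated by a blank line.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
"""

records = parse_records(SAMPLE)
for record in records:
    print(record)
```

With this reading, a line such as "Disallow: /cgi-bin/ /tmp/" yields the single value "/cgi-bin/ /tmp/", which is why two prefixes on one line do not work as two exclusions.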
Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
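The effect of having no globbing can be sketched as plain prefix matching: a Disallow value is compared, character for character, against the start of the requested path. The function is_excluded below is a hypothetical helper written for this illustration, and the paths are made-up examples.

```python
def is_excluded(path, disallow_prefixes):
    """Return True if path starts with any non-empty Disallow prefix.

    The comparison is literal: '*' has no special meaning in a prefix.
    """
    return any(
        bool(prefix) and path.startswith(prefix)
        for prefix in disallow_prefixes
    )


# A plain prefix excludes everything beneath it.
print(is_excluded("/tmp/cache.html", ["/tmp/"]))

# With a literal reading, "/tmp/*" only matches paths that really contain "/tmp/*".
print(is_excluded("/tmp/cache.html", ["/tmp/*"]))

# "*.gif" is not a prefix of any ordinary path, so it excludes nothing here.
print(is_excluded("/images/logo.gif", ["*.gif"]))
```

Under this reading, the wildcard-style lines quoted above do not do what their author probably intended.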
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /
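One way to check that a two-record file like this behaves as intended is to run it through Python's standard urllib.robotparser module. This is only a sketch: "OtherBot" is a hypothetical robot name, the URL is the example address used earlier, and the way Python matches robot names is a detail of that module rather than something stated here.

```python
from urllib.robotparser import RobotFileParser

# Two records: the first admits one robot, the second shuts out the rest.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.example.com/welcome.html"
google_allowed = parser.can_fetch("Google", url)
other_allowed = parser.can_fetch("OtherBot", url)  # hypothetical robot name

print("Google:", google_allowed)
print("OtherBot:", other_allowed)
```

The blank line between the two records matters: without it, the file no longer reads as two separate records.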
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
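If the list of pages is long, a record of this kind could be generated rather than typed by hand. The following is a small sketch under stated assumptions: the helper build_exclusion_record and the page "/~joe/index.html" (the one file to keep visible) are invented for the example.

```python
def build_exclusion_record(all_paths, keep):
    """Build a record that disallows every path in all_paths except keep."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in all_paths if path != keep]
    return "\n".join(lines)


# Hypothetical directory listing; only index.html should stay visible.
pages = [
    "/~joe/index.html",
    "/~joe/junk.html",
    "/~joe/foo.html",
    "/~joe/bar.html",
]

record = build_exclusion_record(pages, keep="/~joe/index.html")
print(record)
```

The generated text contains one "Disallow" line per page and no blank lines, so it forms a single valid record.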