As SEO practitioners, we often have good reason to exclude a certain search engine or site spider from looking at our pages. For example, some site owners who do not want their pages archived over time by Archive.org will take steps to block its crawler. That's where robots.txt comes in: place one on your server, and most well-behaved spiders will follow your directives about what they are allowed to crawl on your site. If you wish to prevent a specific site spider from indexing your website's content, configure your robots.txt file this way:
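A minimal sketch, using the Internet Archive's crawler as the example (it has historically identified itself with the user-agent token ia_archiver):

```text
# Tell the Internet Archive's crawler to stay out of the entire site
User-agent: ia_archiver
Disallow: /
```

The User-agent line names the spider the rule applies to, and `Disallow: /` tells that spider to stay away from every path on the site.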
Here are some of the most common site robot names that you might want to exclude:
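As an illustration, a robots.txt that excludes a few well-known crawlers by their user-agent tokens might look like the sketch below (Slurp is Yahoo's crawler, Baiduspider is Baidu's, and YandexBot is Yandex's):

```text
# Block Yahoo's crawler
User-agent: Slurp
Disallow: /

# Block Baidu's crawler
User-agent: Baiduspider
Disallow: /

# Block Yandex's crawler
User-agent: YandexBot
Disallow: /
```

Each User-agent block is matched independently, so you can list as many named spiders as you need, each with its own Disallow rules.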
Regardless of what you're trying to do with robots.txt, the most important thing is to get it right. A small typo in the file can have disastrous results, such as accidentally blocking every crawler from your entire site. So use the following resources to set yours up properly.
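One quick way to sanity-check your rules before deploying them is Python's standard-library robots.txt parser, `urllib.robotparser`. The sketch below feeds it the same example rules used above and confirms which spiders are blocked:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules to test, as they would appear on the server
rules = """\
User-agent: ia_archiver
Disallow: /
"""

# Parse the rules from a list of lines instead of fetching a live URL
rp = RobotFileParser()
rp.parse(rules.splitlines())

# ia_archiver is blocked from everything; other spiders are unaffected
print(rp.can_fetch("ia_archiver", "/page.html"))  # False
print(rp.can_fetch("Googlebot", "/page.html"))    # True
```

Running checks like this against the user-agents you care about catches typos in directive names and paths before a misconfigured file ever reaches production.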
For thorough lists of robots you might want to exclude, check these sites:
For thorough documentation on how robots.txt files work and how to set them up properly, check out these sites: