Feb 11, 2007
Blocking SE Spiders
I think there are certain pages, files, and folders that every webmaster should block from the search engine (SE) spiders.
Dan Crow, Product Manager at Google, just posted this on the official Google Blog:
I’m often asked about how Google and search engines work. One key question is: how does Google know what parts of a website the site owner wants to have show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages should appear in search results.
The key is a simple file called robots.txt that has been an industry standard for many years.
Dan gives some examples of how to control spiders using robots.txt or the robots meta tag.
Why would you want to limit the SEs? Here is one example: I have a folder that contains confidential customer information, and I do not want the SEs to crawl it and publish that information on the Web.
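To make the folder example concrete, here is a minimal sketch of a robots.txt rule and a quick way to check it with Python's standard urllib.robotparser module. The folder name /customers/ and the example.com URLs are assumptions made up for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: the /customers/ folder name is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /customers/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/customers/list.html",
    ):
        print(url, "->", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Keep in mind that robots.txt only asks well-behaved spiders to stay out. It is not a password, so truly confidential files should also be protected on the server itself.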
Most sites need to block their https (secure) pages, something that Jag is not doing at this time. Do not let the spiders crawl your https pages unless your entire site is https.
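One way to do this, sketched below, relies on the fact that spiders request robots.txt separately for the http and https versions of a host, so the server can answer each with a different file. The two file bodies, the function names, and the example.com URLs are all hypothetical, and how you actually serve a different file per protocol depends on your web server.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical file bodies: allow everything on http, block everything on https.
HTTP_ROBOTS = "User-agent: *\nDisallow:\n"
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"


def robots_txt_for(scheme: str) -> str:
    """Pick the robots.txt body to serve for a request's scheme."""
    return HTTPS_ROBOTS if scheme.lower() == "https" else HTTP_ROBOTS


def can_crawl(url: str, user_agent: str = "Googlebot") -> bool:
    """Check a URL against the robots.txt that would be served for its scheme."""
    parser = RobotFileParser()
    parser.parse(robots_txt_for(urlsplit(url).scheme).splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_crawl("http://www.example.com/page.html"))
    print(can_crawl("https://www.example.com/checkout.html"))
```

An empty Disallow line means nothing is blocked, while Disallow: / blocks the whole site for that protocol.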
All the major SEs will obey the robots.txt file or the on-page robots meta tag in the head section of the document.
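For the on-page option, the tag goes in the head section and looks like the one in the sample page below. The short sketch that follows uses Python's standard html.parser module to read the tag back out, which is a handy way to confirm a page really carries it. The sample page and the names RobotsMetaParser and robots_directives are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: the title and body are placeholders.
PAGE = """<html><head>
<title>Customer list</title>
<meta name="robots" content="noindex, nofollow">
</head><body>...</body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives present in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(sorted(robots_directives(PAGE)))
```

Here noindex asks the SEs not to list the page, and nofollow asks them not to follow the links on it.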