Managing Search Crawlers using robots.txt

Do you really want Google to index your Master Page Gallery? One of the things that often gets overlooked when setting up a public-facing SharePoint site is that sooner or later you are likely to get visits from over-enthusiastic search engines. In truth, sometimes this owes more to hope than expectation, but we can't all be the next "myface". The problem is that as well as the content we intended to be publicly visible, SharePoint creates a lot of pages that should be kept firmly hidden from view in the back-stage area. You should only see them if you are responsible for shifting the scenery around. But search engines are greedy, and they will try to grab anything they can find.

Now there is a separate issue about making these pages inaccessible to anonymous users (which is not as big a deal as it sounds - they can't actually do anything; it just feels wrong if they get to the View All Site Content page). But what we certainly don't want is those pages, which will most likely be unbranded, putting in an appearance in the Bingle search index. This can even be a problem for SharePoint Search within your intranet, so the issue isn't confined to public-facing sites.

To keep those pesky spiders under control we need to put a file called robots.txt in the root directory of the site, so that it can be retrieved at http://www.site.com/robots.txt. The browser-based user interface for SharePoint lets us manage files in SharePoint lists and document libraries, but unfortunately it does not include a means of manipulating files in the root, such as default.aspx. There are two ways of doing this. The first is to open the site in SharePoint Designer. This gives us a behind-the-scenes view of the site and some powerful editing tools, but it may seem overkill if we just want to add a file. The second way is to use the WebDAV protocol (Web Distributed Authoring and Versioning). To do this, open Windows Explorer and enter the following in the address bar: \\www.site.com\DavWWWRoot. This opens the root of the site as if it were a directory on your computer. I can never remember the DavWWWRoot thing - I always type in WWWDAV or WebDAVRoot or I use forward slashes.
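
If you would rather script that step than use Explorer, the same UNC path can be written to directly. The sketch below is only an illustration, not the method described above: it assumes you are on Windows with the WebClient service (the WebDAV redirector) running, that www.site.com stands in for your own site address, and that your account is allowed to write to the root.

from pathlib import Path

# Placeholder content - the full ruleset is shown further down the page.
content = "User-Agent: *\nDisallow: /_layouts/\n"

# Write straight to the WebDAV root of the (assumed) site address.
Path(r"\\www.site.com\DavWWWRoot\robots.txt").write_text(content, encoding="ascii")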

Once we have an Explorer window open we can create the robots.txt file as if we were working locally. I suggest you start with the following content for robots.txt.

User-Agent: *
Disallow: /_layouts/
Disallow: /_catalogs/
Disallow: /_cts/
Disallow: /_private/
Disallow: /Lists/
Disallow: /m/
Disallow: /ReusableContent/
Disallow: /WorkflowTasks/
Disallow: /SiteCollectionDocuments/
Disallow: /SiteCollectionImages/
Disallow: /SiteAssets/
Disallow: /Documents/Forms/
Disallow: /Pages/Forms/
Disallow: /Search/
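
As a quick sanity check before you publish the file, you can feed these rules to Python's built-in robots.txt parser and confirm that the back-stage paths are blocked while ordinary published pages are not. The URLs below are placeholders for your own site, and the ruleset is abridged.

from urllib import robotparser

# Abridged copy of the rules above - paste in the full set for a real check.
rules = """User-Agent: *
Disallow: /_layouts/
Disallow: /_catalogs/
Disallow: /Pages/Forms/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Back-stage pages should be disallowed for a generic crawler...
print(rp.can_fetch("*", "http://www.site.com/_layouts/viewlsts.aspx"))  # False
# ...while published pages remain crawlable.
print(rp.can_fetch("*", "http://www.site.com/Pages/default.aspx"))      # True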

When you have uploaded it, go to your website as an anonymous user and check that you can retrieve the file at http://www.site.com/robots.txt. Once you have this working you can fine-tune it as you discover other files or folders that are being crawled but shouldn't be. You can study your server logs to see what the various search engines are crawling, or you can use a site-specific search term to see what is already in the index; for Google you would use something like "site:www.site.com". Better still, the main search engines have excellent tools for webmasters (e.g. Google Webmaster Tools). To use these you will probably need to put a special file in the root of your site (to prove that you own it), and you can do that using the same technique described above for the robots.txt file.
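
If you prefer to check from a script rather than a browser, a plain unauthenticated HTTP request is a reasonable approximation of what an anonymous crawler sees. Again, the URL is a placeholder for your own site.

import urllib.request

# No credentials are supplied, so this is roughly what an anonymous crawler
# gets back. Expect a 200 status and the rules you uploaded.
with urllib.request.urlopen("http://www.site.com/robots.txt") as response:
    print(response.status)
    print(response.read().decode("utf-8"))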

One more thing: if you are using SharePoint Search to crawl a SharePoint site, be aware that it will detect that the site is SharePoint even if you create a "Web Sites" type content source. SharePoint Search will then crawl the site in the same way as it would crawl a local SharePoint site, ignoring robots.txt. To get around this, create a crawl rule for the site, set it to include, and select "Crawl SharePoint content as http pages". You might also need to set the "Crawl complex URLs" flag if you are using query string parameters to render different content (as I do, for example - look at the URL of this page). One other thing to be aware of with SharePoint Search is that it caches the robots.txt file for 24 hours, so if you change the file you will also need to restart the search service before you re-crawl if you want to see the effects.