Recently I’ve had a customer who used SharePoint 2010 to crawl a large file share with over 9 TB of data. The content crawling account was set up with access to all documents on the file share, so all of them landed in the SharePoint search index.
So far so good.
However, after a while, the owners of several folders on the file share containing sensitive data decided to remove their documents from the search index, so they simply denied the crawler account the permissions to access their folder. The TechNet article Best practices for using crawl logs (SharePoint Server 2010) states the following:
“When a crawler cannot find an item that exists in the index because the URL is obsolete or it cannot be accessed due to a network outage, the crawler reports an error for that item in that crawl. If this continues during the next three crawls, the item is deleted from the index. For file-share content sources, items are immediately deleted from the index when they are deleted from the file share.”
Sounds simple: the crawler account can no longer access the files, so the crawler should conclude that they are gone and remove them from the index, right?
Wrong. We did several incremental crawls, followed by a full crawl, but to no avail: the items were still in the index and kept showing up in the search center. Only an index reset helped, but that is not a feasible solution when you have 9 TB of data to recrawl.
As it turns out, the crawler now started getting “Access Denied” errors when attempting to recrawl the documents, which is actually expected, since the files are still there, just no longer accessible. In this case SharePoint 2010 behaves a little differently: it will keep trying for 30 crawls AND 30 days, and only then give up and remove the items from the index.
So what if you don’t want to wait 30 days for the items to be removed, you might ask? Thankfully, there are several policies that tell the crawler when to remove items if it encounters an error while crawling, and you can adjust those policies by changing several properties using PowerShell:
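For example, something along these lines should do the trick. This is a sketch assuming a single search service application and the SharePoint 2010 Management Shell; the values 3 and 10 are only examples:

```powershell
# Load the SharePoint snap-in if you are in a plain PowerShell console
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Get the search service application (assumes there is only one; otherwise pass its name)
$ssa = Get-SPEnterpriseSearchServiceApplication

# Delete an item after more than 3 consecutive failed crawls...
$ssa.SetProperty("ErrorDeleteCountAllowed", 3)

# ...AND more than 10 hours since the first failure
$ssa.SetProperty("ErrorDeleteIntervalAllowed", 10)

# Verify the current values
$ssa.GetProperty("ErrorDeleteCountAllowed")
$ssa.GetProperty("ErrorDeleteIntervalAllowed")
```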
You will have to repeat this for all search service applications that you might have.
When the crawler encounters an access denied or a file not found error, the item is deleted from the index if the error was encountered in more than ErrorDeleteCountAllowed (the default is 30) consecutive crawls AND the duration since the first error is greater than ErrorDeleteIntervalAllowed hours (the default is 720 hours).
If either condition is not met, the item is retried; only when both are met is it deleted from the index. In this example the item will be removed from the index after more than 3 unsuccessful crawls due to an access denied or file not found error AND more than 10 hours after the first failed crawl (both conditions must be met). You might need to read that once more to get it.
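The decision logic described above can be sketched roughly like this (hypothetical pseudocode, not actual crawler code; the variable names are made up):

```powershell
# Both thresholds must be exceeded before the item is removed from the index
if (($consecutiveFailedCrawls -gt $ErrorDeleteCountAllowed) -and
    ($hoursSinceFirstError -gt $ErrorDeleteIntervalAllowed)) {
    # both conditions met -> remove the item from the index
}
else {
    # at least one condition not yet met -> keep the item and retry next crawl
}
```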
The ErrorDeleteCountAllowed and ErrorDeleteIntervalAllowed properties apply to both incremental and full crawls.
In addition, you can use the DeleteUnvisitedMethod property to specify which items get deleted during a full crawl only. Setting this property to 0 will immediately remove from the index all items that are no longer found in the current full crawl.
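Setting it follows the same pattern as the other properties (again a sketch, assuming the SharePoint 2010 Management Shell and a single search service application):

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# 0 = immediately delete items that were not found during the current full crawl
$ssa.SetProperty("DeleteUnvisitedMethod", 0)
```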
Be careful when setting these properties if your file share suffers from occasional network connection problems. Setting them to values that are too low might lead to a large number of items being removed from the index because of a temporary network outage, only to be reindexed once the connection is restored. This could kill your search performance, so choose the values wisely!
You can find the full explanation of these policies and other properties that you can use to fine tune the index cleanup logic under Manage deletion of index items (SharePoint Server 2010).
Thanks to Sorin Stanila and our escalation engineers for helping solve this problem!