When you plan crawl schedules, consider the following best practices:
-
Group start addresses in content sources based on similar availability and with acceptable overall resource usage for the servers that host the content.
-
Schedule administration changes that require a full crawl to occur shortly before the planned schedule for full crawls. For example, we recommend that you attempt to schedule the creation of the crawl rule before the next scheduled full crawl so that an additional full crawl is not necessary.
-
Base simultaneous crawls on the capacity of the index server to crawl them. For best performance, we recommend that you stagger your crawl schedules so that the index server does not crawl multiple content sources at the same time.
The performance of the index server and the servers hosting the content determines the extent to which crawls can be overlapped. You can develop a crawl scheduling strategy over time as you
become familiar with the typical crawl durations for each content source.
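The staggering advice above can be checked with a small planning script. The following is a minimal sketch, not part of the product: the content source names, start times, and durations are hypothetical, and the durations stand in for estimates that you would refine from observed crawl times.

```python
from datetime import datetime, timedelta

def find_overlaps(crawls):
    """Return pairs of content source names whose crawl windows overlap.

    crawls: list of (name, start, estimated_duration) tuples.
    """
    windows = sorted(
        ((start, start + duration, name) for name, start, duration in crawls),
        key=lambda w: w[0],
    )
    overlaps = []
    for i, (_start_a, end_a, name_a) in enumerate(windows):
        for start_b, _end_b, name_b in windows[i + 1:]:
            if start_b >= end_a:
                break  # windows are sorted by start, so later ones cannot overlap
            overlaps.append((name_a, name_b))
    return overlaps

def stagger(crawls, first_start):
    """Assign back-to-back start times so that no two crawls run together.

    crawls: list of (name, estimated_duration) tuples, in the desired order.
    """
    schedule = []
    next_start = first_start
    for name, duration in crawls:
        schedule.append((name, next_start, duration))
        next_start += duration
    return schedule
```

Running find_overlaps against a proposed schedule lists the content sources that would be crawled at the same time. The stagger function shows the simplest alternative, in which each crawl starts when the previous one is expected to finish.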
Reasons for a search services administrator to do a full crawl include:
-
One or more hotfixes or service packs were installed on servers in the farm. See the instructions for the hotfix or service pack for more information.
-
An SSP administrator added a new managed property.
-
To re-index ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007 sites. The crawler cannot discover when ASPX pages on Windows SharePoint Services 3.0 or Office SharePoint Server 2007 sites have changed. Because of this, incremental
crawls do not re-index views or home pages when individual list items are deleted. We recommend that you periodically do full crawls of sites that contain ASPX files to ensure that these pages are re-indexed.
-
To detect security changes that were made on a file share after the last full crawl of the file share.
-
To resolve consecutive incremental crawl failures. In rare cases, if an incremental crawl fails one hundred consecutive times at any level in a repository, the index server removes the affected content from the index.
-
Crawl rules have been added, deleted, or modified.
-
To repair a corrupted index.
-
The search services administrator has created one or more server name mappings.
-
The account assigned to the default content access account or crawl rule has changed.
The system does a full crawl even when an incremental crawl is requested under the following circumstances:
-
An SSP administrator stopped the previous crawl.
-
A content database was restored from backup. If you are running the Infrastructure Update for Microsoft Office Servers, you can use the restore operation of the stsadm command-line tool to change whether a content database restore causes a full crawl.
-
A farm administrator has detached and reattached a content database.
-
A full crawl of the site has never been done.
-
The change log does not contain entries for the addresses that are being crawled. Without entries in the change log for the items being crawled, incremental crawls cannot occur.
-
The account assigned to the default content access account or crawl rule has changed.
-
The index is corrupted. Depending upon the severity of the corruption, the system might attempt to perform a full crawl when it detects corruption in the index.
You can adjust schedules after the initial deployment based on the performance and capacity of servers in the farm and the servers hosting content.
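The conditions under which the system performs a full crawl even when an incremental crawl is requested can be summarized as a simple decision function. This is an illustrative sketch only: the CrawlState class and its field names are hypothetical, and the product does not expose its state in this form.

```python
from dataclasses import dataclass

@dataclass
class CrawlState:
    """Hypothetical flags describing a content source before a crawl."""
    previous_crawl_stopped: bool = False
    content_database_restored: bool = False    # may be overridden by stsadm restore options
    content_database_reattached: bool = False
    full_crawl_completed_before: bool = True
    change_log_has_entries: bool = True
    access_account_changed: bool = False
    index_corruption_detected: bool = False

def effective_crawl(requested, state):
    """Return (crawl_type, reasons) for a requested "full" or "incremental" crawl."""
    reasons = []
    if state.previous_crawl_stopped:
        reasons.append("previous crawl was stopped")
    if state.content_database_restored:
        reasons.append("content database was restored")
    if state.content_database_reattached:
        reasons.append("content database was reattached")
    if not state.full_crawl_completed_before:
        reasons.append("no full crawl has been done")
    if not state.change_log_has_entries:
        reasons.append("change log has no entries")
    if state.access_account_changed:
        reasons.append("content access account changed")
    if state.index_corruption_detected:
        reasons.append("index corruption detected")
    if requested == "full" or reasons:
        return "full", reasons
    return "incremental", reasons
```

A function like this can help estimate, before a maintenance window, which scheduled incremental crawls are likely to run as full crawls and therefore take longer.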
Limit or increase the quantity of content that is crawled
For each content source, you can select how extensively to crawl the start addresses in that content source. You also specify the behavior of the crawl, sometimes called the crawl settings. The options you can choose for a particular content source vary
somewhat based on the content source type that you select. However, most options determine how many levels deep in the hierarchy from each start address listed in the content source are crawled. Note that this behavior is applied to all start addresses in
a particular content source. If you need to crawl some sites at deeper levels, you can create additional content sources that encompass those sites.
The options available in the properties for each content source vary depending upon the content source type that is selected. The following table describes the crawl settings options for each content source type.
Content source type | Crawl settings options
SharePoint sites | Everything under the host name for each start address; Only the SharePoint site of each start address
Web sites | Only within the server of each start address; Only the first page of each start address; Custom: specify page depth and number of server hops (the default setting for this option is unlimited page depths and server hops)
File shares | The folder and all subfolders of each start address; Only the folder of each start address
Exchange public folders | The folder and all subfolders of each start address; Only the folder of each start address
Business data | Crawl entire Business Data Catalog; Crawl selected applications
As the preceding table shows, shared services administrators can use crawl setting options to limit or increase the quantity of content that is crawled.
The following table describes best practices when configuring crawl setting options.
For this content source type | If this applies | Use this crawl setting option
SharePoint sites | You want to include the content on the site itself. -or- You do not want to include the content available on subsites, or you want to crawl them on a different schedule. | Crawl only the SharePoint site of each start address
SharePoint sites | You want to include the content on the site itself. -or- You want to crawl all content under the start address on the same schedule. | Crawl everything under the host name of each start address
Web sites | Content on the site itself is relevant. -or- Content available on linked sites is not likely to be relevant. | Crawl only within the server of each start address
Web sites | Relevant content is on only the first page. | Crawl only the first page of each start address
Web sites | You want to limit how deep to crawl the links on the start addresses. | Custom: specify the number of pages deep and number of server hops to crawl. We recommend you start with a small number on a highly connected site because specifying more than three pages deep or more than three server hops can crawl the entire Internet.
File shares, Exchange public folders | Content available in the subfolders is not likely to be relevant. | Crawl only the folder of each start address
File shares, Exchange public folders | Content in the subfolders is likely to be relevant. | Crawl the folder and all subfolders of each start address
Business data | All applications that are registered in the Business Data Catalog contain relevant content. | Crawl the entire Business Data Catalog
Business data | Not all applications that are registered in the Business Data Catalog contain relevant content. -or- You want to crawl some applications on a different schedule. | Crawl selected applications
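To see how page depth and server hops limit a Web site crawl, consider the following sketch. It walks a small in-memory link graph rather than real sites, and it makes two assumptions that the product documentation does not confirm: page depth is counted as the number of links followed from the start address, and a server hop is counted each time a link leads to a different host name. All URLs are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl_scope(links, start, max_depth=None, max_hops=None):
    """Return the set of URLs reachable from start within the given limits.

    links: dict mapping a URL to the list of URLs that it links to.
    max_depth: links followed from the start address (None means unlimited).
    max_hops: host changes allowed along a path (None means unlimited).
    """
    seen = {start}
    queue = deque([(start, 0, 0)])
    while queue:
        url, depth, hops = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            changed_host = urlsplit(target).netloc != urlsplit(url).netloc
            next_hops = hops + (1 if changed_host else 0)
            if max_hops is not None and next_hops > max_hops:
                continue
            if target in seen:
                continue  # simplification: only the first path found is kept
            seen.add(target)
            queue.append((target, depth + 1, next_hops))
    return seen
```

Under these assumptions, a depth of zero corresponds to crawling only the first page of each start address, and zero server hops with unlimited depth corresponds to crawling only within the server of each start address. Because each hop can multiply the number of reachable hosts, small increases in either limit can enlarge the scope dramatically on highly connected sites.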
Plan file-type inclusions and IFilters
Content is only crawled if the relevant file name extension is included in the file-type inclusions list and an IFilter is installed on the index server that supports those file types. Several file types are included automatically during initial installation.
When you plan for content sources in your initial deployment, determine whether content you want to crawl uses file types that are not included. If file types are not included, you must add those file types on the Manage File Types page during deployment and
ensure that an IFilter is installed and registered to support that file type.
Office SharePoint Server 2007 provides several IFilters, and more are available from Microsoft and third-party vendors. For information about how to install and register additional IFilters that are available from Microsoft, see
How to register Microsoft Filter Pack with SharePoint Server 2007 and with Search Server 2008 (http://go.microsoft.com/fwlink/?LinkId=110532). If necessary, software developers can create IFilters for new file types.
On the other hand, if you want to exclude certain file types from being crawled, you can delete the file name extension for that file type from the file type inclusions list. Doing so excludes file names that have that extension from being crawled.
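The two conditions for a file to be crawled (its extension is in the file-type inclusions list, and an IFilter that supports it is installed) can be expressed as a single check. The following sketch is for planning purposes only. The extension sets are made-up values, not the product defaults, and the function does not query a real server.

```python
from pathlib import PurePosixPath

def is_crawlable(file_name, included_extensions, ifilter_extensions):
    """Return True if the file's extension is both included and backed by an IFilter.

    included_extensions: extensions on the file-type inclusions list.
    ifilter_extensions: extensions for which an IFilter is installed and registered.
    Both are sets of lowercase extensions without the leading dot.
    """
    extension = PurePosixPath(file_name).suffix.lower().lstrip(".")
    if not extension:
        return False  # files without an extension cannot match the inclusions list
    return extension in included_extensions and extension in ifilter_extensions

def missing_ifilters(included_extensions, ifilter_extensions):
    """List included extensions that have no supporting IFilter."""
    return sorted(included_extensions - ifilter_extensions)
```

Comparing the two sets with missing_ifilters during planning highlights file types that are listed for inclusion but cannot yet be indexed because no IFilter supports them.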
For a table that lists which file types are supported by the IFilters that are installed by default and which file types are enabled on the
Manage File Types page by default, see
IFilters in Office SharePoint Server 2007.
IFilters and Microsoft Office OneNote
An IFilter is not provided for the .one file name extension used by Microsoft Office OneNote. If you want users to be able to search content in Office OneNote files, you must install an IFilter for OneNote. To do this, you must do one of the following.
Limit or exclude content by using crawl rules
When you add a start address to a content source and accept the default behavior, all subsites or folders below that start address are crawled unless you exclude them by using one or more crawl rules.
For more information about crawl rules, see
Plan crawl rules later in this article.
Other considerations when planning content sources
You cannot crawl the same addresses using multiple content sources. For example, if you use a particular content source to crawl a site collection and all of its subsites, you cannot use a different content source to crawl one of those subsites separately
on a different schedule. To accommodate this restriction, you might need to crawl some of these sites separately. Consider the following scenario:
The SSP administrator at Contoso wants to crawl http://contoso, which contains the http://contoso/sites/site1 and http://contoso/sites/site2 subsites. He wants to crawl http://contoso/sites/site2 on a different schedule than the other sites. To achieve this,
he adds the addresses http://contoso and http://contoso/sites/site1 to one content source and selects the
Crawl only the SharePoint site of each start address setting. He then adds http://contoso/sites/site2 to another content source and specifies a different schedule for that content source.
In addition to crawl schedules, there are other things to consider when planning content sources. For example, whether you group start addresses in a single content source or create additional content sources to crawl those start addresses depends largely
upon administration considerations. Administrators often make changes that require a full update of a particular content source. Changes to a content source require a full crawl of that content source. To make administration easier, organize content sources
in such a way that updating content sources, crawl rules, and crawling content is convenient for administrators.
Content sources summary
Consider the following when planning your content sources:
-
A particular content source can be used to crawl only one of the following content types: SharePoint sites, Web sites that are not SharePoint sites, file shares, Exchange public folders, Lotus Notes databases, and business data.
-
Shared services administrators can create up to 500 content sources in each SSP, and each content source can contain up to 500 start addresses. To keep administration as simple as possible, you should create only as many content sources as you absolutely
need.
-
Each URL in a particular content source must be of the same content source type.
-
For a particular content source, you can choose how deep to crawl from the start addresses. These configuration settings apply to all start addresses in the content source. The available choices on how deep you can crawl the start addresses differ depending
upon the content source type that is selected.
-
You can schedule when to perform either a full or incremental crawl for the entire content source. For more information about scheduling crawls, see "Full and incremental crawl schedules" earlier in this article.
-
Shared services administrators can modify the default content source, create additional content sources for crawling other content, or both. For example, they can configure the default content source to also crawl content on a different server farm or they
can create a new content source to crawl other content.
-
To effectively crawl all the content needed by your organization, use as many content sources as make sense for the types of sources you want to crawl, and for the frequency at which you plan to crawl them.
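Some of the constraints summarized above can be checked against a planning worksheet before you create any content sources. The following sketch assumes a hypothetical dictionary layout for the plan. It checks only the limits of 500 content sources and 500 start addresses, and that no start address is listed in more than one content source. It does not detect overlap caused by crawl settings, such as one content source crawling everything under a host name.

```python
MAX_CONTENT_SOURCES = 500   # per SSP, as stated in the summary above
MAX_START_ADDRESSES = 500   # per content source

def validate_plan(content_sources):
    """Return a list of problems found in a content source plan.

    content_sources: dict mapping a content source name to a dict with
    "type" (content source type) and "addresses" (list of start addresses).
    """
    problems = []
    if len(content_sources) > MAX_CONTENT_SOURCES:
        problems.append(f"{len(content_sources)} content sources exceed the limit")
    owner = {}
    for name, source in content_sources.items():
        addresses = source["addresses"]
        if len(addresses) > MAX_START_ADDRESSES:
            problems.append(f"{name!r} has {len(addresses)} start addresses")
        for address in addresses:
            key = address.rstrip("/").lower()  # treat trailing slash and case as equal
            if key in owner and owner[key] != name:
                problems.append(f"{address} appears in both {owner[key]!r} and {name!r}")
            owner.setdefault(key, name)
    return problems
```

For the Contoso scenario described earlier, a plan with http://contoso and http://contoso/sites/site1 in one content source and http://contoso/sites/site2 in another passes these checks.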
Plan for authentication
When the crawler accesses the start addresses that are listed in content sources, the crawler must be authenticated by and granted access to the servers that host that content. This means that the domain account used by the
crawler must have at least read permission to the content.
The default content access account is the account that is used by default when crawling content sources. This account is specified by the shared services administrator. Alternatively, you can use crawl rules to specify a different
content access account to use when crawling particular content. Regardless of whether you use the default content access account or a different content access account specified by a crawl rule, the content access account that you use must have read access to
all content that is crawled, or the content is not crawled and is not available to queries.
We recommend that you select a default content access account that has the broadest access to most of your crawled content, and only use other content access accounts when security considerations require separate content access
accounts. For information about creating separate content access accounts to crawl content that cannot be read using the default content access account, see
Plan crawl rules later in this article.
For each content source you plan, identify the start addresses that cannot be accessed by the default content access account and plan to add crawl rules for URL patterns that encompass those start addresses.
Ensure that the domain account used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application you crawl.
Doing so can cause unpublished content in SharePoint sites and minor versions of files (history) in SharePoint sites to be crawled and indexed.
Another important consideration is that the crawler must use the same authentication method as the host server. By default, the crawler attempts to authenticate using NTLM authentication. You can configure the crawler to use
a different authentication method, if necessary. For more information, see "Authentication requirements for crawling content" in
Plan authentication methods (Office SharePoint Server).
Plan protocol handlers
All content that is crawled requires the use of a protocol handler to gain access to that content. Office SharePoint Server 2007 provides protocol handlers for all common Internet protocols. However, if you want to crawl content
that requires a protocol handler that is not installed with Office SharePoint Server 2007, you must install the third-party or custom protocol handler before you can crawl that content.
For a list that shows the protocol handlers that are installed by default, see
Protocol handlers in Office SharePoint Server 2007.
Plan to manage the impact of crawling
Crawling content can significantly decrease the performance of the servers that host the content. The impact that this has on a particular server varies depending upon the load that the host server is experiencing and whether the server has sufficient resources
(particularly CPU and RAM) to maintain service level agreements under normal or peak usage.
Crawler impact rules enable farm administrators to manage the impact your crawler has on the servers being crawled. For each crawler impact rule, you can specify a single URL or use wildcard characters in the URL path to include a block of URLs to which
the rule applies. You can then specify how many simultaneous requests for pages are made to the specified URL or choose to request only one document at a time and wait a number of seconds that you choose between requests.
Crawler impact rules reduce or increase the rate at which the crawler requests content from a particular start address or range of start addresses (sometimes called a site name), regardless of the content source used to crawl those addresses. The following
table shows the wildcard characters that you can use in the site name when adding a rule.
Wildcard to use | Result
* as the site name | Applies the rule to all sites.
*.* as the site name | Applies the rule to sites with dots in the name.
*.site_name.com as the site name | Applies the rule to all sites in the site_name.com domain (for example, *.adventure-works.com).
*.top-level_domain_name as the site name | Applies the rule to all sites that end with a specific top-level domain name, for example, *.com or *.net.
? | Replaces a single character in a rule. For example, *.adventure-works?.com applies to all sites in the domains adventure-works1.com, adventure-works2.com, and so on.
You can create a crawler impact rule that applies to all sites within a particular top-level domain. For example, *.com applies to all Internet sites with addresses that end in .com. If an administrator of a portal site adds a content source
for samples.microsoft.com, the rule for *.com applies to that site unless you add a crawler impact rule specifically for samples.microsoft.com.
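The wildcard behavior described above can be approximated with Python's standard fnmatch module, in which * matches any run of characters and ? matches a single character. The following sketch is an approximation for planning, not the product's matching logic. The rule settings are hypothetical, and the precedence shown (an exact site name wins, then the first matching wildcard rule) is an assumption based on the samples.microsoft.com example.

```python
from fnmatch import fnmatchcase

def find_impact_rule(host, rules):
    """Return the (site_name, setting) rule that applies to host, or None.

    rules: list of (site_name, setting) tuples, where site_name can contain
    the * and ? wildcards and setting is any description of the request rate.
    """
    host = host.lower()
    # Assumption: a rule that names the site exactly takes precedence.
    for site_name, setting in rules:
        if site_name.lower() == host:
            return site_name, setting
    # Otherwise use the first wildcard rule that matches.
    # Note: fnmatch also treats [ ] as special, which site names do not contain.
    for site_name, setting in rules:
        if fnmatchcase(host, site_name.lower()):
            return site_name, setting
    return None
```

Note that a pattern such as *.adventure-works?.com matches sites within those domains, for example www.adventure-works1.com, because the pattern requires a dot before the domain name.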
For content within your organization that other administrators are crawling, you can coordinate with those administrators to set crawler impact rules based on the performance and capacity of the servers. For most external sites, this coordination is not
possible. Requesting too much content on external servers or making requests too frequently can cause administrators of those sites to limit your future access if your crawls are using too many resources or too much bandwidth. Therefore, the best practice
is to crawl more slowly. In this way, you can mitigate the risk of losing access to crawl the relevant content.
During initial deployment, set the crawler impact rules to make as small an impact on other servers as possible while still crawling enough content frequently enough to ensure the freshness of the crawled content.
During the operations phase, you can adjust crawler impact rules based on your experiences and data from crawl logs.
Plan crawl rules
Crawl rules apply to a particular URL or set of URLs represented by wildcards (also referred to as the path affected by the rule). You use crawl rules to do the following things:
-
Avoid crawling irrelevant content by excluding one or more URLs. This also helps to reduce the use of server resources and network traffic, and to increase the relevance of search results.
-
Crawl links on the URL without crawling the URL itself. This option is useful for sites with links of relevant content when the page containing the links does not contain relevant information.
-
Enable complex URLs to be crawled. This option crawls URLs that contain a query parameter specified with a question mark. Depending upon the site, these URLs might or might not include relevant content. Because complex URLs can often redirect to irrelevant
sites, it is a good idea to enable this option on only sites where the content available from complex URLs is known to be relevant.
-
Enable content on SharePoint sites to be crawled as HTTP pages. This option enables the index server to crawl SharePoint sites that are behind a firewall or in scenarios in which the site being crawled restricts access to the Web service used by the crawler.
-
Specify whether to use the default content access account, a different content access account, or a client certificate for crawling the specified URL.
Crawl rules apply simultaneously to all content sources in the SSP.
Often, most of the content for a particular site address is relevant, but not a specific subsite or range of sites below that site address. By selecting a focused combination of URLs for which to create crawl rules that exclude unneeded items, shared services
administrators can maximize the relevance of the content in the index while minimizing the impact on crawling performance and the size of search databases. Creating crawl rules to exclude URLs is particularly useful when planning start addresses for external
content, because people in your organization cannot control the impact that crawling has on the resources of the servers that host that content.
When creating a crawl rule, you can use standard wildcard characters in the path. For example:
Because crawling content consumes resources and bandwidth, it is better to include a smaller amount of content that you know is relevant than a larger amount of content that might be irrelevant. After the initial deployment, you can review the query and
crawl logs and adjust content sources and crawl rules to be more relevant and include more content.
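When you plan crawl rules, it can help to test a list of candidate rules against sample URLs. The following sketch models rules with a hypothetical CrawlRule class and uses the standard fnmatch module for wildcard matching. It assumes that the first matching rule decides the outcome, and that a URL containing a question mark is skipped unless the matching rule enables complex URLs. Verify both assumptions against your deployment before relying on them.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

@dataclass
class CrawlRule:
    """Hypothetical model of a crawl rule for planning purposes."""
    path: str                        # URL pattern, which can contain * wildcards
    include: bool                    # True to include matching URLs, False to exclude
    crawl_complex_urls: bool = False

def should_crawl(url, rules):
    """Return True if url would be crawled under the given ordered rules."""
    is_complex = "?" in url  # complex URLs contain a query parameter
    for rule in rules:
        if fnmatchcase(url.lower(), rule.path.lower()):
            if not rule.include:
                return False
            return rule.crawl_complex_urls or not is_complex
    # Assumption: with no matching rule, simple URLs are crawled and complex ones are not.
    return not is_complex
```

Running sample URLs from the query and crawl logs through a function like this gives a quick view of what a proposed set of rules would include or exclude.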
Specify a different content access account
For crawl rules that include content, administrators have the option of changing the content access account for the rule. The default content access account is used unless another account is specified in a crawl rule. The main reason to use a different content
access account for a crawl rule is that the default content access account does not have access to all start addresses. For those start addresses, you can create a crawl rule and specify an account that does have access.
Ensure that the domain account used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application you crawl. Doing so can cause unpublished
content in SharePoint sites and minor versions of files (history) in SharePoint sites to be crawled and indexed.
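The account guidance in this section can be sketched as two small checks: which content access account applies to a URL, and whether any content access account is also used by an application pool. The account names, URL patterns, and function names below are hypothetical, and the first-match order for crawl rules is an assumption.

```python
from fnmatch import fnmatchcase

def content_access_account(url, default_account, rule_accounts):
    """Return the account used to crawl url.

    rule_accounts: ordered list of (path_pattern, account) pairs taken from
    crawl rules that specify a different content access account.
    """
    for pattern, account in rule_accounts:
        if fnmatchcase(url.lower(), pattern.lower()):
            return account
    return default_account  # used unless a crawl rule specifies another account

def accounts_shared_with_app_pools(content_access_accounts, app_pool_accounts):
    """Return content access accounts that are also application pool accounts."""
    pools = {account.lower() for account in app_pool_accounts}  # names compared without case
    return sorted(a for a in content_access_accounts if a.lower() in pools)
```

Any account returned by the second function should be replaced, because sharing an account with an application pool can cause unpublished content and minor versions to be indexed.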
Plan search settings that are managed at the farm level
In addition to the settings that are configured at the SSP level, several settings that are managed at the farm level affect how content is crawled. Consider the following farm-level search settings while planning for crawling:
-
Contact e-mail address: Crawling content affects the resources of the servers that are being crawled. Before you can crawl content, you must provide in the configuration settings the e-mail address of the person in your organization whom
administrators can contact in the event that the crawl adversely affects their servers. This e-mail address appears in logs for administrators of the servers being crawled so that those administrators can contact someone if the impact of crawling on their
performance and bandwidth is too great, or if other issues occur.
The contact e-mail address should belong to a person who has the necessary expertise and availability to respond quickly to requests. Alternatively, you can use a closely monitored distribution-list alias as the contact e-mail address. Regardless of whether
the content crawled is stored internally to the organization or not, quick response time is important.
-
Proxy server settings: You can choose whether to use a proxy server when crawling content. The proxy server to use depends upon the topology of your Office SharePoint Server 2007 deployment and the architecture of other servers in your organization.
-
Time-out settings: The time-out settings are used to limit the time that the search server waits while connecting to other services.
-
SSL setting: The Secure Sockets Layer (SSL) setting determines whether the SSL certificate must exactly match to crawl content.
Indexing content in different languages
When crawling content, the crawler determines each individual word in the content it finds. Languages that have words separated by spaces make it relatively easy for the crawler to distinguish each word. In other languages, finding the boundary between words
can be more complex.
Office SharePoint Server 2007 provides word breakers and stemmers by default to help crawl and index content in many languages. Word breakers find word boundaries in full-text indexed data, while stemmers conjugate verbs.
If you are crawling any of the languages in the article
Word breakers and stemmers by language in Office SharePoint Server 2007, Office SharePoint Server 2007 automatically uses the appropriate word breaker and stemmer for that language. An asterisk (*) indicates that the stemming feature is on by default.
When the crawler indexes content for a language that is not supported, the neutral breaker is used. If the neutral breaker does not give you the results you expect, you can try third-party solutions that work with Office SharePoint Server 2007.
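The fallback to the neutral breaker can be pictured as a lookup with a default. The following sketch is purely illustrative: the breakers dictionary is hypothetical, and the regular expression is a simple stand-in for whatever the neutral breaker actually does, not a description of it.

```python
import re

def neutral_breaker(text):
    """Stand-in for a neutral word breaker: split on non-word characters."""
    return re.findall(r"\w+", text)

def break_words(text, language, breakers):
    """Break text into words with the breaker for language, if one is installed.

    breakers: dict mapping a language code to a function that takes text
    and returns a list of words. Unsupported languages use the neutral breaker.
    """
    breaker = breakers.get(language, neutral_breaker)
    return breaker(text)
```

A breaker that relies on spaces and punctuation returns text written without spaces between words as a single long token, which illustrates why language-specific word breakers matter for some languages.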
As a best practice, be sure that you install the appropriate word breaker and stemmer for each of the languages that you need to support. Word breakers and stemmers must be installed on all of the servers that are running the Office SharePoint Server Search
service.
For more information about word breakers and stemmers, see
Plan for multilingual sites.