The purpose of this article is to help search services administrators understand how Microsoft Office SharePoint Server 2007 crawls and indexes content and to help them plan to crawl content.

SharePoint 2007: How to Plan for Crawling Content

Before end users can use the enterprise search functionality in Office SharePoint Server 2007 to search for content, you must first crawl the content that you want to make available for users to query.

For the purpose of this article, content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message file.

When planning to crawl content, you should consider the following questions:

Where is the content that you want to crawl physically located?
Is some of the content that you want to crawl stored in different types of sources, such as file shares, SharePoint sites, Web sites, or other places?
Do you want to crawl all the content at specific sources or just some of it?
What types of files make up the content that you want to crawl?
When and how often should you crawl content?
How is this content secured?

Use the information in this article to help you answer these questions and make the necessary planning decisions about the content you want to crawl and how and when you want to crawl that content.

Office SharePoint Server 2007 includes the Office SharePoint Server Search service, which is used to crawl and index content. This service is part of a Shared Services Provider (SSP), and all content crawled using a particular SSP is indexed in a single index. For information about choosing the number of SSPs to use to index content, see Plan Shared Services Providers.

About crawling and indexing content

Crawling and indexing content is the process by which the system accesses and parses content and its properties, sometimes called metadata, to build a content index from which search queries can be served.

The result of successfully crawling content is that the individual files or pieces of content that you want to make available to search queries are accessed and read by the crawler. The keywords and metadata for those files are stored in the content index, sometimes called the index. The index consists of the keywords that are stored in the file system of the index server and the metadata that is stored in the search database. The system maintains a mapping between the keywords, the metadata associated with the individual pieces of content from which the keywords were crawled, and the URL of the source from which the content was crawled.
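The mapping described above can be pictured with a small sketch. It is only an illustration of the idea, not how Office SharePoint Server 2007 stores its index: the ContentIndex class, its method names, and the sample URLs are all hypothetical, and real indexing involves word breaking, IFilters, and security trimming that are ignored here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ContentIndex:
    """Toy index: keywords map to source URLs, and URLs map to metadata."""
    keywords: dict = field(default_factory=lambda: defaultdict(set))
    metadata: dict = field(default_factory=dict)

    def add(self, url, text, properties):
        # The item is only read; nothing is written back to the host server.
        self.metadata[url] = dict(properties)
        for word in text.lower().split():
            self.keywords[word].add(url)

    def query(self, word):
        # Return (url, metadata) pairs for every item that contains the keyword.
        urls = sorted(self.keywords.get(word.lower(), ()))
        return [(url, self.metadata[url]) for url in urls]


index = ContentIndex()
index.add("http://portal/docs/plan.docx", "Crawl plan", {"author": "A"})
index.add("http://portal/docs/notes.docx", "crawl notes", {"author": "B"})
for url, props in index.query("crawl"):
    print(url, props)
```

The point of the sketch is the three-way link between keyword, metadata, and source URL: a query on a keyword leads back to the URL of the item and its properties.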

Note: The crawler does not change the files on the host servers in any way. Instead, the files on the host servers are simply accessed and read, and the text and metadata for those files are sent to the index server to be indexed. However, because the crawler reads the content on the host server, some servers that host certain sources of content might update the last accessed date on files that have been crawled.

Identify the sources of content that you want to crawl

In many cases, the needs of your organization might only require that you crawl all the content contained by the SharePoint sites in your organization's server farm. In this case, you might not need to identify the sources of content you want to crawl because all site collections in a server farm can be crawled using the default content source. For more information about the default content source, see "Plan content sources" later in this article.

Many organizations also need to crawl content that is external to the server farm, such as file shares or Web sites on the Internet. Office SharePoint Server 2007 can crawl and index content that is hosted on other Windows SharePoint Services or Office SharePoint Server farms, Web sites, file shares, Microsoft Exchange public folders, IBM Lotus Notes servers, and business data that is stored in databases. This greatly increases the amount of content that can be made available to search queries.

In many cases, however, you might not want to crawl every site collection in your server farm, because content stored in some site collections might not be relevant in search results. In this case, you must do one or both of the following:

Note the site collections that you do not want to crawl. If you decide to use the default content source, you must ensure that the start addresses for the site collections you do not want to crawl are not listed in the default content source.
Note the individual start addresses of the site collections that you do want to crawl. If you decide to create additional content sources to use to crawl this content, you need to know these start addresses. For information about when to use one or more content sources, see "Plan content sources" later in this article.
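As a planning aid, the two lists above can be derived from one inventory. The following is a minimal sketch that assumes you keep your site collection URLs in a plain list; the function name and the sample URLs are hypothetical, and nothing here calls a SharePoint API.

```python
def start_addresses_to_crawl(all_site_collections, excluded):
    """Return the start addresses to keep, in their original order."""
    # Compare URLs without trailing slashes and without regard to case.
    skip = {url.rstrip("/").lower() for url in excluded}
    return [url for url in all_site_collections
            if url.rstrip("/").lower() not in skip]


inventory = ["http://portal/sites/hr", "http://portal/sites/archive/", "http://portal/sites/sales"]
do_not_crawl = ["http://portal/sites/Archive"]
print(start_addresses_to_crawl(inventory, do_not_crawl))
```

The result is the list of start addresses that you would expect to see in whichever content source you use for these site collections.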

With the Infrastructure Update for Microsoft Office Servers installed, there are two ways to process search queries in order to return search results to users. You can query the Search Server content index, or you can use federated search. Note that the Infrastructure Update for Microsoft Office Servers provides Office SharePoint Server 2007 with the federated search capability that first appeared in Search Server 2008.

There are advantages to each approach. For a comparison of these two approaches to processing search queries, see Federated Search Overview (http://go.microsoft.com/fwlink/?LinkID=122651). For a list and brief description of articles about understanding and using federation, see Working with Federation (Office SharePoint Server). For more information about the Infrastructure Update for Microsoft Office Servers, see Install the Infrastructure Update for Microsoft Office Servers (Office SharePoint Server 2007).

Plan content sources

Before you can crawl content, you must first determine where the content is and on what types of servers the content is hosted. After this information is gathered, a shared services administrator can create one or more content sources that are used to crawl that content. These content sources provide the following information to the crawler during a crawl:

Type of content you want to crawl — for example, a SharePoint site or a file share.
Start address from which to start crawling.
Behavior to use when crawling — for example, how deep to crawl from the start address, or how many server hops to allow.
Crawling schedule.
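When you document your plan, it can help to record these four pieces of information in one structure per content source. The sketch below is a hypothetical planning record, not a product object model: the class and field names are assumptions, and the content type names are taken from the types of content that this article says can be crawled.

```python
from dataclasses import dataclass, field
from enum import Enum


class ContentType(Enum):
    SHAREPOINT_SITES = "SharePoint sites"
    WEB_SITES = "Web sites"
    FILE_SHARES = "File shares"
    EXCHANGE_PUBLIC_FOLDERS = "Exchange public folders"
    LOTUS_NOTES = "Lotus Notes"
    BUSINESS_DATA = "Business data"


@dataclass
class ContentSourcePlan:
    """One planned content source: type, start addresses, behavior, schedule."""
    name: str
    content_type: ContentType
    start_addresses: list = field(default_factory=list)
    crawl_setting: str = ""
    schedule: str = ""

    def summary(self):
        count = len(self.start_addresses)
        when = self.schedule or "no schedule"
        return f"{self.name}: {self.content_type.value}, {count} start address(es), {when}"


plan = ContentSourcePlan(
    name="Team file shares",
    content_type=ContentType.FILE_SHARES,
    start_addresses=["file://fs01/docs"],  # hypothetical address
    crawl_setting="The folder and all subfolders of each start address",
    schedule="Daily at 01:00",
)
print(plan.summary())
```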

This section helps you plan for the content sources needed by your organization.

The default content source is called Local Office SharePoint Server sites. Shared services administrators can use this content source to crawl and index all content in all Web applications associated with the SSP. By default, Office SharePoint Server 2007 adds the start address (in this case a URL) of the top-level site of each site collection created in the Web application that uses the same SSP to the default content source.

For some organizations, simply using the default content source to crawl all sites in their site collections satisfies their search requirements. However, many organizations need additional content sources.

Reasons for creating additional content sources include the need to:

Crawl different types of content.
Crawl some content on different schedules than other content.
Limit or increase the quantity of content that is crawled.

Shared services administrators can create up to 500 content sources in each SSP and each content source can contain up to 500 start addresses. To keep administration as simple as possible, you should create only as many content sources as you need.
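A planning spreadsheet or script can check a draft plan against these two limits before anything is configured. The following sketch assumes the plan is a dictionary that maps each content source name to its list of start addresses; that representation and the function name are assumptions made for illustration.

```python
MAX_CONTENT_SOURCES = 500   # per SSP, as stated above
MAX_START_ADDRESSES = 500   # per content source, as stated above


def check_plan_limits(plan):
    """Return a list of human-readable problems; an empty list means the plan fits."""
    problems = []
    if len(plan) > MAX_CONTENT_SOURCES:
        problems.append(
            f"{len(plan)} content sources planned; the limit is {MAX_CONTENT_SOURCES}")
    for name, addresses in plan.items():
        if len(addresses) > MAX_START_ADDRESSES:
            problems.append(
                f"'{name}' has {len(addresses)} start addresses; "
                f"the limit is {MAX_START_ADDRESSES}")
    return problems


draft = {"Local Office SharePoint Server sites": ["http://portal"],
         "Partner sites": [f"http://partner{i}.example" for i in range(501)]}
print(check_plan_limits(draft))
```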

Crawl different types of content

You can only crawl one type of content per content source. That is, you can create a content source that contains URLs for SharePoint sites and another that contains URLs for file shares, but you cannot create a single content source that contains URLs to both SharePoint sites and file shares. The following table lists the types of content sources that can be configured.

This type of content source	Includes this type of content
SharePoint sites	SharePoint sites from the same farm or different Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008 farms. SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Microsoft Windows SharePoint Services 2.0 farms. Note: Unlike crawling SharePoint sites on Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008, the crawler cannot automatically crawl all subsites in a site collection from previous versions of SharePoint Products and Technologies. Therefore, when crawling SharePoint sites from previous versions, you must specify the URL of each top-level site and each subsite that you want to crawl. Sites listed in the Site Directory of Microsoft Office SharePoint Portal Server 2003 farms are crawled when the portal site is crawled. For more information about the Site Directory, see About the Site Directory (http://go.microsoft.com/fwlink/?LinkId=88227&clcid=0x409).
Web sites	Other Web content in your organization not found on SharePoint sites; content on Web sites on the Internet
File shares	Content on file shares within your organization
Exchange public folders	Microsoft Exchange Server content
Lotus Notes	E-mail messages stored in Lotus Notes databases. Note: Unlike all other types of content sources, the Lotus Notes content source option does not appear in the user interface until you have installed and configured the appropriate prerequisite software. For more information, see Configure Office SharePoint Server Search to crawl Lotus Notes (Office SharePoint Server 2007).
Business data	Business data stored in line-of-business applications
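Because each content source holds only one type of content, a mixed list of start addresses has to be split by type before content sources are created. A minimal sketch of that grouping follows; it assumes you have already labeled each address with its content source type, and the addresses shown are hypothetical.

```python
from collections import defaultdict


def group_by_type(planned_addresses):
    """Group (content_type, start_address) pairs into one list per content type."""
    sources = defaultdict(list)
    for content_type, address in planned_addresses:
        sources[content_type].append(address)
    return dict(sources)


planned = [
    ("SharePoint sites", "http://portal"),
    ("File shares", "file://fs01/docs"),
    ("SharePoint sites", "http://teams"),
]
for content_type, addresses in group_by_type(planned).items():
    print(content_type, addresses)
```

Each resulting group needs at least one content source of its own; you might still split a group further, for example to crawl on different schedules.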

Plan content sources for business data

Business data content sources require that the applications hosting the data are first registered in the Business Data Catalog. You must create one or more separate content sources of the Business Data content source types to crawl business data. You can create one content source to crawl all applications registered in the Business Data Catalog, or you can create separate content sources to crawl individual applications that are registered in the Business Data Catalog.

Often, the people who plan for integration of business data into your site collections are not the same people involved in the overall content planning process. Therefore, include business application administrators in your content planning teams so that they can advise you how to integrate their data into your other content and effectively present it in your site collections.

Crawl content on different schedules

Shared services administrators often must decide whether some content is crawled more frequently than other content. The larger the volume of content that you crawl, the more likely it is that you are crawling content from different sources. These different sources might or might not be of the same type and might be hosted on servers of varying speeds in relation to one another.

These factors make it more likely that you need additional content sources to crawl those different sources of content at different times.

In many cases, not all of this information can be known until after Office SharePoint Server 2007 is deployed and running for some time. Instead, some of these decisions are made during the operations phase. However, it is a good idea to consider these factors during planning so that you can plan your crawl schedules based on the information at hand.

The following two sections provide more information about crawling content on different schedules.

Downtimes and periods of peak usage

Consider the downtimes and peak usage times of the servers that host the content you want to crawl. For example, if you are crawling content hosted by many different servers outside your server farm, it is likely that these servers are backed up on different schedules and have different peak usage times. The administration of servers outside your server farm is typically out of your control. Therefore, we recommend that you coordinate your crawls with the administrators of the servers that host the content you want to crawl to ensure you do not attempt to crawl content on their servers during a downtime or peak usage time.

A common scenario involves content outside the control of your organization that relates to the content on your SharePoint sites. You can add the start addresses for this content to an existing content source or create a new content source for external content. Because availability of external sites varies widely, it is helpful to add separate content sources for different external content. In this way, the content sources for external content can be crawled at different times than your other content sources. You can then update external content on a crawl schedule that accounts for the availability of each site.
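Once the administrators of the host servers have told you their backup windows and peak hours, checking a proposed crawl window against them is a simple interval comparison. The sketch below assumes every period starts and ends on the same day; the function names and the sample times are hypothetical.

```python
from datetime import time


def overlaps(window, busy):
    """True if two (start, end) time-of-day periods on the same day overlap."""
    return window[0] < busy[1] and busy[0] < window[1]


def conflicts(crawl_window, busy_periods):
    """Return the labels of the busy periods that the crawl window would collide with."""
    return [label for label, period in busy_periods.items()
            if overlaps(crawl_window, period)]


host_busy = {
    "nightly backup": (time(2, 0), time(4, 0)),
    "peak usage": (time(9, 0), time(17, 0)),
}
print(conflicts((time(1, 0), time(3, 0)), host_busy))
print(conflicts((time(4, 0), time(6, 0)), host_busy))
```

A crawl window that returns an empty list for every host in a content source is a reasonable candidate for that content source's schedule.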

Content that is updated frequently

When planning crawl schedules, consider that some sources of content are typically updated more frequently than others. For example, if you know that content on some site collections or external sources is updated only on Fridays, it would be a waste of resources to crawl that content more frequently than once each week. However, your server farm might contain other site collections that are continually updated Monday through Friday, but not typically updated on Saturdays and Sundays. In this case, you might want to crawl several times each weekday, but only once or twice on weekends.

The way in which content is stored across the site collections in your environment can guide you to create additional content sources for each of your site collections in each of your Web applications. For example, if a site collection stores only archived information, you may not need to crawl that content as frequently as you crawl a site collection that stores frequently updated content. In this case, you might want to crawl these two site collections using different content sources so that they can be crawled on different schedules without having to crawl the archive sites as frequently as the other content.
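To compare candidate schedules, it can help to estimate how many crawls each one produces per week. The following sketch models the weekday and weekend example above; the default counts of four weekday crawls and one weekend crawl are illustrative assumptions, not recommendations from this article.

```python
from datetime import date, timedelta


def crawls_for_day(day, weekday_crawls=4, weekend_crawls=1):
    """Number of crawls planned for a given calendar day."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return weekend_crawls if day.weekday() >= 5 else weekday_crawls


def crawls_per_week(week_start, **counts):
    """Total crawls over the seven days that begin at week_start."""
    return sum(crawls_for_day(week_start + timedelta(days=i), **counts)
               for i in range(7))


monday = date(2024, 1, 1)
print(crawls_per_week(monday))                                       # frequently updated sites
print(crawls_per_week(monday, weekday_crawls=0, weekend_crawls=1))   # archive, weekends only
```

Running the estimate for an archive site collection and for a busy one makes the case for separate content sources concrete: the two totals differ by an order of magnitude.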

Full and incremental crawl schedules

Because a full crawl crawls all content that the crawler encounters and has at least read access to, regardless of whether that content has been previously crawled, full crawls can take significantly more time to complete than incremental crawls.

We recommend that you plan crawl schedules based on the availability, performance, and bandwidth considerations of the servers running the search service and the servers hosting the crawled content.
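The difference in duration between the two kinds of crawl follows from how many items each one has to process. The sketch below illustrates this with a deliberately simple model in which each item carries a last-modified marker and an incremental crawl processes only new or changed items. That change-detection rule, the function name, and the sample data are assumptions for illustration; this article does not describe how the product detects changes.

```python
def items_to_crawl(current, previous, full=False):
    """Return the URLs a crawl would process.

    current and previous map each URL to a last-modified marker:
    current is what the host reports now, previous is what the last crawl saw.
    """
    if full:
        # A full crawl processes everything it can read, crawled before or not.
        return sorted(current)
    # Assumed rule: an incremental crawl processes only new or changed items.
    return sorted(url for url, stamp in current.items()
                  if previous.get(url) != stamp)


last_crawl = {"http://portal/a": "2024-01-01", "http://portal/b": "2024-01-01"}
now = {"http://portal/a": "2024-01-01", "http://portal/b": "2024-01-03",
       "http://portal/c": "2024-01-02"}
print(items_to_crawl(now, last_crawl, full=True))
print(items_to_crawl(now, last_crawl))
```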

Content source type	Crawl settings options
SharePoint sites	Everything under the host name for each start address; only the SharePoint site of each start address
Web sites	Only within the server of each start address; only the first page of each start address; Custom: specify page depth and number of server hops (the default setting for this option is unlimited page depths and server hops)
File shares	The folder and all subfolders of each start address; only the folder of each start address
Exchange public folders	The folder and all subfolders of each start address; only the folder of each start address
Business data	Crawl entire Business Data Catalog; crawl selected applications
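To get a feel for how the Custom option for Web sites bounds a crawl, the following sketch walks a toy link graph breadth-first. It assumes that page depth counts link levels from the start address and that a server hop is counted each time a link leads to a different host name; those definitions, the function name, and the example.com-style URLs are assumptions for illustration and may not match the product's exact behavior.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_scope(start, links, max_depth, max_hops):
    """Return the URLs reachable from start within the depth and hop limits.

    links maps each URL to the list of URLs it links to (a toy link graph).
    """
    seen = {(start, 0)}                # (url, hops used to reach it)
    queue = deque([(start, 0, 0)])     # (url, depth, hops)
    while queue:
        url, depth, hops = queue.popleft()
        if depth == max_depth:
            continue                   # do not follow links any deeper
        for target in links.get(url, []):
            crossed = urlparse(target).netloc != urlparse(url).netloc
            new_hops = hops + (1 if crossed else 0)
            if new_hops > max_hops or (target, new_hops) in seen:
                continue
            seen.add((target, new_hops))
            queue.append((target, depth + 1, new_hops))
    return sorted({url for url, _ in seen})


graph = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://b.example/": ["http://c.example/"],
}
print(crawl_scope("http://a.example/", graph, max_depth=1, max_hops=0))
print(crawl_scope("http://a.example/", graph, max_depth=2, max_hops=1))
```

Even in this tiny graph, raising both limits by one doubles the number of pages in scope, which is why small values are a sensible starting point on well-connected sites.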

For this content source type	If this pertains	Use this crawl setting option
SharePoint sites	You want to include the content on the site itself. -or- You do not want to include the content available on subsites, or you want to crawl them on a different schedule.	Crawl only the SharePoint site of each start address
SharePoint sites	You want to include the content on the site itself. -or- You want to crawl all content under the start address on the same schedule.	Crawl everything under the host name of each start address
Web sites	Content on the site itself is relevant. -or- Content available on linked sites is not likely to be relevant.	Crawl only within the server of each start address
Web sites	Relevant content is on only the first page.	Crawl only the first page of each start address
Web sites	You want to limit how deep to crawl the links on the start addresses.	Custom: specify the number of pages deep and number of server hops to crawl. We recommend that you start with a small number on a highly connected site, because specifying more than three pages deep or more than three server hops can crawl the entire Internet.
File shares or Exchange public folders	Content available in the subfolders is not likely to be relevant.	Crawl only the folder of each start address
File shares or Exchange public folders	Content in the subfolders is likely to be relevant.	Crawl the folder and subfolder of each start address
Business data	All applications that are registered in the Business Data Catalog contain relevant content.	Crawl the entire Business Data Catalog
Business data	Not all applications that are registered in the Business Data Catalog contain relevant content. -or- You want to crawl some applications on a different schedule.	Crawl selected applications
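The two-way choices in the table above can be captured as a simple lookup, which is handy if you record planning decisions in a script or worksheet. The sketch below covers only the content source types that have exactly two options; Web sites are left out because they have three. The function name and the meaning of the broad flag (include subsites, subfolders, or all registered applications) are assumptions made for illustration.

```python
CRAWL_SETTINGS = {
    ("SharePoint sites", True): "Crawl everything under the host name of each start address",
    ("SharePoint sites", False): "Crawl only the SharePoint site of each start address",
    ("File shares", True): "Crawl the folder and subfolder of each start address",
    ("File shares", False): "Crawl only the folder of each start address",
    ("Exchange public folders", True): "Crawl the folder and subfolder of each start address",
    ("Exchange public folders", False): "Crawl only the folder of each start address",
    ("Business data", True): "Crawl the entire Business Data Catalog",
    ("Business data", False): "Crawl selected applications",
}


def recommend_crawl_setting(source_type, broad):
    """Look up the crawl setting for a source type.

    broad is True when nested content (subsites, subfolders, or every
    registered application) is relevant and can share one schedule.
    """
    try:
        return CRAWL_SETTINGS[(source_type, broad)]
    except KeyError:
        raise ValueError(f"No two-way rule for {source_type!r}") from None


print(recommend_crawl_setting("File shares", broad=False))
print(recommend_crawl_setting("Business data", broad=True))
```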

Wildcard to use	Result
* as the site name	Applies the rule to all sites.
*.* as the site name	Applies the rule to sites with dots in the name.
*.site_name.com as the site name	Applies the rule to all sites in the site_name.com domain (for example, *.adventure-works.com).
*.top-level_domain_name as the site name	Applies the rule to all sites that end with a specific top-level domain name, for example, *.com or *.net.
?	Replaces a single character in a rule. For example, *.adventure-works?.com applies to all sites in the domains adventure-works1.com, adventure-works2.com, and so on.
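You can try out site name patterns like these before entering them by testing them against a list of host names. The sketch below uses Python's fnmatch module, whose * and ? wildcards behave like the ones described in the table; it is an approximation for planning purposes only, and the product's own matching may differ in details. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def rule_applies(site_pattern, host):
    """True if the host name matches the wildcard site name pattern."""
    # Host names are not case-sensitive, so compare in lowercase.
    return fnmatchcase(host.lower(), site_pattern.lower())


hosts = ["www.adventure-works.com", "www.adventure-works1.com", "intranet"]
for pattern in ["*", "*.adventure-works.com", "*.adventure-works?.com", "*.com"]:
    matched = [h for h in hosts if rule_applies(pattern, h)]
    print(pattern, "->", matched)
```

Note that a pattern such as *.adventure-works.com requires something before the first dot, so in this approximation it does not match the bare name adventure-works.com.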
