The purpose of this article is to help search services administrators understand how Microsoft Office SharePoint Server 2007 crawls and indexes content and to help them plan to crawl content.


SharePoint 2007: How to Plan for Crawling Content




Before end users can use the enterprise search functionality in Office SharePoint Server 2007 to search for content, you must first crawl the content that you want to make available for users to query.

For the purpose of this article, content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message file.

When planning to crawl content, you should consider the following questions:

Use the information in this article to help you answer these questions and make the necessary planning decisions about the content you want to crawl and how and when you want to crawl that content.

Office SharePoint Server 2007 includes the Office SharePoint Server Search service, which is used to crawl and index content. This service is part of a Shared Services Provider (SSP), and all content crawled using a particular SSP is stored in a single index. For information about choosing the number of SSPs to use to index content, see Plan Shared Services Providers.

About crawling and indexing content

Crawling and indexing content is the process by which the system accesses and parses content and its properties, sometimes called metadata, to build a content index from which search queries can be served.

The result of successfully crawling content is that the individual files or pieces of content that you want to make available to search queries are accessed and read by the crawler. The keywords and metadata for those files are stored in the content index, sometimes called the index. The index consists of the keywords that are stored in the file system of the index server and the metadata that is stored in the search database. The system maintains a mapping between the keywords, the metadata associated with the individual pieces of content from which the keywords were crawled, and the URL of the source from which the content was crawled.

Note: The crawler does not change the files on the host servers in any way. Instead, the files on the host servers are simply accessed and read, and the text and metadata for those files are sent to the index server to be indexed. However, because the crawler reads the content on the host server, some servers that host certain sources of content might update the last accessed date on files that have been crawled.
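The mapping described above, between keywords, item metadata, and the URL of the source, can be pictured with a small toy model. The following Python sketch is illustrative only: the class name, method names, sample URLs, and property names are all hypothetical, and it does not reflect how Office SharePoint Server 2007 actually stores its index on disk or in the search database.

```python
from collections import defaultdict


class ToyContentIndex:
    """Toy model of a content index. Not the SharePoint implementation."""

    def __init__(self):
        # keyword -> set of source URLs that contain the keyword
        self.keywords = defaultdict(set)
        # source URL -> properties (metadata) of the crawled item
        self.metadata = {}

    def add_item(self, url, text, properties):
        """Record the keywords and metadata for one crawled item."""
        self.metadata[url] = dict(properties)
        for word in text.lower().split():
            self.keywords[word].add(url)

    def query(self, keyword):
        """Return (url, metadata) pairs for items containing the keyword."""
        urls = sorted(self.keywords.get(keyword.lower(), set()))
        return [(url, self.metadata[url]) for url in urls]


# Hypothetical sample items, as a crawler might report them.
index = ToyContentIndex()
index.add_item("http://intranet/docs/plan.docx",
               "Crawl planning notes", {"Author": "A. Admin"})
index.add_item("file://server/share/budget.xlsx",
               "Budget planning", {"Author": "B. User"})

for url, props in index.query("planning"):
    print(url, props)
```

The point of the sketch is only that a query is answered from the index, not from the host servers: the source files are read once during the crawl, and later queries use the stored keywords, metadata, and URLs.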

Identify the sources of content that you want to crawl

In many cases, the needs of your organization might only require that you crawl all the content contained by the SharePoint sites in your organization's server farm. In this case, you might not need to identify the sources of content you want to crawl because all site collections in a server farm can be crawled using the default content source. For more information about the default content source, see "Plan content sources" later in this article.

Many organizations also need to crawl content that is external to the server farm, such as file shares or Web sites on the Internet. Office SharePoint Server 2007 can crawl and index content that is hosted on other Windows SharePoint Services or Office SharePoint Server farms, Web sites, file shares, Microsoft Exchange public folders, IBM Lotus Notes servers, and business data that is stored in databases. This greatly increases the amount of content that can be made available to search queries.

In many cases, however, you might not want to crawl every site collection in your server farm, because content stored in some site collections might not be relevant in search results. In this case, you must do one or both of the following:

With the Infrastructure Update for Microsoft Office Servers installed, there are two ways to process search queries in order to return search results to users. You can query the Search Server content index, or you can use federated search. Note that the Infrastructure Update for Microsoft Office Servers provides Office SharePoint Server 2007 with the federated search capability that first appeared in Search Server 2008.

There are advantages to each approach. For a comparison of these two approaches to processing search queries, see Federated Search Overview (http://go.microsoft.com/fwlink/?LinkID=122651). For a list and brief description of articles about understanding and using federation, see Working with Federation (Office SharePoint Server). For more information about the Infrastructure Update for Microsoft Office Servers, see Install the Infrastructure Update for Microsoft Office Servers (Office SharePoint Server 2007).

Plan content sources

Before you can crawl content, you must first determine where the content is and on what types of servers the content is hosted. After this information is gathered, a shared services administrator can create one or more content sources that are used to crawl that content. These content sources provide the following information to the crawler during a crawl:

This section helps you plan for the content sources needed by your organization.

The default content source is called Local Office SharePoint Server sites. Shared services administrators can use this content source to crawl and index all content in all Web applications associated with the SSP. By default, Office SharePoint Server 2007 adds the start address (in this case a URL) of the top-level site of each site collection created in the Web application that uses the same SSP to the default content source.

For some organizations, simply using the default content source to crawl all sites in their site collections satisfies their search requirements. However, many organizations need additional content sources.

Reasons for creating additional content sources include the need to:

Shared services administrators can create up to 500 content sources in each SSP and each content source can contain up to 500 start addresses. To keep administration as simple as possible, you should create only as many content sources as you need.
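When drafting a content source plan on paper or in a spreadsheet, it can help to check it against the limits above before creating anything in the SSP. The following Python sketch is a planning aid only, not part of any SharePoint API; the function name and the plan layout (a dictionary of content source names to start addresses) are assumptions made for illustration.

```python
# Limits stated in this article for Office SharePoint Server 2007.
MAX_CONTENT_SOURCES_PER_SSP = 500
MAX_START_ADDRESSES_PER_SOURCE = 500


def check_plan(plan):
    """Check a draft plan against the documented limits.

    plan: dict mapping a content source name to a list of start addresses
    (hypothetical layout). Returns a list of problem descriptions; an
    empty list means the plan is within both limits.
    """
    problems = []
    if len(plan) > MAX_CONTENT_SOURCES_PER_SSP:
        problems.append(
            "%d content sources exceed the limit of %d per SSP"
            % (len(plan), MAX_CONTENT_SOURCES_PER_SSP))
    for name, addresses in plan.items():
        if len(addresses) > MAX_START_ADDRESSES_PER_SOURCE:
            problems.append(
                "content source '%s' has %d start addresses (limit %d)"
                % (name, len(addresses), MAX_START_ADDRESSES_PER_SOURCE))
    return problems


# Hypothetical draft plan for one SSP.
draft = {
    "Local Office SharePoint Server sites": ["http://intranet"],
    "Team file shares": ["file://server/share"],
}
print(check_plan(draft))
```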

Crawl different types of content

You can crawl only one type of content per content source. That is, you can create a content source that contains URLs for SharePoint sites and another that contains URLs for file shares, but you cannot create a single content source that contains URLs to both SharePoint sites and file shares. The following table lists the types of content sources that can be configured.

This type of content source: Includes this type of content

SharePoint sites:
    SharePoint sites from the same farm or different Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008 farms.
    SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Microsoft Windows SharePoint Services 2.0 farms.
    Note: Unlike crawling SharePoint sites on Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008, the crawler cannot automatically crawl all subsites in a site collection from previous versions of SharePoint Products and Technologies. Therefore, when crawling SharePoint sites from previous versions, you must specify the URL of each top-level site and each subsite that you want to crawl. Sites listed in the Site Directory of Microsoft Office SharePoint Portal Server 2003 farms are crawled when the portal site is crawled. For more information about the Site Directory, see About the Site Directory (http://go.microsoft.com/fwlink/?LinkId=88227&clcid=0x409).

Web sites:
    Other Web content in your organization not found on SharePoint sites.
    Content on Web sites on the Internet.

File shares:
    Content on file shares within your organization.

Exchange public folders:
    Microsoft Exchange Server content.

Lotus Notes:
    E-mail messages stored in Lotus Notes databases.
    Note: Unlike all other types of content sources, the Lotus Notes content source option does not appear in the user interface until you have installed and configured the appropriate prerequisite software. For more information, see Configure Office SharePoint Server Search to crawl Lotus Notes (Office SharePoint Server 2007).

Business data:
    Business data stored in line-of-business applications.
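Because each content source can hold only one type of content, a planning inventory of start addresses has to be split by type before content sources are created. The following Python sketch shows one way to do that split. It is a planning aid under stated assumptions, not a SharePoint API: the function name, the inventory layout, and the sample addresses are hypothetical, and the type of each address is supplied by the planner rather than detected automatically.

```python
from collections import defaultdict

# The content source types listed in the table above.
CONTENT_SOURCE_TYPES = {
    "SharePoint sites",
    "Web sites",
    "File shares",
    "Exchange public folders",
    "Lotus Notes",
    "Business data",
}


def group_by_type(inventory):
    """Group start addresses so that each group holds a single type.

    inventory: iterable of (start_address, content_source_type) pairs
    (hypothetical layout). Returns a dict mapping each type to the list
    of start addresses that would go into a content source of that type.
    """
    groups = defaultdict(list)
    for address, source_type in inventory:
        if source_type not in CONTENT_SOURCE_TYPES:
            raise ValueError("unknown content source type: %r" % source_type)
        groups[source_type].append(address)
    return dict(groups)


# Hypothetical inventory gathered during planning.
inventory = [
    ("http://intranet", "SharePoint sites"),
    ("file://server/share", "File shares"),
    ("http://teams", "SharePoint sites"),
    ("http://www.example.com", "Web sites"),
]
for source_type, addresses in group_by_type(inventory).items():
    print(source_type, addresses)
```

Each resulting group corresponds to at least one content source. You might still split a group further, for example to crawl some addresses on a different schedule.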

Plan content sources for business data 

Business data content sources require that the applications hosting the data are first registered in the Business Data Catalog. You must create one or more separate content sources of the Business Data content source types to crawl business data. You can create one content source to crawl all applications registered in the Business Data Catalog, or you can create separate content sources to crawl individual applications that are registered in the Business Data Catalog.

Often, the people who plan for integration of business data into your site collections are not the same people involved in the overall content planning process. Therefore, include business application administrators in your content planning teams.
