TechNet Wiki v2

This article describes or explains the following:

The indexing and search components that comprise the overall search architecture in Office SharePoint Server 2007.
The purpose and capabilities of the different search roles in an Office SharePoint Server 2007 farm.
The indexing processes used by Office SharePoint Server 2007.
How protocol handlers and iFilters fit into the Office SharePoint Server 2007 search architecture.
The use of word breakers and stemmers.
The possible dependencies on 32-bit architecture in specific scenarios.
How Office SharePoint Server 2007 manages and propagates indexes from indexers to query servers.
The query processes in an enterprise search solution based on Office SharePoint Server 2007.

Indexing and Search Architecture

A search solution based on Office SharePoint Server 2007 consists of two main components: the indexing engine and the query engine. In brief, the indexing engine crawls and indexes content that may be stored in a variety of formats and locations in your corpus, while the query engine provides the capability to search the indexed information.

Index Engine

The indexing engine provided by Office SharePoint Server 2007 is capable of crawling a variety of content sources, such as Web sites, SharePoint sites, Exchange Server Public Folders, line-of-business data, and file shares. The indexing engine retrieves the definitions of the content sources from the search configuration data. You have control over defining the content sources to be crawled.

The indexing engine provides the base logic for the indexing process, and loads specific components called protocol handlers to connect to and crawl the different types of content sources. Protocol handlers, in turn, load additional components called iFilters to read the contents of specific file types.

The indexing engine maintains a file-based index, which contains the indexed content. The indexing engine also maintains managed properties (in what is called the property store or search schema) and scope definitions in the search and configuration databases managed by SQL Server.

Query Engine

Put simply, a Web server initiates each query by collecting terms from the user, and then contacts the query engine to search the full-text index for items that contain the searched-for terms. The results are supplemented with keywords, best-bets, and managed properties from the search configuration database, managed by SQL Server. If the query consists of only a property filter, the Web server needs only to contact the database server, and does not contact the query engine.

Queries are initiated through the Search object model or the Search Web service on Web servers.

Additional components, such as word breakers and stemmers, are used throughout the entire process. These components are discussed in detail later in this article.

You can physically separate the indexing engine from the query engine by implementing specific roles in your server farm. You can also physically separate the indexing and query engines from the Web servers that expose the search object model and the search Web service.

Server Roles

When you create solutions with SharePoint technologies, you must be aware of the roles that servers can take within your server farm. For search, the roles that you should be aware of are the indexer role, the query server role, the Web server role, and the database server role.

Indexer Role

The indexer role provides the indexing services, such as crawling content, managing crawl schedules, and defining crawl rules. One of the main tasks assigned to the indexer is to crawl content sources and index the information that is stored there. A content source is simply a specification of the type of system and location to be crawled, along with at least one start address. In Office SharePoint Server 2007, you can specify up to 500 start addresses per content source, and a Shared Services Provider can define up to 500 content sources.
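
The limits described above can be illustrated with a minimal Python sketch. The ContentSource class, its field names, and the validate_provider function are hypothetical and are not part of the Office SharePoint Server 2007 object model; only the two limits of 500 and the requirement for at least one start address come from this article.

```python
from dataclasses import dataclass, field

MAX_START_ADDRESSES = 500   # per content source (limit stated in this article)
MAX_CONTENT_SOURCES = 500   # per Shared Services Provider (limit stated in this article)


@dataclass
class ContentSource:
    """Hypothetical model of a content source definition."""
    name: str
    source_type: str                      # for example "SharePoint", "Web", "FileShare"
    start_addresses: list = field(default_factory=list)

    def validate(self):
        """Raise ValueError if the start address rules are violated."""
        if not self.start_addresses:
            raise ValueError(f"{self.name}: at least one start address is required")
        if len(self.start_addresses) > MAX_START_ADDRESSES:
            raise ValueError(f"{self.name}: more than {MAX_START_ADDRESSES} start addresses")


def validate_provider(content_sources):
    """Check the per-provider limit, then every content source."""
    if len(content_sources) > MAX_CONTENT_SOURCES:
        raise ValueError(f"more than {MAX_CONTENT_SOURCES} content sources")
    for source in content_sources:
        source.validate()
    return True
```

For example, validate_provider([ContentSource("Intranet", "SharePoint", ["http://intranet.example"])]) returns True, whereas a content source with an empty list of start addresses raises ValueError.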

An indexer is characterized by the following requirements:

Processor. An indexer typically requires a large amount of processor power. Processor utilization for the indexing process will most likely be higher than for any other process that occurs in your farm. You should ensure that you have adequate processing power for your indexer; typically, you will require multiple multi-core processors.
Disk access. An indexer has two typical disk access patterns. While content is being crawled and indexed, the Index Server will exhibit write-intensive characteristics, and if indexes are propagated to Query Servers, then the disk will be read in small fragments. The disk-write operations are the most intensive, so you should optimize your disk configuration for write-access. To achieve this, the recommended disk configuration is physical RAID 10 (disk striping, with mirroring for fault-tolerance).
Memory. Indexing content is not typically a memory-intensive operation, although documents are read into memory for indexing. You can control how much memory is required for indexing purposes to some extent by controlling:
The maximum size of the documents to be indexed.
The degree to which documents are indexed in parallel.
Network. An Index Server makes use of the network primarily at indexing time. It connects to content sources and reads document contents over the network, and it propagates small index fragments to query servers during the indexing process (if propagation occurs).
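
As a rough illustration of the two memory controls listed under Memory above, the following Python sketch multiplies them to give an upper bound on the memory held by documents that are being indexed at the same time. The function name and the idea that these two values alone bound memory use are assumptions made for illustration; actual memory use on an indexer depends on many other factors.

```python
def peak_document_memory_mb(max_document_size_mb, parallel_documents):
    """Rough upper bound on memory held by documents indexed in parallel.

    Assumption: each in-flight document occupies at most its maximum size.
    """
    if max_document_size_mb < 0 or parallel_documents < 0:
        raise ValueError("inputs must be non-negative")
    return max_document_size_mb * parallel_documents
```

With illustrative values of a 16 MB maximum document size and 32 documents indexed in parallel, the bound is 512 MB. Halving either value halves the bound.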

The processor is most commonly the first bottleneck for an indexer, especially if sufficient processor power has not been provided. You should monitor processor utilization at indexing time to determine whether you need to add more processing power to an indexer. However, if you have provided ample processor power, the next bottleneck you are most likely to experience is network latency between the indexer and the content.

Query Server Role

The query server role runs queries over the full-text index. Query servers are managed at the farm level.

If you physically separate the query server role from the indexer onto one or more servers, then the full-text index is propagated from the indexer to all query servers in the farm. Propagation occurs continuously while content is being added to the index on the indexer and you are not required to configure or administer the propagation. Most queries will require results to be returned from the Query Server; the only exception is when the user has issued a query that only consists of a property filter. In that case, the query can be satisfied by the database server role alone.

A query server is characterized by the following requirements:

Processor. Apart from normal processor instructions for reading from disks and managing memory, a query server’s processor requirements vary depending on the size of the index being searched.
Disk access. A query server has two typical disk access patterns. While queries are being satisfied, a query server may exhibit read-intensive characteristics if the data it requires is not held in memory. If index propagation occurs from an Index Server, then the disk will be written to in small fragments and the query server must perform a master merge. Of these two patterns, the disk-read operations are the most intensive, so you should optimize your disk configuration for read-access. You can do this by implementing physical disk striping across multiple hard disks, each with its own physical controller. The recommended disk configuration is physical RAID 10 (disk striping, with mirroring for fault-tolerance).
Memory. Memory is the physical resource that a query server uses most intensively. A query server caches the results from recent queries, and removes cached data only when either of the following occurs:
It has run out of physical memory and new results must be read into the cache to satisfy a query.
Data in the cache have been invalidated because a crawl has discovered updated information.
Network. A query server makes use of the network primarily at query time. It receives search requests from Web servers, and sends results back over the network. Network resources are also used at indexing time when an Index Server propagates small index fragments to query servers.
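
The two cache removal conditions listed under Memory above can be sketched in Python as follows. This is a simplified model, not the actual query server implementation: the QueryResultCache class is hypothetical, capacity is counted in cached queries rather than in bytes of physical memory, and the choice to discard the least recently used entry first is an assumption.

```python
from collections import OrderedDict


class QueryResultCache:
    """Hypothetical cache that evicts on capacity pressure or crawl invalidation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # query text -> list of result URLs

    def get(self, query):
        """Return cached results, or None if the query is not cached."""
        if query not in self._entries:
            return None
        self._entries.move_to_end(query)        # mark as most recently used
        return self._entries[query]

    def put(self, query, results):
        """Cache new results, evicting old entries if capacity is exceeded."""
        self._entries[query] = list(results)
        self._entries.move_to_end(query)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # drop least recently used

    def invalidate(self, updated_url):
        """Drop every cached result set that includes a re-crawled item."""
        stale = [q for q, urls in self._entries.items() if updated_url in urls]
        for query in stale:
            del self._entries[query]
        return len(stale)
```

In this model, put corresponds to the first condition (new results must be read into a full cache) and invalidate corresponds to the second (a crawl has discovered updated information).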

Web Server Role

The Web server role responds to search queries for users and applications. The Web server collects query terms from the user, either through built-in Web Parts, custom Web Parts, or from custom applications. Based on the information collected, the Web server is responsible for formulating the specific query. Depending on the contents of the query, the Web server will contact query servers and the database server to retrieve the required results and access control lists. For example, if the query consists of only a property filter, the Web server needs only to contact the database server, and does not contact the query engine, whereas if the query contains keywords to be searched for, then both the query server and database server will be contacted.

When all of the results and access control lists have been returned to the Web server, it security-trims the results, based on the identity of the user who issued the query and the access control lists returned by the database server. After security trimming has taken place, the Web server presents the results, either by rendering them on Web pages or by returning the results to a calling application.
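
The routing and trimming behavior of the Web server role can be sketched in Python as follows. The function names are hypothetical, and the access control list model is deliberately simplified: each list is assumed to be a collection of group names, and an item with no stored access control list (for example, content crawled by a handler that does not retrieve them) is assumed to pass through untrimmed.

```python
def plan_query(keywords, property_filters):
    """Return which server roles the Web server would contact for a query."""
    if not keywords and not property_filters:
        raise ValueError("empty query")
    if keywords:
        return ["query server", "database server"]
    return ["database server"]          # property filter only


def security_trim(results, acls, user_groups):
    """Keep only the results that the user is allowed to see."""
    groups = set(user_groups)
    visible = []
    for url in results:
        acl = acls.get(url)
        # Assumption: no stored ACL means the item is not trimmed.
        if acl is None or groups & set(acl):
            visible.append(url)
    return visible
```

For example, plan_query([], {"Author": "Kim"}) returns only the database server, whereas plan_query(["budget"], {}) returns both the query server and the database server.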

Database Server Role

The database server role performs search-specific actions that apply at configuration, indexing, and query time. All of the administrative search configuration settings are stored by the database server; these settings include content source definitions, crawl rules, and scope definitions.

In addition to storing configuration data in the search configuration database, the database server also stores data that is retrieved from the crawl processes. Specifically, when managed property values and access control lists are retrieved from content sources, their values are stored in the search database. In addition, when a query is issued by a user, the Web server contacts the database server to retrieve managed property values and access control lists, based on the data returned from the query server.

Indexing Processes

The indexing process consists of the following general steps:

The indexer retrieves the start addresses of content sources.
The indexer invokes a protocol handler to connect to and traverse the content source.
The protocol handler identifies content nodes, such as files and Web pages.
The protocol handler retrieves system-level metadata and access control lists (if access control lists are available).
The protocol handler invokes the iFilter associated with the content node type.
The iFilter retrieves content and metadata from the content node.
Content and metadata are parsed by the word breaker and are added to the full-text index.
Metadata and access control lists are added to the search database.
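
The general steps above can be sketched as a small in-memory pipeline in Python. Everything in this sketch is a simplification for illustration: the IFILTERS registry, the node dictionary layout, the regular-expression word breaker, and the rule that a node type with no registered iFilter contributes only its system-level metadata are all assumptions, and the two dictionaries merely stand in for the full-text index and the search database.

```python
import re
from collections import defaultdict


def text_ifilter(raw):
    """Stand-in iFilter for plain text: returns (content, metadata)."""
    return raw, {}


IFILTERS = {".txt": text_ifilter}       # hypothetical registry keyed by extension


def word_breaker(text):
    """Naive word breaker: lower-cased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl(nodes):
    """Index nodes that a protocol handler has already identified.

    Each node is a dict with url, extension, raw, metadata, and acl keys.
    """
    full_text_index = defaultdict(set)  # term -> set of URLs
    search_database = {}                # URL -> metadata and ACL
    for node in nodes:
        ifilter = IFILTERS.get(node["extension"])
        # Assumption: with no iFilter registered, only system metadata is indexed.
        content, doc_metadata = ifilter(node["raw"]) if ifilter else ("", {})
        metadata = {**node["metadata"], **doc_metadata}
        # Content and metadata both pass through the word breaker.
        for value in [content, *map(str, metadata.values())]:
            for term in word_breaker(value):
                full_text_index[term].add(node["url"])
        # Metadata and access control lists go to the search database.
        search_database[node["url"]] = {"metadata": metadata, "acl": node["acl"]}
    return full_text_index, search_database
```

The separation between the two return values reflects the split described in this article: terms are stored in the file-based full-text index, while managed property values and access control lists are stored in the search database.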

Protocol Handlers

The crawl process requires protocol handlers to connect to content sources and iFilters to access the data stored within files that are located at the content source.

Protocol Handlers

In general, protocol handlers are responsible for:

Connecting to source systems over a given protocol, such as HTTP.
Traversing the source system.
Identifying content nodes, such as files or Web pages.
Invoking iFilters to read those content nodes.
Retrieving any system-level metadata, such as permissions, and default properties such as Title.
Returning streams of content and metadata to the indexing engine.
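
The responsibilities listed above can be expressed as an abstract interface. The following Python sketch is purely illustrative: actual protocol handlers are native components with a different programming interface, and the class and method names used here are hypothetical.

```python
from abc import ABC, abstractmethod


class ProtocolHandler(ABC):
    """Hypothetical interface mirroring the responsibilities listed above."""

    scheme = ""

    @abstractmethod
    def enumerate_nodes(self, start_address):
        """Connect, traverse, and return the URL of each content node."""

    @abstractmethod
    def system_metadata(self, url):
        """Return system-level metadata such as Title and permissions."""

    @abstractmethod
    def read(self, url):
        """Return the raw content of one node for the iFilter to parse."""

    def stream(self, start_address, ifilter):
        """Yield (url, content, metadata) tuples to the indexing engine."""
        for url in self.enumerate_nodes(start_address):
            yield url, ifilter(self.read(url)), self.system_metadata(url)


class InMemoryHandler(ProtocolHandler):
    """Toy handler over a dict, used only to exercise the interface."""

    scheme = "mem"

    def __init__(self, store):
        self.store = store              # url -> raw text

    def enumerate_nodes(self, start_address):
        return [u for u in sorted(self.store) if u.startswith(start_address)]

    def system_metadata(self, url):
        return {"Title": url.rsplit("/", 1)[-1]}

    def read(self, url):
        return self.store[url]
```

Keeping traversal (enumerate_nodes) separate from content parsing (the ifilter argument) reflects the division of labor described in this article: the protocol handler knows how to reach content nodes, and the iFilter knows how to read a specific file type.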

Protocol Handler Characteristics

The various protocol handlers necessarily exhibit different characteristics and behaviors because the corresponding source systems are very different:

Web protocol handler. The Web protocol handler makes HTTP requests of the start addresses in a content source. It then traverses Web sites by following hyperlinks on Web pages. The Web protocol handler does not retrieve access control lists, so any content that is indexed will not be security-trimmed at query time.
SharePoint protocol handler. The SharePoint protocol handler enumerates the content to index by invoking the SiteData.asmx Web service. The Web service returns a list that contains content nodes. The SharePoint protocol handler then makes HTTP requests for each node. Access control lists are also returned, so full security trimming can occur for SharePoint-based content.
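
The link-following traversal used by the Web protocol handler can be sketched in Python as a breadth-first walk. This is an illustration of the general technique only: the fetch argument is a hypothetical callable that returns the HTML for a URL (or None if the page cannot be retrieved), no network requests are made by the sketch itself, and a real crawl would additionally be bounded by crawl rules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_web(start_addresses, fetch):
    """Breadth-first traversal from the start addresses, following hyperlinks."""
    queue = deque(dict.fromkeys(start_addresses))   # de-duplicated, order kept
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue                                # unreachable page: skip
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)             # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

By contrast, a handler that enumerates content through a service call, as the SharePoint protocol handler does, receives the list of content nodes directly and does not need to discover them by parsing pages.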

Search Architecture in Microsoft Office SharePoint Server 2007 - TechNet Articles - United States (English) - TechNet Wiki

Table of Contents

Indexing and Search Architecture

Index Engine

Query Engine

Server Roles

Indexer Role

Query Server Role

Web Server Role

Database Server Role

Indexing Processes

Protocol Handlers

Protocol Handlers

Protocol Handler Characteristics

iFilters

Word Breakers and Stemmers

Word Breakers in the Indexing Process

Word Breakers at Query Time

Stemmers

Stemmers in the Indexing Process

Stemmers at Query Time

Enabling Language-Specific Stemmers

32-Bit and 64-Bit Architectures

Query Servers and Web Servers

Indexers

Index Management and Propagation

Master and Shadow Indexes

Continuous Propagation from Indexer to Query Servers

Query Processes