This article describes or explains the following:

Indexing and Search Architecture

A search solution based on Office SharePoint Server 2007 consists of two main components: the indexing engine and the query engine. The indexing engine crawls and indexes content that may be stored in a variety of formats and locations in your corpus, while the query engine provides the capability to search the indexed information.

Indexing Engine

The indexing engine provided by Office SharePoint Server 2007 is capable of crawling a variety of content sources, such as Web sites, SharePoint sites, Exchange Server Public Folders, line-of-business data, and file shares. The indexing engine retrieves the definitions of the content sources from the search configuration data. You have control over defining the content sources to be crawled.
The indexing engine provides the base logic for the indexing process, and loads specific components called protocol handlers to connect to and crawl the different types of content sources. Protocol handlers, in turn, load additional components called iFilters to read the contents of specific file types.
The indexing engine maintains a file-based index, which contains the indexed content. The indexing engine also maintains managed properties (in what is called the property store or search schema) and scope definitions in the search and configuration databases managed by SQL Server.
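
The two-level lookup described above (a protocol handler chosen for the content source, then an iFilter chosen for the file type) can be illustrated with a minimal Python sketch. This is not the SharePoint component model: the registries, the handler and filter names, and the function select_components are all hypothetical, and the sketch assumes that the handler is picked from the address scheme and the iFilter from the file extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical registries, for illustration only.
# Address scheme -> protocol handler name.
PROTOCOL_HANDLERS = {
    "http": "web handler",
    "https": "web handler",
    "file": "file share handler",
}
# File extension -> iFilter name.
IFILTERS = {
    ".txt": "plain text filter",
    ".html": "HTML filter",
    ".doc": "Word document filter",
}


def select_components(address):
    """Return (protocol handler, iFilter) for an address.

    Either element is None when nothing is registered for it.
    """
    parsed = urlparse(address)
    handler = PROTOCOL_HANDLERS.get(parsed.scheme.lower())
    extension = PurePosixPath(parsed.path).suffix.lower()
    ifilter = IFILTERS.get(extension)
    return handler, ifilter
```

For example, select_components("file://server/share/report.doc") selects the file share handler and the Word document filter, while an address with an unregistered scheme and extension yields no components at all.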

Query Engine

Put simply, a Web server initiates each query by collecting terms from the user, and then contacts the query engine to search the full-text index for items that contain the searched-for terms. The results are supplemented with keywords, best-bets, and managed properties from the search configuration database, managed by SQL Server. If the query consists of only a property filter, the Web server needs only to contact the database server, and does not contact the query engine.
Queries are initiated through the Search object model or the Search Web service on Web servers.
Additional components, such as word breakers and stemmers, are used throughout the entire process. These components will be discussed in detail later in this paper.
You can physically separate the indexing engine from the query engine by implementing specific roles in your server farm. You can also physically separate the indexing and query engines from the Web servers that expose the search object model and the search Web service.
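
The routing rule described above, where a query that consists of only a property filter goes to the database server alone, can be sketched in a few lines of Python. The Query class and the function servers_to_contact are hypothetical names used for illustration; they are not part of the Search object model. Returning an empty list for an empty query is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical query shape: free-text keywords plus property filters."""
    keywords: list = field(default_factory=list)
    property_filters: dict = field(default_factory=dict)


def servers_to_contact(query):
    """Return the components the Web server contacts for this query."""
    if not query.keywords and not query.property_filters:
        return []  # assumption: an empty query contacts nothing
    if not query.keywords:
        # Property filter only: the database server can answer alone.
        return ["database server"]
    # Keywords present: search the full-text index, then fetch
    # managed properties from the database.
    return ["query engine", "database server"]
```

A query such as Query(property_filters={"Author": "kim"}) is routed to the database server only, whereas Query(keywords=["sales"]) involves both the query engine and the database server.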

Server Roles

When you create solutions with SharePoint technologies, you must be aware of the roles that servers can take within your server farm. For search, the roles that you should be aware of are the indexer role, the query server role, the Web server role, and the database server role.

Indexer Role

The indexer role provides the indexing services, such as crawling content, managing crawl schedules, and defining crawl rules. One of the main tasks assigned to the indexer is to crawl content sources and index the information that is stored there. A content source is simply a specification of the type of system and location to be crawled, along with at least one start address. In Office SharePoint Server 2007, you can specify up to 500 start addresses per content source, and a Shared Services Provider can define up to 500 content sources.
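
The content source rules above (at least one start address, up to 500 start addresses per content source, and up to 500 content sources per provider) can be expressed as a small validation sketch in Python. The class and field names are hypothetical and do not correspond to the SharePoint administration object model; only the two limits and the minimum of one start address come from the text.

```python
from dataclasses import dataclass, field

MAX_START_ADDRESSES = 500  # per content source, as stated in the text
MAX_CONTENT_SOURCES = 500  # per provider, as stated in the text


@dataclass
class ContentSource:
    """Hypothetical model of a content source definition."""
    name: str
    source_type: str
    start_addresses: list

    def __post_init__(self):
        if not self.start_addresses:
            raise ValueError("a content source needs at least one start address")
        if len(self.start_addresses) > MAX_START_ADDRESSES:
            raise ValueError("too many start addresses for one content source")


@dataclass
class SharedServicesProvider:
    """Hypothetical container that enforces the content source limit."""
    content_sources: list = field(default_factory=list)

    def add(self, source):
        if len(self.content_sources) >= MAX_CONTENT_SOURCES:
            raise ValueError("content source limit reached")
        self.content_sources.append(source)
```

Checking the limits when a definition is created, rather than at crawl time, keeps invalid definitions out of the configuration in this sketch.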
An indexer is characterized by the following requirements:
The processor is most commonly the first bottleneck for an indexer, especially if it has been under-provisioned. Monitor processor utilization at indexing time to determine whether you need to add more processing power to the indexer. However, if you have provided ample processor power, the problem you are most likely to experience instead is network latency between the indexer and the content.

Query Server Role

The query server role runs queries over the full-text index. Query servers are managed at the farm level.
If you physically separate the query server role from the indexer onto one or more servers, then the full-text index is propagated from the indexer to all query servers in the farm. Propagation occurs continuously while content is being added to the index on the indexer, and you are not required to configure or administer the propagation. Most queries require results to be returned from the query server; the only exception is when the user has issued a query that consists only of a property filter. In that case, the query can be satisfied by the database server role alone.
A query server is characterized by the following requirements:

  • [...] been invalidated because a crawl has discovered updated information.
  • Network. A query server makes use of the network primarily at query time. It receives search requests from Web servers, and sends results back over the network. Network resources are also used at indexing time when an index server propagates small index fragments to query servers.

Web Server Role

The Web server role responds to search queries for users and applications. The Web server collects query terms from the user, either through built-in Web Parts, custom Web Parts, or from custom applications. Based on the information collected, the Web server is responsible for formulating the specific query. Depending on the contents of the query, the Web server will contact query servers and the database server to retrieve the required results and access control lists. For example, if the query consists of only a property filter, the Web server needs only to contact the database server, and does not contact the query engine, whereas if the query contains keywords to be searched for, then both the query server and database server will be contacted.
When all of the results and access control lists have been returned to the Web server, it security trims the results, based on the identity of the user who issued the query and the access control lists returned by the database server. After security trimming has taken place, the Web server presents the results, either by rendering them on Web pages or by returning the results to a calling application.
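
Security trimming as described above can be illustrated with a simplified Python sketch. It assumes that each access control list is just a set of principals allowed to read the item and that the user's identity is a set of principals (the user plus group memberships). Real access control lists are richer than this, and the function name security_trim is hypothetical.

```python
def security_trim(results, acls, user_principals):
    """Keep only the results that the user is allowed to read.

    results: ordered list of item addresses returned by the query.
    acls: mapping of address -> set of principals with read access.
    user_principals: set of principals that identify the user.
    """
    trimmed = []
    for address in results:
        # Assumption: an item with no recorded ACL is treated as unreadable.
        allowed = acls.get(address, set())
        if allowed & user_principals:
            trimmed.append(address)
    return trimmed
```

The sketch preserves the original result order, so any ranking applied before trimming is retained.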

Database Server Role

The database server role performs search-specific actions that apply at configuration, indexing, and query time. All of the administrative search configuration settings are stored by the database server; these settings include content source definitions, crawl rules, and scope definitions.
In addition to storing configuration data in the search configuration database, the database server also stores data that is retrieved from the crawl processes. Specifically, when managed property values and access control lists are retrieved from content sources, their values are stored in the search database. In addition, when a query is issued by a user, the Web server contacts the database server to retrieve managed property values and access control lists, based on the data returned from the query server.

Indexing Processes

The indexing process consists of the following general steps:
1. The indexer retrieves the start addresses of content sources.
2. The indexer invokes a protocol handler to connect to and traverse the content source.
3. The protocol handler identifies content nodes, such as files and Web pages.
4. The protocol handler retrieves system-level metadata and access control lists (if access control lists are available).
5. The protocol handler invokes the iFilter associated with the content node type.
6. The iFilter retrieves content and metadata from the content node.
7. Content and metadata are parsed by the word breaker and are added to the full-text index.
8. Metadata and access control lists are added to the search database.
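
The eight steps above can be traced in a toy, in-memory Python sketch. Everything here is an assumption made for illustration: the content is a hard-coded dictionary, the protocol handler and iFilter are plain functions, the word breaker simply splits on non-word characters, and the full-text index and search database are ordinary dictionaries. None of the names correspond to actual SharePoint components.

```python
import re
from collections import defaultdict

# Hypothetical content source: address -> (raw content, metadata, ACL).
CONTENT = {
    "file://share/a.txt": ("Quarterly sales report", {"author": "kim"}, {"sales"}),
    "file://share/b.txt": ("Sales forecast draft", {"author": "lee"}, {"sales", "finance"}),
}


def protocol_handler(start_address):
    """Steps 2-4: traverse the source and yield each node with metadata and ACL."""
    for address, (raw, metadata, acl) in CONTENT.items():
        if address.startswith(start_address):
            yield address, raw, metadata, acl


def text_ifilter(raw, metadata):
    """Steps 5-6: extract content and metadata (plain text passes through)."""
    return raw, dict(metadata)


def word_breaker(text):
    """Step 7: split text into lowercase terms."""
    return re.findall(r"\w+", text.lower())


def crawl(start_addresses):
    full_text_index = defaultdict(set)  # term -> addresses containing it
    search_database = {}                # address -> properties and ACL
    for start in start_addresses:                                     # step 1
        for address, raw, metadata, acl in protocol_handler(start):   # steps 2-4
            text, properties = text_ifilter(raw, metadata)            # steps 5-6
            searchable = " ".join([text, *properties.values()])
            for term in word_breaker(searchable):                     # step 7
                full_text_index[term].add(address)
            search_database[address] = {                              # step 8
                "properties": properties,
                "acl": set(acl),
            }
    return dict(full_text_index), search_database
```

Calling crawl(["file://share/"]) returns both stores: the index maps each term to the addresses that contain it, and the database holds the properties and access control list for each address, mirroring the split between the file-based index and the search database.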

Protocol Handlers

The crawl process requires protocol handlers to connect to content sources and iFilters to access the data stored within files that are located at the content source.

Protocol Handlers

In general, protocol handlers are responsible for:

Protocol Handler Characteristics

The various protocol handlers necessarily exhibit different characteristics and behaviors because the corresponding source systems are very different:
