Tuning Spatial Point Data Queries in SQL Server 2012
Written by: Ed Katibah, Milan Stojic, Michael Rys, Nicholas Dritsas
Technical reviewers: Chuck Heinzelman
Introduction
Spatial Point Data queries require particular tuning efforts to enhance performance and improve overall application through put. SQL Server 2012 introduces a few key new features and improvements to assist you in that goal.
Below, we go over a few key best practices and suggestions on how to achieve that.
Optimize the primary key clustered index
To create an index on a spatial column, the table must provide a primary key. SQL Azure extends this requirement by requiring that each table have a clustered primary key.
Spatial indexing basics
Spatial query execution with an index contains two parts: primary filter (spatial index lookup) and secondary filter (original spatial predicate). So with the queries like “STDistance() < x”, SQL Server will:
- identify a candidate set of cells
- run an index seek over the spatial index (primary filter)
- join candidate values with the base table to get the actual spatial objects
- filter any false positives by performing the spatial operation (secondary filter)
The secondary filter can be avoided by using Filter(@)=1 operation instead of STIntersects(@x)=1. The query performance can be much better but false positives are possible in the result. If the spatial query is not too large, for HHHH spatial index, tolerance will be around 100-200m when using the Geography data type.
STDistance(@x) < @range operator can be replaced with Filter(@x.BufferWithTolerance(@range,1,1)) =
If returning some false positives is not an option, IO performance of the base table can still be optimized.
Any spatial index is already optimized to minimize IO. Cells are places in a specific pattern so that spatially closer cells sit physically close to each other. The following picture illustrates the cell pattern used (a Hilbert space-filling curve is used in SQL Server).
Therefore, the reading pattern for an index seek over spatial index might look like:
However, as the join to the base table is needed to get actual spatial objects, the IO pattern for the clustered index seek (the base table) might look like:
This IO pattern will result in reading many more pages from the disk and will have high overall impact on spatial query performance.
So, in order to get better IO performance for the clustered index seek, the Primary Key needs to be correlated with spatial index cell pattern. This can be achieved by adding point coordinates components to the clustered index.
For our sample data, here is how the original table containing the point column was specified:
CREATE
TABLE [dbo].[Points](
[id] [int] NOT
NULL,
[type] For our sample data, here is how the original table containing the point column was specified:
CREATE
TABLE [dbo].[Points](
[int] NOT
NULL,
[geo] [geography]
NULL
PRIMARY
KEY CLUSTERED
(
[id] ASC,
[type] ASC
));
Spatial indexes reference the primary key to access the base table rows. Query performance can often be significantly enhanced by creating a clustered index on the primary key which incorporates the individual ordinates of the underlying point coordinate. This technique allows the primary key to be “spatially” ordered thereby minimizing the number of disk seeks required to fetch the rows of base table data requested by the spatial index.
On occasion, the individual point coordinate ordinates can be found in their own columns in the base table along with the column of points (spatial object). When this is the case, the individual point ordinates can be referenced directly in the clustered primary key DDL. When this is not the case, the table schema can be expanded to include two persisted computed columns which contain the individual ordinates of the underlying point coordinates.
It is important to note that the new point ordinate columns must be the first two columns referenced in the clustered primary key. For geometry coordinates, they should be in the x, y order. For geography data they should be in latitude, longitude order.
For the sample data table, it is necessary to expand the schema as follows, creating two new computed persisted columns using spatial methods and redefining the clustered primary key index with these two new columns (note the use of the built-in Lat and Long methods to extract the individual ordinates):
CREATE
TABLE [dbo].[Points](
[id] [int]
NOT NULL,
[type] [int]
NOT NULL,
[geog] [geography] NOT
NULL,
[geo_lat] AS ([geog].[Lat]) PERSISTED
NOT NULL, -- *
[geo_lon] AS ([geog].[Long])
PERSISTED NOT
NULL, -- *
PRIMARY
KEY CLUSTERED
(
[geo_lat] ASC,
[geo_lon] ASC,
[id]
ASC,
[type]
ASC
));
* not null due to use in primary key
NOTE: Persistence of the x,y / lat,long points for the spatial object as the first columns in the clustered key adds 16 bytes to the key, which is then duplicated in both the base table as well as the spatial index itself, as well as every non-clustered index added to the table. The regular guidance for clustered index keys is to keep them as small as possible. Even though this could potentially improve spatial index performance, users should be aware of the potential negative consequences.
Optimize the spatial index
The default spatial index has traditionally been recommended with grids all set to MEDIUM based on a general spatial workload. This was original index definition:
CREATE
SPATIAL INDEX [table_geog_sidx]
ON [dbo].[table]([geog])
USING GEOGRAPHY_GRID
WITH
(GRIDS
=(LEVEL_1 = MEDIUM,
LEVEL_2
= MEDIUM,
WITH
(GRIDS
=(LEVEL_1 = MEDIUM,
LEVEL_2
= MEDIUM,
LEVEL_3
= MEDIUM,
LEVEL_4
= MEDIUM),
CELLS_PER_OBJECT
= 16;
Spatial query performance can often be improved significantly by selecting spatial index options which differ from default settings for specific spatial shapes and distributions. In the case of point data, it has been
found in most, if not in all, cases that spatial indexes with all grid levels set to HIGH outperform other configurations. Since we are dealing with point data, the CELLS_PER_OBJECT setting is irrelevant and can be set to any legal value (1-8192) without
effect. Here is the spatial index reconfigured to grid levels with the optimal setting for points, all HIGH:
CREATE
SPATIAL INDEX [table_geog_sidx]
ON [dbo].[table]([geog])
USING GEOGRAPHY_GRID
WITH
(GRIDS
=(LEVEL_1 =
HIGH,
LEVEL_2
= HIGH,
LEVEL_3
= HIGH,
LEVEL_4
= HIGH),
CELLS_PER_OBJECT
= 16;
Consider using a spherical Earth model if using geography data type
If you are using SQL Azure or SQL Server 2012 and your accuracy requirements for the spatial results do not require that an ellipsoidal model be used, additional query performance can potentially be gleaned by using a spherical model as the basis for the underlying spatial coordinate system due to the simpler mathematics involved in calculations. Additionally, the STDistance(), STLength() and ShortestLineTo() methods are optimized to run faster on a sphere than on an ellipsoid.
In SQL Server 2012 and SQL Azure, there is a new entry for a unit sphere in the sys.spatial_reference_systems view. This new entry has a spatial reference identifier (SRID) = 104001 and is listed as a unit sphere.
The native units of measure output when using the unit sphere are radians. To compute measures in real world values, such as length and area in meters, simply multiply results by the radius of the desired output sphere as follows:
- For linear measure (length, distance, buffer distance): length * (sphere radius)
- For area measure: area * (sphere radius) * (sphere radius)
In the case of our sample data, the original coordinate system was specified as World Geodetic Reference System of 1984 (WGS84). In SQL Server and SQL Azure, this is identified as spatial reference identifier (SRID) 4326. This is the most common coordinate system for use with the geography data type (most commercial spatial data uses these coordinate systems, as do GPS receivers). For many web mapping programs which reference WGS84 (Bing Maps, Google Maps, etc.), a radius of 6,378,137 meters is used for their underlying spherical spatial reference system (“Spherical Mercator”).
To update the SRID for each point object to the unit sphere reference system in the sample spatial table, the following T-SQL can be used (note the use of the STSrid method):
UPDATE Points
SET geog.STSrid
= 104001
To update the SRID for each point object to the unit sphere reference system in the sample spatial table, the following T-SQL can be used (note the use of the
STSrid method):
UPDATE Points
SETPerformance testing
To test these tuning recommendations, a customer-driven scenario was used along with their data. To provide an idea of the scope of the query, the following stored procedure is provided as an example:
/****** Object: StoredProcedure [dbo].[TEST] Script Date: 12/13/2011 3:14:50 PM ******/
SET
ANSI_NULLS ON
GO
SET
QUOTED_IDENTIFIER ON
GO
ALTER
PROCEDURE [dbo].[TEST]
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
--set statistics time on
--set statistics IO on
SET NOCOUNT ON;
DECLARE @Location geography,
@continent
integer
SELECT @Location
= CI.city_center_coordinates,
@continent
= CO.continent_id
FROM dbo.geo2_city
AS CI
INNER
JOIN dbo.geo2_country
AS CO ON CO.country_id
= CI.country_id
WHERE CI.city_id
= 9395
AND CI.rec_status
> -1
AND CO.rec_status
> -1
SET @location = geography::Point(@location.Lat, @location.Long,
4326)
SELECT GI.geo_id, GI.geo_info.STDistance(@Location)
FROM
db_core.dbo.geo_informations2
AS GI with(nolock,
INDEX(SIndx_geography_informations2))
INNER JOIN dbo.product_hotels
(nolock)
AS PH
ON (PH.hotel_id
= GI.geo_id)
WHERE GI.geo_type
= 7
AND PH.rec_status
= 1
AND PH.region_id
= 2
AND GI.geo_info.STDistance(@Location)
< 10000
END
For discussion clarity, the following T-SQL, pre-optimization, captures the core spatial query used in the stored procedure (the SQL Server spatial methods are highlighted). This corresponds to the T-SQL illustrated in the light blue box, above.
SELECT p.geo_id, p.geo_info.STDistance(@Location)
FROM Points
AS p WITH(nolock,
INDEX(SIndx_geography_informations)) –- original, non-optimized index
WHERE GI.geo_info.STDistance(@Location)
< 10000 -– 10KM (10,000 meters)
When the coordinate system was changed from SRID=4326 to SRID=104001, it was necessary to rewrite the query as follows to accommodate the coordinate system units change and assure that the new, optimized spatial index was used:
SELECT GI.geo_id, GI.geo_info.STDistance(@Location)
* 6378137
FROM Points_UnitSphere AS GI
WITH(nolock,
INDEX(geo_info_HHHH_sidx)) –- optimized index
WHERE GI.geo_info.STDistance(@Location)
* 6378137 <
10000 – 10KM
Adding the multiplication in the WHERE clause causes the spatial index to be ignored in the query plan (in fact, the query compiler complains about the use of the spatial index hint, when written in this fashion). To alleviate this issue, the query can be rewritten as follows:
SELECT p.id, p.info.STDistance(@Location)
* 6378137
FROM Points_UnitSphere AS p
WITH(nolock,
INDEX(geo_info_HHHH_sidx))
WHERE GI.geo_info.STDistance(@Location) <
10000/6378137
-- 10KM
Quick summary of the potential techniques to improve spatial point data performance:
- For point data, create an “all HIGH” manual grid spatial index
- Use the Filter operation to avoid secondary filter, or
- Create a clustered primary key on point table using point coordinate components
- For the Geography data type, using a spherical Earth model instead of ellipsoidal model
- Use the spatial_window_max_cells query hint to fine tune query performance
Prior to optimization, use of the SQL Server STDistance method in the SELECT and WHERE clauses returned an average performance of 30 milliseconds.
After optimization (new table with point clustered primary key, new point index and use of the sphere-optimized methods), use of the SQL Server STDistance method in the SELECT and WHERE clauses returned an average performance of 18 milliseconds.
Description |
Execution Time in ms (average*) |
Original Spatial Query |
30 |
Optimized Spatial Query |
18 |
* each query for this test was executed 1,000 times for determination of the average time