Common programming Concepts: June 2012

Thursday, 14 June 2012

What is XML DataType in SQL DataType ?

XML has been widely adopted as a platform-independent mechanism for representing data, and is also commonly used to exchange data between disparate and loosely-coupled systems, such as B2B applications and workflow solutions. More recently, XML has been used to represent semi-structured (as well as unstructured) data such as documents and emails. If information in these models has to be queried, then XML is probably the simplest way to represent such information.

For example, if documents are represented in XML, it is very easy to write a DOM or XPATH query to extract the contents of a section titled, for example, "Recipe". Many applications also require other pieces of information that typically reside in a database. For example, consider an application that allows a user to personalize the look and feel of an application. It is not uncommon to store the preferences of the user against the user record in the database. XML lends itself well to storing such unstructured information to cope with an ever expanding list of user preferences.

When we store this kind of semi-structured or unstructured information as XML in the database, it is not always feasible to extract the content to the application tier and then process the XML using XML parsers. If the database platform is able to provide native XML processing capabilities then that is a massive advantage. In that way we can also take advantage of other database capabilities such as query optimization, indexing, backups and, most importantly of all, interoperability with relational data.

The XML Data Type

The XML data type is a built-in data type similar to other built-in types in SQL Server 2005. You can define XML columns, XML variables, XML parameters for procedures and functions, and can also return XML as return values from functions. Internally, the XML data type is stored in a proprietary binary format. The maximum permissible size of the XML content is 2GB.

Defining XML Variables and Columns

To declare a variable of type XML is as simple as this:

DECLARE @xmlDoc XML

In order to create an XML column (using untyped XML – more on this later), we could use the following:

CREATE TABLE rssFeeds
(
  feedName VARCHAR(20) NOT NULL PRIMARY KEY,
  generatedDate DATETIME DEFAULT GETDATE(),
  feedXML XML
)

Loading Data into XML Columns and Variables

Having defined an XML variable or column, we can store information in it via the following methods:

Assigning a string to the XML variable / column
Using the output of the SELECT…FOR XML clause
Using bulkload

Simple String Assignment

We can load a string into our @xmlDoc variable as follows:

SET @xmlDoc = '<contact><name>Bob Smith</name></contact>'
SELECT @xmlDoc

Note that string content is implicitly converted to the XML type. You can also explicitly convert the string to XML as shown:

SET @xmlDoc = CAST ('<contact><name>Bob Smith</name></contact>' AS XML)
SELECT @xmlDoc

In each example, we then select this XML content. If you run this example in SQL Server Management Studio with the "results to grid" option enabled, the XML output will be as shown below:

Notice that the XML content is hyperlinked and if you click this link, a new window opens and shows the entire XML document. This is a great improvement over SQL Server 2000 Query Analyzer, which would show the results of XML queries as a single line of output and also truncate the results if the result size is greater than 8192 bytes.

We can insert regular string content into the feedXML column of our rssFeeds table using the following method:

INSERT INTO rssFeeds (feedName, feedXML) VALUES
  (‘MSDN’, ‘<feedXML/>’)

Using FOR XML

In the following example, we initialize the @xmlDOC variable using a SELECT…FOR XML statement:

USE AdventureWorks
GO
DECLARE @xmlDoc XML
SET @xmlDoc =
  (
  SELECT name, salesYTD FROM sales.salesterritory
  FOR XML AUTO, ELEMENTS
  )
SELECT @xmlDoc

The output of the above command is as follows (only partial results are shown. To see the full results, you need to click on the hyperlink in Management Studio):

Streaming results from FOR XML statements directly into XML variables is a great improvement over SQL Server 2000 (which did not allow this). Also note that the above example returns an XML fragment and not an XML document. If you need to return an XML document, you can use the new ROOT directive in the FOR XML command as shown:

USE AdventureWorks
GO
DECLARE @xmlDoc XML
SET @xmlDoc =
  (
  SELECT name, salesYTD FROM sales.salesterritory
  FOR XML AUTO, ROOT ('territories'), ELEMENTS
  )
SELECT @xmlDoc

Executing the above command produces the following output (again, only partial results are shown):

Bulkloading XML data using the OPENROWSET Function

SQL Server 2005 provides extensions to the OPENROWSET function to bulk load an XML document from the file system. The following example shows how to bulkload our @xmlDoc variable using this technique:

DECLARE @xmlDoc XML
SET @xmlDoc = (
  SELECT * FROM OPENROWSET (
    BULK 'C:\Temp\rss.xml', SINGLE_CLOB
  ) AS xmlData
)
SELECT @xmlDoc

The RSS.XML file that is referred to in the above query is a sample RSS XML that I generated from the MSDN site. To do this, you can query the MSDN RSS site at: http://msdn.microsoft.com/rss.xml and then save results as XML from Internet Explorer.

The output of the above command is as follows:

Similarly, the following code uses the OPENROWSET command to load an XML document into our feedXMLcolumn:

GO
INSERT INTO rssFeeds (feedName, feedXML)
SELECT 'MSDNRSS', xmlData FROM
  (
  SELECT * FROM OPENROWSET
  (BULK 'C:\Temp\rss.xml', SINGLE_CLOB)
  AS xmlData
  ) AS feed (xmlData)
GO
SELECT * FROM rssFeeds

The following output is generated:

Typing XML

Up to this point, the XML content that we have used is referred to as untyped XML. An untyped XML variable/column is one that is not associated with a schema. If we associate a schema with an XML type, it is then referred to astyped XML. Once an XML variable/column is typed using a schema, any data that is inserted into it is automatically validated against the schema, and SQL Server 2005 raises an error if the data does not conform to the defined schema rules.

It is usually recommended to associate schemas to XML content in SQL Server, since query optimization can use the information and also it provides better operational semantics. For example, we can perform arithmetic operations on numbers, but not strings. If no schema is attached to an XML, then all values are considered to be strings. MSDN provides further information about schemas and their maintenance.

Creating typed XML involves two basic steps: 1) Create an XML schema and 2) Register it in the database.

Creating a Schema

Using tools such as Visual Studio 2005 and XML Spy, creating a schema is pretty simple. The following link contains some excellent tutorials on this topic: Creating Schemas in Visual Studio. Using the tool of your choice, create the following schema definition:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema
   id="Contacts"
   targetNamespace="http://www.srisamp.com/contacts.xsd"
   elementFormDefault="qualified"
   xmlns="http://www.srisamp.com/contacts.xsd"
   xmlns:mstns="http://www.srisamp.com/contacts.xsd"
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:complexType name="contact">
        <xs:sequence>
            <xs:element name="name" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
    <xs:element name="contacts">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="contact" type="contact" />
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

This schema defines an XML document of the form:

<contacts><contact><name></name></contact></contacts>

Save this schema in a file called ContactsSchema.xsd in the C:\Temp folder.

Registering a Schema in SQL Server 2005

You register a schema in SQL Server 2005 using the CREATE XML SCHEMA COLLECTION command. The following script loads our XML schema into the database.

IF EXISTS (SELECT * FROM sys.xml_schema_collections
      WHERE [name] = 'MyContacts')
   DROP XML SCHEMA COLLECTION MyContacts
GO
DECLARE @contactsSchema XML
SET @contactsSchema =
   (
      SELECT * FROM OPENROWSET
         (BULK 'C:\Temp\ContactsSchema.xsd', SINGLE_CLOB)
            AS xmlData
   )
CREATE XML SCHEMA COLLECTION MyContacts AS @contactsSchema
GO

In the following code, we type an XML variable to the MyContacts schema and then assign both valid and invalid content to the XML (by invalid, we mean that the content does not adhere to the schema):

DECLARE @xmlDoc XML (MyContacts)
BEGIN TRY
   -- This entry conforms to our schema rules
   SET @xmlDoc = '
      <contacts xmlns="http://www.srisamp.com/contacts.xsd">
         <contact>
            <name>Bob Smith</name>
         </contact>
      </contacts>
   '
   SELECT @xmlDoc
 
   -- This one does not conform to the schema
   SET @xmlDoc = '
      <contacts xmlns="http://www.srisamp.com/contacts.xsd">
         <contact>
            <names>Bob Smith</names>
         </contact>
      </contacts>
   '
   SELECT @xmlDoc
END TRY
BEGIN CATCH
   PRINT 'Error at line '+CAST(error_line() AS VARCHAR)+'. Message: '+error_message()
END CATCH

Note how we have declared the XML variable. The schema reference is specified in parentheses after the variable. We have also used the new TRY…CATCH error handling syntax of SQL Server 2005 to capture the error and display the corresponding message.

What is covering index ?

The term covering index does not mean a separate kind of index having a different internal structure. Rather, this term is used to describe a certain technique that is used to improve performance. It is discussed here because, as will be shown later, for most practical purposes, a clustering index can be thought of as a covering index.
Let's consider a simple but typical example. Suppose that we need to join two tables—Master, defined as (MasterId int primary key, Data int), and Detail, defined as (DetailId int primary key, Data int, MasterId int)—and select columns (MasterID, Detail.Data) where Master.Data satisfies a certain condition. The condition can be anything that is built directly on the value of Master.Data—for example, Data = @data, Data between @start and @end, or Data in (1, 2, 3). Let's also assume that, besides primary key indexes, the Master table has an index idxMaster built on (Data, MasterID), and the Detail table has an indexidxDetail built on (MasterID). The query plan will look as shown in Figure

What does that mean, exactly? Let's analyze it step by step:

The query engine searches idxMaster, to find the entries that satisfy our condition for Master.Data, and it builds a list of corresponding Master.MasterIDvalues. This is fast, because:
- Index pages are sorted by Master.Data, so the engine can jump directly to leaf index pages that satisfy our condition.
- Those leaf index pages already contain values of Master.MasterId, so there is no need to jump somewhere else to read them.
The engine selects entries from idxDetail having values of Detail.MasterID that match the list of Master.MasterID built during Step 1. This can be done by simply looping through the list of Master.MasterID and looking up each corresponding entry in idxDetail, or by using more complicated hash or mergealgorithms. The query engine determines which algorithm to use in each particular case, based on the relative cost of each algorithm. This, in turn, depends on the following criteria:
- Selectivity of the index
- Estimated number of lookups
- Estimated number of rows to find.
For the low relative values of the second and third criteria, the loop algorithm is preferred, whereas for higher values, the hash/merge algorithms become more efficient. The output of this step is a list of (MasterID, Detail:FID:PN:RN) values (see Figure 1).
A list of heap pointers (FID:PN:RN), also known as bookmarks, is used to read randomly located heap data pages of the Detail table, and to retrieve values of the Data field in order to compile the requested result set (MasterID, Detail.Data). This step is called bookmark lookup, and, in our case, it is the most expensive one. The cost of bookmark lookup can be anywhere from 90 to 99 percent of the total query cost. This means that eliminating this step can improve the performance ten-fold or even one-hundred-fold. The cost of bookmark lookup depends on the fragmentation of data pages, the number of data pages to read, that availability of read-ahead optimization, the IO-to-CPU cost ratio, and other factors.

Please note that, in this example, bookmark lookup has only been performed on the Detail table, not on the Master table. This is because the required output fieldMasterID was a part of the index. This leads us to the definition of covering index: A covering index is the index that contains all output fields required by the operation performed on that index. Please note that if the Master table contained more fields, and at least one of these fields was present in the SELECT clause, the query engine would have to perform bookmark lookup on the Master table, and therefore idxMaster could not serve as a covering index anymore. That is why it is not possible to tell whether the index is covering without taking into consideration the query it is used in, and even the particular execution plan.

It is very easy now to devise a way to improve the performance of our query. Adding Detail.Data as a second field to idxDetail eliminates the need for bookmark lookup and considerably improves the performance.

Good candidates for covering indexes are junction tables (tables that exist to resolve many-to-many relationships). In the simplest case of a many-to-many relationship between T1 and T2, it is enough to have two indexes, (T1_Id, T2_Id) and (T2_Id, T1_Id), in order to save the query engine from ever doing bookmark lookups on this table.

What is GETDATE() ,SYSDATETIME(), GETUTCDATE(), SYSUTCDATETIME() ?

GETDATE() : Its precision is in milliseconds.
SYSDATETIME() : Its precision is in nanseconds.
GETUTCDATETIME() : Returns data as UTC time ( Coordinated Universal Time ) and its precision is in milliseconds.
SYSUTCDATETIME() : Returns data as UTC time ( Coordinated Universal Time ) and its precision is in nanoseconds..

What is Bookmark Lookups in SQl Sever ?

One of the major overheads associated with the use of non-clustered indexes is the cost of bookmark lookups. Bookmark lookups are a mechanism to navigate from a non-clustered index row to the actual data row in the base table (clustered index) and can be very expensive when dealing with large number of rows.

When a small number of rows are requested by a query, the SQL Server optimizer will try to use a non-clustered index on the column or columns contained in the WHERE clause to retrieve the data requested by the query. If the query requests data from columns not contained in the non-clustered index, SQL Server must go back to the data pages to obtain the data in those columns. It doesn’t matter if the table contains a clustered index or not, the query will still have to return to the table or clustered index to retrieve the data.

Bookmark lookups require data page access in addition to the index page access needed to filter the table data rows. Because this requires the query to access two sets of pages instead of only one, the number of logical READS performed by the query will increase. If the data pages are not in the buffer cache, additional I/O operations will be required. And in the case of most large tables, the index page and the corresponding data pages are not usually located close to each other on the disk.

These additional requirements for logical READS and physical I/O can cause bookmark lookups to become quite costly. While this cost may be acceptable in the case of small result sets, this cost becomes increasingly prohibitive in the case of larger and larger result sets. In fact, as the result sets become larger and larger, the optimizer may consider the costs of the bookmark lookups to be too much and discard the non-clustered index and simply perform a table scan instead.

Example of a Bookmark Lookup
SET STATISTICS PROFILE ON
GO

USE pubs
GO

--Find phone number for White, Johnson
SELECT phone
FROM dbo.authors
WHERE au_lname = 'White'
AND au_fname = 'Johnson'
GO

Execution Plan (abridged)
Rows Executes StmtText
----------- ----------- ----------------------------------------------------------------------------------
1 1 SELECT [phone]=[phone] FROM [dbo].[authors]
1 1 |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([pubs].[dbo].[authors]))
1 1 |--Index Seek(OBJECT:([pubs].[dbo].[authors].[aunmind

Because both the au_lname and the au_fname are contained in a non-clustered index, the optimizer can use the non-clustered index to filter the rows contained in the table to return only the phone number requests. However, because the phone column in the authors table is not contained in the index or another non-clustered index, the optimizer must return to the authors table in order to return the matching phone number creating a bookmark lookup.

Finding the offending column(s)
In order to resolve the bookmark lookup, you must find the column or columns that cause the bookmark lookup. To find offending columns look for the index usage in the execution plan to find what index is utilized by the optimizer for the query.

Execution Plan (abridged)
StmtText
----------- ----------- ----------------------------------------------------------------------------------
|--Index Seek(OBJECT:([pubs].[dbo].[authors].[aunmind]), SEEK:([authors].[au_lname]='White' AND [authors].[au_fname]='Johnson') ORDERED FORWARD)

In this case we see that the authors.aunmind index is being used by the optimizer for the query. A quick check of the columns included in the index using sp_helpindex on the authors table will show that the index consists of the au_lname and au_fname columns.

Index_name index_description index_keys
aunmind nonclustered located on PRIMARY au_lname, au_fname

A review of the execution plan OutputList column reveals that the phone column is only remaining column being requested by the query.

Execution Plan (abridged)
OutputList
----------------------------------------------------------------------------------
[authors].[phone]

Since the phone column is not in the index, you can deduct that the phone column is the offending column in this case.

Resolving bookmark lookups
Once you discover the columns responsible for a bookmark lookup, you will need to consider one of four methods that are available to resolve the bookmark lookup.

Create a covering index
Remove the offending column
Convert a non-clustered index into a clustered index

Create a covering index
Given the example listed earlier in this section, if the following covering index had been created, the result would be the removal of the bookmark lookup from the execution plan.

CREATE NONCLUSTERED INDEX ncl_authors_phone ON authors(au_lname, au_fname, phone)
GO

Execution Plan (Abridged)
SELECT [phone]=[phone] FROM [dbo].[authors] WHERE [au_lname]=@1 AND [au_fname]=@2 |--Index Seek(OBJECT:([pubs].[dbo].[authors].[ncl_authors_phone]), SEEK:([authors].[au_lname]=[@1] AND [authors].[au_fname]=[@2]) ORDERED FORWARD)

Remove the offending column
In the simple query below, the developer returned all the columns from the authors table when all the query asked for was the ID of the author.

SET STATISTICS PROFILE ON
GO

USE pubs
GO

--Find ID number for White, Johnson
SELECT *
FROM dbo.authors
WHERE au_lname = 'White'
AND au_fname = 'Johnson'
GO

Execution Plan (abridged)
StmtText
----------- ----------- ----------------------------------------------------------------------------------
SELECT * FROM [dbo].[authors] WHERE [au_lname]=@1 AND [au_fname]=@2 |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([pubs].[dbo].[authors])) |--Index Seek(OBJECT:([pubs].[dbo].[authors].[aunmind]), SEEK:([authors].[au_lname]='White' AND [authors].[au_fname]='Johnson') ORDERED FORWARD)

Removing the additional, unneeded columns and only returning the au_id column will remove the bookmark lookup in this case

SET STATISTICS PROFILE ON
GO

USE pubs
GO

--Find ID number for White, Johnson
SELECT au_id
FROM dbo.authors
WHERE au_lname = 'White'
AND au_fname = 'Johnson'
GO

Execution Plan (abridged)
StmtText
----------- ----------- ----------------------------------------------------------------------------------
SELECT [au_id]=[au_id] FROM [dbo].[authors] WHERE [au_lname]=@1 AND [au_fname]=@2 |--Index Seek(OBJECT:([pubs].[dbo].[authors].[aunmind]), SEEK:([authors].[au_lname]=[@1] AND [authors].[au_fname]=[@2]) ORDERED FORWARD)

Bookmark lookups are often caused by additional columns being returned in the data set “just in case” they are needed at a later date. Developers should strive to only include columns in their result sets which are needed for the defined query requirements. Additional columns can always be added at a later date.

Convert a non-clustered index into a clustered index
When developers are faced with bookmark lookups that cannot be removed with the other choices described above, an alternative choice would be to convert an existing index being used by the query into a clustered index. Converting an existing index into a clustered index will place all the columns of the table in the index and prevent the need for a bookmark lookup.

SET STATISTICS PROFILE ON
GO

USE pubs
GO

--Find information for White, Johnson
SELECT fname + ' ' + lname + ' Hire Date: ' + CAST(hire_date AS VARCHAR(12))
FROM dbo.employee
WHERE emp_id = 'PMA42628M'
GO

Execution Plan (abridged)
StmtText
----------- ----------- ----------------------------------------------------------------------------------
SELECT fname + ' ' + lname + ' Hire Date: ' + CAST(hire_date AS VARCHAR(12)) FROM dbo.employee
WHERE emp_id = 'PMA42628M'
|--Compute Scalar(DEFINE:([Expr1002]=[employee].[fname]+' '+[employee].[lname]+' Hire Date:
'+Convert([employee].[hire_date])))
|--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([pubs].[dbo].[employee]))
|--Index Seek(OBJECT:([pubs].[dbo].[employee].[PK_emp_id]),
SEEK:([employee].[emp_id]='PMA42628M') ORDERED FORWARD)

To resolve the bookmark lookup, the developer can change the existing clustered index on the lname, fname, and minit columns into a non-clustered index.

--change original clustered index into a non-clustered index
DROP INDEX employee.employee_ind
GO

CREATE INDEX employee_ind ON employee(lname,fname,minit) 
GO

Once the clustered index has been changed into a non-clustered index, a new clustered index can be built on the emp_id column to resolve the bookmark lookup. In this particular case the emp_id is the PRIMARY KEY of the table, so instead of an index, the developer needs to recreate a clustered PRIMARY KEY.

--Create new clustered index
--Drop CONSTRAINT
ALTER TABLE employee
DROP CONSTRAINT PK_emp_id
GO

--Recreate CONSTRAINT
ALTER TABLE employee
ADD CONSTRAINT PK_emp_id PRIMARY KEY  CLUSTERED  (emp_id) 
GO

--Test removal of bookmark lookup 
--Find information for White, Johnson
SELECT fname + ' ' + lname + ' Hire Date: ' + CAST(hire_date AS VARCHAR(12)) 
FROM dbo.employee
WHERE emp_id = 'PMA42628M'
GO

Execution Plan (abridged) 
StmtText                                                                             
----------- ----------- ---------------------------------------------------------------------
SELECT fname + ' ' + lname + ' Hire Date: ' + CAST(hire_date AS VARCHAR(12))  FROM dbo.employee
                                                                 WHERE emp_id = 'PMA42628M'
  |--Compute Scalar(DEFINE:([Expr1002]=[employee].[fname]+' '+[employee].[lname]+' Hire Date: 
                                                   '+Convert([employee].[hire_date]))) 
       |--Clustered Index Seek(OBJECT:([pubs].[dbo].[employee].[PK_emp_id]), 
                           SEEK:([employee].[emp_id]='PMA42628M') ORDERED FORWARD)

While converting a non-clustered index into a clustered index is a possible solution to bookmark lookups, often applications depend on the current clustered index and this solution will be almost impossible to implement in the real world.