Using the XML data type in SQL ServerTable of contents
While working on the web based version of ALMA, I have had to make various database redesign decisions. This article documents the thought process behind one of these decisions in the hope that it will help other developers who may need to store similar data sets into an SQL server database. The article will also help other developers on the ALMA team understand the rational behind the decisions taken.
ALMA is a clinic letters production system and a complete workflow tool which helps in the management and sign-off of letters and clinical documentation. The software is currently used by 21 departments covering both surgical and medical specialties. The initial version, written over five years ago, was designed as a traditional client server application. The migration to a web based front-end is giving us the opportunity to rethink and improve on the initial database design. The redesign is important as the system scales up to producing over a million letters a year across all departments and specialities.
The database schema includes many tables all focused around the main documents table. The documents table stores all the meta data associated with a particular instance of a letter. When the clinic ran, the date the letter was dictated and the date the letter was typed are all examples of meta data items stored in the documents table. The table also stores a reference to who the letter is addressed to, who signed the letter and who approved the letter for publication.
Due to various restrictions the initial database design stored the whole clinical narrative of the letter as an unstructured blob of data into a VARCHAR(MAX) field. However, the letters content isn’t totally unstructured. Many letters have various sub-headings before the clinical narrative. These sub-headings should really be stored as discreet data items in the database.
As a teaching hospital junior doctors need to do a lot of research projects. Having a feature which facilitates searching all letters and patients with a particular sub-heading value would be of great use.
The problem is compounded by the large number and variation of sub-heading categories depending on which specialty the clinic letter is written for. After a quick audit upwards of 30 sub-heading categories over 4 specialties were found. It became clear that whatever database design was chosen the system had to be flexible enough to cope with many different heading types.
Ideally we wanted a database which wouldn’t require a schema modification every time a new sub-heading category was specified by a consultant. This requirement ruled out having each sub-heading item as a discreet data element in the database. As an organisation we have to quickly respond to new business requirements. Having to involve the IT department to create a new data element in the database; and perhaps modify the web applications logic wasn’t an option.
Let’s look at some examples from real clinic letters to give you an idea of the diversity of options.
Diagnosis Right CAT in April/May 2013 Complains of right otitis externa Plan Microsuction Keep ears dry Review 4 weeks to be seen by Mr Bradley
The example above shows three sub-headings diagnosis, plan and review. Note that some sub-headings like diagnosis can have multiple items or diagnosis’s.
The second example below demonstrates how some data items like Visual Acuity need to store two values. One for the right eye “RE” and one for the left eye “LE”
Diagnosis ? superficial corneal dystrophy Right amblyopia Visual Acuity RE: 6/12+2 LE: 6/4
Many sub-headings are common to all specialities, in fact some categories of heading like diagnosis should be mandatory on all clinic letters. There are specialty specific headings and depending on the patients condition there will be some headings which are specific to their care. The web application has to allow the secretary to be able to remove non-mandatory headings. It also needs to allow specific headings to be added on a letter by letter basis, depending on the patient. The business rules layer of the web application has to therefore be able to handle these variations and know how to store the data in the database.
Lets take a look at a couple of the ideas we considered and why we rejected them.
Key Value Pairs
Looking at the two examples you instantly think about a key value data store. Such data stores are used extensively in health informatics due to the complexity of clinical information systems. Many system suppliers desire the open-ended data structure a key value data store offers as it allows for future extension without modifying existing code or data.
It would be quite easy to have a table in SQL server which mimics a key value data store. You would only need 3 columns in the table. A reference to the DocumentID relating the entries back to the main documents table. A key column which would hold values like diagnosis, plan, review etc. The third column would be the values for each of the keys.
There are many articles on the internet which discuss the curse of trying to mimic a key value store in an SQL database. The main arguments against using this model within SQL are:
- Instead of a single insert, update, or delete per document / letter, there is one for each attribute (sub-heading) : creating a letter with 10 attributes involves inserting 10 rows; changing 10 attributes involves updating 10 rows; and of course destroying a letter with 10 attributes involves deleting 10 rows. All these SQL queries would have to be wrapped in a single transaction. On a heavily used system with hundreds of secretaries typing letters at once, the potential for deadlocks becomes very real.
- There is a certain amount of data duplication: not only is the DocumentID duplicated for every attribute, but the “key” part of the key-value pair (i.e. the attribute name such as diagnosis) takes up space.
- The “value” part of the key-value pair has to be the same data type, which will undoubtedly be a variable length character string (unless you do something really strange, and have a different table for each type). If you’re storing integers or dates, these have to be stored as strings, and must therefore be converted every time they are moved in or out of the database.
After reading these articles, we decide not to use the key value model.
NoSQL databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
NoSQL databases are built to allow the insertion of data without a predefined schema. This makes it easy to make significant application changes in real-time, without worrying about service interruptions – which means development is faster, code integration is more reliable, and less database administrator time is needed.
So while a NoSQL database would probably be a good candidate for our requirements, it didn’t feel right introducing a whole new software dependency just to store the sub headings. The rest of the database schema would be held within SQL server so by introducing another database layer we would incur a systems administration overhead by having to monitor the additional service and backup the separate databases.
We also wanted to the make the whole project open source, by limiting the dependencies we are making it easier for other hospitals to adopt the system.
The XML data type
After reviewing many clinic letter examples, it became clear that the web application could easily encode the data as a structured XML file. This would allow for repeating items such as diagnosis. It would also allow the application to write and parse any number of sub headings depending on what the letters content required.
The code sample below shows how the XML might be coded for the data represented in example 1.
<alma> <diagnoses> <diagnosis>Right CAT in April/May 2013</diagnosis> <diagnosis>Complains of right otitis externa</diagnosis> </diagnoses> <plan> <action>Microsuction</action> <action>Keep ears dry</action> </plan> <review>4 weeks to be seen by Mr Bradley</review> </alma>
Assuming we have written the application logic to write and parse the XML data, lets see how the data would be stored within SQL Server.
The SQL code fragment below will create a single table to hold all the document sub headings. There would be one row in the table for each letter or document. Each record in this table will be linked to the main documents table by the DocumentID field. The SubHeadings field is specified as an XML data type and will hold the XML data structure representing all the heading titles and values.
CREATE TABLE ALMA.DocumentSubHeadings ( SubHeadingID INT IDENTITY(1,1) NOT NULL, DocumentID INT NOT NULL, SubHeadings XML NOT NULL, CONSTRAINT PK_SubHeadingID PRIMARY KEY (SubHeadingID) ) ON [PRIMARY]
As a teaching hospital junior doctors need to carry out lots of research projects. To facilitate them in finding candidate patients we need the system to quickly return letters, and thus patients, where a specific diagnosis is mentioned. To achieve this we need to create a suitable index on the XML field SubHeadings. The SQL code below demonstrates how to create a primary XML index for the field. Two additional indexes are subsequently created, one to map the XML paths and one to index the values. This allows us to target a particular path within the XML such as diagnosis, and quickly retrieve rows with a specific value.
/* create the primary XML index */ CREATE PRIMARY XML INDEX idx_SubHeadings ON ALMA.DocumentSubHeadings(SubHeadings); /* create XML index for the paths */ CREATE XML INDEX idx_SubHeadings_XmlPath ON ALMA.DocumentSubHeadings(SubHeadings) USING XML INDEX idx_SubHeadings FOR PATH; /* create XML index for the values */ CREATE XML INDEX idx_SubHeadings_XmlValue ON ALMA.DocumentSubHeadings(SubHeadings) USING XML INDEX idx_SubHeadings FOR VALUE; GO
Selecting records using the XML path
Lets look at an example select statement which returns all the document id’s which contain the diagnosis of dystrophy. The code sample below shows how to use the cross apply feature of SQL Server looking for just the XML nodes which match the diagnosis path. The statement then returns just those records which have a diagnosis value containing the word or words we are interested in finding.
SELECT DISTINCT DocumentID FROM ALMA.DocumentSubHeadings AS S CROSS APPLY S.SubHeadings.nodes('/alma/diagnoses/diagnosis') AS F(N) WHERE N.value('.', 'VARCHAR(MAX)') LIKE '%dystrophy%'
There are some limitations to consider when using the XML data type, and I have outlined them below. The biggest one being you can’t sort or group items within the XML schema. So for example, visual acuity is a number value and you wouldn’t be able to retrieve all the records with a visual acuity between two numeric values. To achieve that level of searching you would have to return all records which had visual acuity values and put the values into a temporary table. Then use the temporary table to sort and return records within the given range.
Will it scale?
We tested this question by creating a table with over 6 million records. The select statement still performed the searches within 2 seconds. One way to keep the search speed responsive would be to add a specialty identifying column with a suitable index. Most doctors would only want to search letters and patients from their own departments. By filtering on the specialty column first we would vastly reduce the computation required to search within the XML data structure.
To create the 6 million test records we used the SQL data generator tool supplied by Red Gate.
Limitations of the XML Data Type
Note the following general limitations that apply to the XML data type:
- The stored representation of XML data type instances cannot exceed 2 GB.
- It cannot be used as a subtype of a sql_variant instance.
- It does not support casting or converting to either text or ntext. Use varchar(max) or nvarchar(max) instead.
- It cannot be compared or sorted. This means an XML data type cannot be used in a GROUP BY statement.
- It cannot be used as a parameter to any scalar, built-in functions other than ISNULL, COALESCE, and DATALENGTH.
- It cannot be used as a key column in an index. However, it can be included as data in a clustered index or explicitly added to a nonclustered index by using the INCLUDE keyword when the nonclustered index is created.
I hope this case study has given you an introduction to using the XML data type and under what scenarios it could be used to help solve your data storage problems.
Positive, constructive criticism is encouraged and welcome. If you have ideas, suggestions, advice or problems with this article, please send me an email.
Table of contents
- Key Value Pairs
- NoSQL Databases
- The XML data type
- Selecting records using the XML path
- Will it scale?
- Limitations of the XML Data Type
- Paperback version