T-SQL Tuesday #114 – The SQL Puzzle Party

There were times when I tried to look for puzzles to solve, especially the T-SQL puzzles (what happened to the T-SQL Challenge site?). Now I don't. Life is challenging as it is, especially if you work with SQL Server and really try to understand what's going on.

So rather than coming up with some contrived problem for you to solve as part of this edition of T-SQL Tuesday (thank you Matthew McGiffen), I will share something that surprised me only last week. And yes, I have solved it already and will be blogging more about it soon, so no, there is no big prize for solving my production issue here 😉

Here is the scenario

There is a table that stores millions of records. It has a primary key, a date when a record was processed, a bit column indicating whether it was processed or not, and some text fields that are used for something, but in our example, it’s just data that takes space on pages.

There is also an application which uses nHibernate to generate a T-SQL query that retrieves one (just one at a time) record from that table where IsProcessed = 0. There are 10-50 such records at peak times, in a table which holds tens of millions of records, so making the query very, very fast should be easy with a tiny little covering filtered index. Well… it turns out SQL Server prefers to scan the clustered index instead.

Have a look

The challenge setup

use tempdb
go
drop table if exists dbo.LongProcessingTable
if not exists(select 1 from sys.tables where name = 'LongProcessingTable')
create table LongProcessingTable (
Id int not null identity primary key
,ProcessedOn datetime2 null
,IsProcessed bit null
,SomeData nvarchar(1024) not null
)

-- just some text to fill up the space on pages
declare @sometext nvarchar(1024) = (
select string_agg(convert(char(1),name), '')
from sys.all_objects
)

-- create just 100k records with some random date values
-- at this time all records are marked as processed
insert into dbo.LongProcessingTable(ProcessedOn, IsProcessed, SomeData)
select top(100000)
dateadd(second, -abs(checksum(a.object_id, b.object_id)%10000), getdate())
,1
,@sometext
from sys.all_objects a
cross join sys.all_objects b

-- now mark 10 rows as not processed
update d set IsProcessed = 0, ProcessedOn = null
from (
select top (10) *
from dbo.LongProcessingTable d
order by ProcessedOn desc
) d

Now the query:

declare @IsProcessed bit = 0

select top(1) Id, SomeData
from dbo.LongProcessingTable
where IsProcessed = @IsProcessed

The above query comes from the application and cannot be changed. It is what it is. And to help you start, here is the index I thought would work, but doesn’t.

create index IX_LongProcessingTable_NotProcessedYet
on dbo.LongProcessingTable(IsProcessed) include (SomeData)
where IsProcessed = 0

The index gets ignored and the server goes for the clustered index scan instead.
Of course, somebody discovered this earlier. I wasn't all that surprised to find that Erik Darling had blogged about it in 2015, 2017 and 2018; he even says 'IT IS KNOWN'… well, it wasn't to me. But even now, with that knowledge, I still cannot change the query, so what can I do? How do I make this query more efficient without changing it, and without creating a covering index on the whole table, which can contain hundreds of GB of data, just to get one row?
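If you want to see the cause for yourself before attempting the challenge, it is easy to reproduce (this is purely a diagnostic, not a fix, since the application query cannot be changed): with a literal predicate, or when the variable's value is known at compile time, the optimizer should be able to use the filtered index.

-- diagnostic only: the real query cannot be changed
-- a literal predicate makes the filtered index eligible...
select top(1) Id, SomeData
from dbo.LongProcessingTable
where IsProcessed = 0

-- ...and so does recompiling with the variable's value known at compile time
declare @IsProcessed bit = 0
select top(1) Id, SomeData
from dbo.LongProcessingTable
where IsProcessed = @IsProcessed
option (recompile)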

If you are still reading… well, enjoy the challenge. I will follow up with a few comments and a couple of my attempts at solving the problem later this month (hopefully).

Performance of JSON in SQL Server

Have you heard claims that the new JSON support in SQL Server performs 10 times better than the old XML storage, despite using the nvarchar type for storage instead of the specialised xml type? Claims like this one on blogs.msdn.microsoft.com. I used to do a lot of XML in SQL Server and now I use JSON too, so I decided to check how they compare when it comes to performance in SQL Server (2017).

Setting the Scene

The XML data type has been supported by SQL Server since version 2005. It was implemented as a new built-in data type. It is possible to define a column as the xml type, with or without a schema collection, and then query the non-relational data together with relational data. It came with the for xml clause and the methods defined on the xml type, most notably query, value and modify (plus exist and nodes).

Let’s see an example of retrieving a value of a node from an XML variable:

declare @xml xml = N'<Test>123</Test>';
select @xml.value(N'(//Test/text())[1]', N'int');

JSON support was introduced in SQL Server 2016. That's right, JSON support but not a JSON type. The data is stored as plain text, something that was possible before of course, but now there is the for json clause and a set of functions: isjson, json_modify, json_query and json_value, plus the openjson rowset function.

An example of retrieving value from a JSON document can look something like this:

declare @json nvarchar(max) = N'{ "Test": "123" }';
select json_value(@json, '$.Test');

It would be difficult to compare the performance of every single use case, so I have decided to focus on comparing single value retrieval from a document, using .value and json_value for XML and JSON respectively.

Sample XML document

<test>
   <row key="ValueA" value="123" />
   <row key="ValueB" value="-432" />
   ...
   <row key="ValueZ" value="849" />
</test>

Sample JSON document

{
   "ValueA": "123",
   "ValueB": "-432",
   ...
   "ValueZ": "849"
}

I have created tables with only two columns: an integer primary key Id and a Data column of the tested type. The tables are populated with 100,000 random records. The test consists of simply calculating a sum of a specific property from the document across all the rows in the table. To put things in perspective I will compare the document types against the same aggregation on plain int and varchar columns.
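The original setup script isn't shown in the post, so here is a minimal sketch of what such a pair of test tables could look like. The table and column names (dbo.XmlData, dbo.JsonData, Data) are taken from the test queries below; the fixed sample documents and the row generation are my own simplification rather than the actual test data.

-- two columns only: an integer primary key and the document
create table dbo.XmlData (
   Id int not null identity primary key
   ,[Data] xml not null
)

create table dbo.JsonData (
   Id int not null identity primary key
   ,[Data] nvarchar(max) not null
)

-- 100,000 sample documents per table
insert into dbo.XmlData([Data])
select top (100000)
   N'<test><row key="ValueA" value="123" /><row key="ValueZ" value="849" /></test>'
from sys.all_objects a
cross join sys.all_objects b

insert into dbo.JsonData([Data])
select top (100000)
   N'{ "ValueA": "123", "ValueZ": "849" }'
from sys.all_objects a
cross join sys.all_objects b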

Test XML query

select sum(
   [Data].value('(//test/row[@key="ValueA"]/@value)[1]', 'int')
) 
from dbo.XmlData

Test JSON query

select sum(
   convert(int, json_value([Data], '$.ValueA'))
) 
from dbo.JsonData

There are no indexes or computed columns to help with the performance of the queries as the objective is to simply measure performance of parsing and document querying.
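The post doesn't show how the CPU times and logical reads were captured; one common way (my assumption, not necessarily what was used here) is to switch on statistics output around each test query:

-- capture elapsed/CPU time and logical reads for the test query
set statistics time on
set statistics io on

select sum(
   convert(int, json_value([Data], '$.ValueA'))
)
from dbo.JsonData

set statistics time off
set statistics io off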

First Comparison Results

Json in SQL Server Performance Comparison

That can't be right!? Can it? The JSON query took on average 0.43 seconds, which is 36 times faster than the 16.79 seconds it took to sum the values out of the XML documents. Better (or worse) still, compared to the XML with schema, which took 40.47 seconds, the JSON type seems to be 94 times faster. So what happened to the 10 times faster claim? Well, the native VARCHAR column performed 10 times faster than the JSON one, at 0.04 seconds, and the INT column was faster still at 0.03 seconds.

It is also worth looking at the logical reads. The native INT and VARCHAR columns required just over 200 logical reads, both JSON and schema-less XML took around 12,000, while the XML with schema required over 34,000.

Playing with storage types

The above were rather unexpected results, both in terms of CPU and logical reads. And that got me thinking: what if I could reduce the number of reads? The simplest way to do so would be to reduce the size of the data by changing the types. I have created additional tables with an exact copy of the JSON test records but using the VARCHAR(max), NVARCHAR(512) and VARCHAR(512) types in addition to the standard (as in most common in all the examples I have seen) NVARCHAR(max).
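As a sketch, one of those extra tables could be created by copying the JSON documents into a narrower, non-unicode column (the table name here is hypothetical):

-- same JSON documents, non-unicode and with a specific maximum length
create table dbo.JsonDataVarchar512 (
   Id int not null identity primary key
   ,[Data] varchar(512) not null
)

insert into dbo.JsonDataVarchar512([Data])
select convert(varchar(512), [Data])
from dbo.JsonData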

Performance of JSON data with different storage types

Another surprise! As expected, using VARCHAR instead of unicode NVARCHAR halved the number of logical reads, but that didn't translate to improved CPU times. Using json_value on VARCHAR(max) was significantly (and consistently) worse than the same function on NVARCHAR(max). Types with a specific maximum length performed better than the (max) types, but there too unicode required less CPU time despite having twice as many page reads.

At the beginning or at the end

So is it really possible that the old XML type is 36 times worse than on-the-fly JSON parsing? Perhaps there are some benefits to XML storage after all? I'd imagine the JSON parser is clever enough to stop parsing after the first matching node is found, which probably means that performance will depend on whether we are looking for a value of a property close to the beginning or close to the end of the document. And perhaps that's where XML, being a pre-parsed type, will perform better? I have modified the tests to look at the first and last values from the documents.
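For reference, the 'last value' variants of the test queries look the same as the ValueA ones above, just pointing at ValueZ:

-- XML: attribute of the last row element in the document
select sum(
   [Data].value('(//test/row[@key="ValueZ"]/@value)[1]', 'int')
)
from dbo.XmlData

-- JSON: last property in the document
select sum(
   convert(int, json_value([Data], '$.ValueZ'))
)
from dbo.JsonData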

Performance depending on value position in a document

Here the results were more as expected. When using json_value on NVARCHAR(max) data, ValueZ, sitting at the end of the document, performed worse than ValueA, which is at the beginning. In fact the ValueZ queries took 10 times longer than the ValueA ones. XML also showed a difference, but its ValueZ queries took only 30% longer than the ValueA ones. Still, the worst case JSON query was 4 times faster than the best case XML query!

Storage

So that shows that JSON on unicode types uses less CPU time but requires more logical reads, which suggests more pressure on RAM and potentially more I/O. How do the storage requirements of all those types compare?
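One quick way to get those numbers (again, my assumption about how it could be measured rather than what was actually used) is sp_spaceused per test table:

-- reserved, data and index sizes for each test table
exec sp_spaceused N'dbo.JsonData'
exec sp_spaceused N'dbo.XmlData'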

Data storage comparison

XML with schema is the most expensive in terms of storage, by quite some margin. Despite being visually more verbose, the XML type actually takes slightly less disk space than the unicode JSON, which in turn takes twice as much as the non-unicode JSON. As one would expect, the plain INT columns require the least.

Conclusions

JSON data in SQL Server 2017 really is faster to query than the old XML type. How much faster will depend on the position of the specific value in the document. That's important: the structure and ordering of properties in the document matters. If you are in control of the JSON document structure, consider putting properties more likely to be queried at the beginning of the document, and those you know you will index at the end, so that you don't have to parse through them when working with properties which are not indexed. Types matter too. NVARCHAR is faster than VARCHAR but will use twice as many logical reads (and therefore RAM). Specific lengths perform better than (max).
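On the indexing point: SQL Server has no native JSON index, but the usual approach is to expose a property as a computed column over json_value and put an ordinary nonclustered index on it. A sketch against the hypothetical dbo.JsonData table from earlier:

-- expose a JSON property as a computed column...
alter table dbo.JsonData
   add ValueA as convert(int, json_value([Data], '$.ValueA'))

-- ...and index it so queries on that property no longer parse every document
create index IX_JsonData_ValueA on dbo.JsonData(ValueA)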