How to guarantee unique of columns > 900 bytes

We have an app that threads together emails coming out of Exchange, using their messageid. To ensure threading works correctly, we need to ensure uniqueness of messageid, which we do with a unique index (we also need to be able to lookup by messageid when a message comes in).

We are currently porting the app from Oracle and PostgreSQL to SQL Server and are having problems with the 900 byte max length of an index. The problem is that the maximum size of a messageid (according to the Exchange docs) is 1877 bytes.

How can we guarantee uniqueness?

[553 byte] By [fiddlesticks] at [2007-12-23]
# 1
If you are using SQL Server 2005, you can use the hashbytes function to compute say MD5 hash on the string and persist it in your table. You can then create unique constraint on that instead of the message_id. Also, this works only if the hash value is less than 900 bytes which should be the case for something like messageid and depends on the algorithm. Hash functions in general are not guaranteed to produce unique output for all inputs but for all practical purposes it should suffice (when in doubt, test with possible generated inputs based on your schema).

See code below which uses SHA1 algorithm (will work only on SQL Server 2005 SP1 or later because RTM had a bug where hashbytes was marked as non-deterministic):

create table t1 ( c varchar(1024) not null, hash_c as cast(hashbytes('sha1', c) as binary(20)) unique )
go
insert into t1 (c) values( replicate('x', 1000))
insert into t1 (c) values( replicate('x', 1024))
insert into t1 (c) values( replicate('x', 1024))
go
select * from t1
go
drop table t1

go

You could do the same in SQL Server 2000 by computing hash on client and sending value to server or using extended SP (this approach should be done after careful consideration though).

Below are two other techniques that users trigger to enforce the unique key constraint. Both uses a computed column index to optimize the search.

-- Uses computed column index and primary key!
-- But usefulness of the computed column index depends on the data, statistics and query plan.
create table t1 ( id int not null primary key, c varchar(1024) not null, c_1 as checksum(c) )
create index idx_t1_c_1 on t1(c_1)
go
create trigger uq_t1_c
on t1
after insert, update
as
begin
if exists(select * from inserted as i
where exists(select * from t1
where t1.c_1 = checksum(i.c)
and t1.c = i.c
and t1.id <> i.id))
begin
raiserror('Unique key violation on t1.c!', 16, 1)
rollback
end
end
go
insert into t1 (id, c) values( 1, replicate('x', 1000))
insert into t1 (id, c) values( 2, replicate('x', 1024))
insert into t1 (id, c) values( 3, replicate('x', 1024))
go
select * from t1
go
drop table t1
go


-- Uses only computed column index and doesn't rely on primary key!
create table t1 ( id int not null primary key, c varchar(1024) not null, c_1 as checksum(c) )
create index idx_t1_c_1 on t1(c_1)
go
create trigger uq_t1_c
on t1
after insert, update
as
begin
if exists(select 1
from inserted as i
join t1
on t1.c_1 = checksum(i.c)
and t1.c = i.c
group by t1.c
having count(t1.c) > 1)
begin
raiserror('Unique key violation on t1.c!', 16, 1)
rollback
end
end
go
insert into t1 (id, c) values( 1, replicate('x', 1000))
insert into t1 (id, c) values( 2, replicate('x', 1024))
insert into t1 (id, c) values( 3, replicate('x', 1024))
go
select * from t1
go
drop table t1
go

# 2
You also can use triggers to maintain uniqueness.
gavrilenko_s at 2007-8-30 > top of Msdn Tech,SQL Server,Transact-SQL...
# 3
Umachandar Jayachandran - MS wrote:

If you are using SQL Server 2005, you can use the hashbytes function to compute say MD5 hash on the string and persist it in your table. You can then create unique constraint on that instead of the message_id.

This did occur to me - the problem is that hash algorithms are not guaranteed to produce unique output for all inputs (however statistically improbable that is), though in this case we have to produce a digest of the incoming email message anyway to see if it really is the same, so I think this is the right approach for us. I just have to work out the best way to support all three database platforms...

Umachandar Jayachandran - MS wrote:

Note that you can't use computed column approach (to create constraint) because the hashbytes function is non-deterministic. Also, this works only if the hash value is less than 900 bytes which should be the case for something like messageid.

We're using SHA, so it's never going to be more than 160 bits.

Any idea why hashbytes is non-deterministic? No matter how many times I run this:

select hashbytes('SHA', replicate('x', 1877))


I always get:

0xCC48A6FD0B8C7B758DA27F8D1BD3E2F38D468AC2

But if I create an index on a column that is generated using hashbytes() I get the error:

"Column 'myhash' in table 'test' cannot be used in an index or statistics or as a partition key because it is non-deterministic."

fiddlesticks at 2007-8-30 > top of Msdn Tech,SQL Server,Transact-SQL...
# 4

gavrilenko_s wrote:
You also can use triggers to maintain uniqueness.

My understanding is that a unique constraint or index is considerably faster than doing a uniqueness check in a trigger. Also, without reducing the size of the messageid column so it can be indexed we'd be forced into a full table scan...

fiddlesticks at 2007-8-30 > top of Msdn Tech,SQL Server,Transact-SQL...
# 5
Yes, hashbytes should be marked as deterministic. It was a bug. We found it after RTM and fixed it in SP1. So if you try it on SP1 like below it will work fine. I tried the code before on RTM and didn't check on SP1.

create table t1 ( c varchar(1024) not null, hash_c as cast(hashbytes('sha1', c) as binary(20)) unique )
go
insert into t1 (c) values( replicate('x', 1000))
insert into t1 (c) values( replicate('x', 1024))
insert into t1 (c) values( replicate('x', 1024))
go
select * from t1
go
drop table t1

I agree that any hash function is not guaranteed to produce unique output for all values but it should suffice for all practical purposes. A safe approach would be the hybrid (computed column index + trigger) approach that I outlined using checksum/binary_checksum if you are unsure of the input data.

UmachandarJayachandran-MS at 2007-8-30 > top of Msdn Tech,SQL Server,Transact-SQL...
# 6
That's great - thanks for your help.
fiddlesticks at 2007-8-30 > top of Msdn Tech,SQL Server,Transact-SQL...
SQL Server
Transact-SQL

SQL Server Hot Topic

SQL Server New Topic

SQL Server

Site Classified