How to guarantee unique of columns > 900 bytes
We have an app that threads together emails coming out of Exchange, using their messageid. To ensure threading works correctly, we need to ensure uniqueness of messageid, which we do with a unique index (we also need to be able to lookup by messageid when a message comes in).
We are currently porting the app from Oracle and PostgreSQL to SQL Server and are having problems with the 900 byte max length of an index. The problem is that the maximum size of a messageid (according to the Exchange docs) is 1877 bytes.
How can we guarantee uniqueness?
If you are using SQL Server 2005, you can use the hashbytes function to compute say MD5 hash on the string and persist it in your table. You can then create unique constraint on that instead of the message_id. Also, this works only if the hash value is less than 900 bytes which should be the case for something like messageid and depends on the algorithm. Hash functions in general are not guaranteed to produce unique output for all inputs but for all practical purposes it should suffice (when in doubt, test with possible generated inputs based on your schema).
See code below which uses SHA1 algorithm (will work only on SQL Server 2005 SP1 or later because RTM had a bug where hashbytes was marked as non-deterministic):
create table t1 ( c varchar(1024) not null, hash_c as cast(hashbytes('sha1', c) as binary(20)) unique )
go
insert into t1 (c) values( replicate('x', 1000))
insert into t1 (c) values( replicate('x', 1024))
insert into t1 (c) values( replicate('x', 1024))
go
select * from t1
go
drop table t1
go
You could do the same in SQL Server 2000 by computing hash on client and sending value to server or using extended SP (this approach should be done after careful consideration though).
Below are two other techniques that users trigger to enforce the unique key constraint. Both uses a computed column index to optimize the search.
-- Uses computed column index and primary key!
-- But usefulness of the computed column index depends on the data, statistics and query plan.
create table t1 ( id int not null primary key, c varchar(1024) not null, c_1 as checksum(c) )
create index idx_t1_c_1 on t1(c_1)
go
create trigger uq_t1_c
on t1
after insert, update
as
begin
if exists(select * from inserted as i
where exists(select * from t1
where t1.c_1 = checksum(i.c)
and t1.c = i.c
and t1.id <> i.id))
begin
raiserror('Unique key violation on t1.c!', 16, 1)
rollback
end
end
go
insert into t1 (id, c) values( 1, replicate('x', 1000))
insert into t1 (id, c) values( 2, replicate('x', 1024))
insert into t1 (id, c) values( 3, replicate('x', 1024))
go
select * from t1
go
drop table t1
go
-- Uses only computed column index and doesn't rely on primary key!
create table t1 ( id int not null primary key, c varchar(1024) not null, c_1 as checksum(c) )
create index idx_t1_c_1 on t1(c_1)
go
create trigger uq_t1_c
on t1
after insert, update
as
begin
if exists(select 1
from inserted as i
join t1
on t1.c_1 = checksum(i.c)
and t1.c = i.c
group by t1.c
having count(t1.c) > 1)
begin
raiserror('Unique key violation on t1.c!', 16, 1)
rollback
end
end
go
insert into t1 (id, c) values( 1, replicate('x', 1000))
insert into t1 (id, c) values( 2, replicate('x', 1024))
insert into t1 (id, c) values( 3, replicate('x', 1024))
go
select * from t1
go
drop table t1
go
This did occur to me - the problem is that hash algorithms are not guaranteed to produce unique output for all inputs (however statistically improbable that is), though in this case we have to produce a digest of the incoming email message anyway to see if it really is the same, so I think this is the right approach for us. I just have to work out the best way to support all three database platforms...