MySQL: Enforcing long unique BLOB / TEXT columns with UNIQUE INDEX

This post was written by eli on December 22, 2020
Posted Under: MySQL

Introduction

When creating an index on a TEXT / BLOB column (or one of their variants), it’s required to specify how many characters (bytes, in the case of BLOB) the index should cover. MySQL’s docs usually suggest keeping this prefix short for better performance. There’s also an upper limit on its length, which varies from one storage engine to another, ranging from a few hundred to a few thousand bytes.
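
For example, this is how a plain (non-unique) prefix index looks. The table and the prefix length of 200 are made up for illustration:

CREATE TABLE articles (
	id INT UNSIGNED NOT NULL AUTO_INCREMENT,
	body TEXT NOT NULL,
	PRIMARY KEY (id),
	INDEX (body(200))
) engine = MyISAM;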

However, it’s not unusual to use a UNIQUE INDEX to make the database enforce the uniqueness of a field in the table. INSERT ... ON DUPLICATE KEY UPDATE, INSERT IGNORE, UPDATE IGNORE and REPLACE can then be used to gracefully keep things tidy.
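
For example, against a hypothetical table t with a UNIQUE index on its name column (these names are made up, not taken from the schema below):

-- Silently skip the insert if it would collide with an existing row:
INSERT IGNORE INTO t(name, counter) VALUES('foo', 0);

-- Delete the colliding row, if any, and insert the new one:
REPLACE INTO t(name, counter) VALUES('foo', 0);

-- Update the existing row instead of inserting a new one:
INSERT INTO t(name, counter) VALUES('foo', 0)
	ON DUPLICATE KEY UPDATE counter = counter + 1;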

But does UNIQUE INDEX mean that the entire field remains unique, or is only the part covered by the index checked? Spoiler: only the part covered by the index. In other words, MySQL’s uniqueness enforcement may be too strict, rejecting rows that are in fact unique.

Almost needless to say, if the index isn’t UNIQUE, the prefix length is just a performance issue: the shorter the covered prefix, the more often the database fetches a row’s data only to discard it, because the row turns out not to match the search criteria. But it doesn’t change the logical behavior. With a UNIQUE INDEX, the prefix length does.

A simple test follows.

Trying the UNIQUE INDEX

I created a table with MySQL 5.1.47, with an index covering only 5 characters, deliberately very short.

CREATE TABLE try (
	id INT UNSIGNED NOT NULL AUTO_INCREMENT,
	message BLOB NOT NULL,
	PRIMARY KEY (id),
	UNIQUE INDEX (message(5))
) engine = MyISAM;

Which ends up with this:

mysql> DESCRIBE try;
+---------+------------------+------+-----+---------+----------------+
| Field   | Type             | Null | Key | Default | Extra          |
+---------+------------------+------+-----+---------+----------------+
| id      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| message | blob             | NO   | UNI | NULL    |                |
+---------+------------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

Inserting the first row:

mysql> INSERT INTO try(message) VALUES('Hello, world');
Query OK, 1 row affected (0.00 sec)

And now trying a second:

mysql> INSERT INTO try(message) VALUES('Hello there');
ERROR 1062 (23000): Duplicate entry 'Hello' for key 'message'

That’s it. It just looked at the first five characters. Now trying a string that differs within this region:

mysql> INSERT INTO try(message) VALUES('Hell there');
Query OK, 1 row affected (0.00 sec)

No wonder that worked.

mysql> SELECT * FROM try;
+----+--------------+
| id | message      |
+----+--------------+
|  1 | Hello, world |
|  2 | Hell there   |
+----+--------------+
2 rows in set (0.00 sec)
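
Note how this interacts with the statements mentioned in the introduction. I didn’t run these, but given the duplicate-key error above, this is what they would do:

-- 'Hello there' collides with 'Hello, world' within the 5-char prefix,
-- so this silently drops a message that is in fact unique:
INSERT IGNORE INTO try(message) VALUES('Hello there');

-- Worse: this silently deletes the 'Hello, world' row and inserts
-- 'Hello there' in its place:
REPLACE INTO try(message) VALUES('Hello there');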

Handling large TEXT/BLOB

And that leaves the question: how is it possible to ensure uniqueness on large chunks of text or binary data? One solution I can think of is to add a column for a hash (say, SHA1), have the application calculate that hash for each row and insert it along with the data, and put the UNIQUE INDEX on the hash, not the text. Something like

CREATE TABLE try2 (
	id INT UNSIGNED NOT NULL AUTO_INCREMENT,
	message BLOB NOT NULL,
	hash TINYBLOB NOT NULL,
	PRIMARY KEY (id),
	UNIQUE INDEX (hash(40))
) engine = MyISAM;

But wait. MySQL supports hashing functions. Why not use them instead? Well, the problem is that if I want a single INSERT statement to push the data and its hash in one go, the query becomes a bit nasty. What came to my mind is:

mysql> INSERT INTO try2(message, hash) (SELECT t.x, SHA1(t.x) FROM (SELECT 'Hello there' AS x) AS t);
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

Can it get more concise than this? Suggestions are welcome. The double SELECT is there because I want the string literal to appear only once.
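
One alternative that comes to mind is a session variable, which also mentions the literal once, at the price of two statements instead of one (a sketch, with a fresh string, not run as part of the session shown here):

SET @msg = 'A third message';
INSERT INTO try2(message, hash) VALUES(@msg, SHA1(@msg));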

Isn’t it easier to let the application calculate the SHA1 and send it to the server by value? It’s a matter of taste, I guess.
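
Yet another possibility, for completeness: a BEFORE INSERT trigger that fills in the hash on the server side (a sketch, not something I’ve actually tried on this table):

CREATE TRIGGER try2_hash BEFORE INSERT ON try2
	FOR EACH ROW SET NEW.hash = SHA1(NEW.message);

-- With the trigger in place, hash can be given as a dummy value;
-- the trigger overwrites it before the row is stored:
INSERT INTO try2(message, hash) VALUES('Yet another message', '');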

Anyhow, back to the double-SELECT statement above, trying again with exactly the same string:

mysql> INSERT INTO try2(message, hash) (SELECT t.x, SHA1(t.x) FROM (SELECT 'Hello there' AS x) AS t);
ERROR 1062 (23000): Duplicate entry '726c76553e1a3fdea29134f36e6af2ea05ec5cce' for key 'hash'

and with something slightly different:

mysql> INSERT INTO try2(message, hash) (SELECT t.x, SHA1(t.x) FROM (SELECT 'Hello there!' AS x) AS t);
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

So yep, it works:

mysql> SELECT * FROM try2;
+----+--------------+------------------------------------------+
| id | message      | hash                                     |
+----+--------------+------------------------------------------+
|  1 | Hello there  | 726c76553e1a3fdea29134f36e6af2ea05ec5cce |
|  2 | Hello there! | 6b19cb3790b6da8f7c34b4d8895d78a56d078624 |
+----+--------------+------------------------------------------+
2 rows in set (0.00 sec)

Once again, even though the example above demonstrates how to make MySQL calculate the hash, I would do it in the application.
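
For the record, the application-side variant boils down to sending both values, which can be mimicked from the mysql client with a prepared statement (a sketch; SHA1() here merely stands in for the hash the application would compute itself):

PREPARE ins FROM 'INSERT INTO try2(message, hash) VALUES(?, ?)';
SET @m = 'One more message';
SET @h = SHA1(@m); -- in real life, computed by the application
EXECUTE ins USING @m, @h;
DEALLOCATE PREPARE ins;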

Reader Comments

Eli, I bumped into this post while researching a similar topic. The way you implemented the hash would not actually work in all cases: you need to consider hash collisions, otherwise your implementation will reject legitimate unique entries. The chances are low, but they exist, especially in a very large scale system.

#1 
Written By Xiaotian Guo on December 15th, 2021 @ 03:08

The question is how large the system is and what probability is acceptable. I Googled around a bit and found this:

http://web.archive.org/web/20110725085124/http://bitcache.org/faq/hash-collision-probabilities

If that’s true, the probability of a hash collision is most likely negligible for any realistic database.

#2 
Written By eli on December 15th, 2021 @ 11:03
