BLOB, TEXT, and case sensitivity: MySQL won’t treat them the same
When I first discovered that there is both BLOB and TEXT in databases, I was puzzled. They occupy the same amount of disk space, and the database doesn’t alter the data itself anyhow. Why two?
Of course, I followed the stream, and went for TEXT and VARCHAR for everything, since I don’t store binary data in the database. That may not be the optimal choice in all cases.
It turns out, that MySQL goes a long way to “help” the user with string operations. In particular, when the table column is defined as TEXT, VARCHAR and their derivatives, the database will compare strings as they would be understood by humans. And if that definition sounds ambiguous to you, you’re in good company: Different versions of MySQL compare text strings differently. For example, snipping leading and trailing whitespaces from the strings before comparing them: Some versions will do this, others won’t.
The bottom line is that if you change your MySQL server, tiny bugs may creep in. All these corner cases may behave differently. This is the classic case of some entry disappearing from a list of 53478, without anyone noticing.
Another issue to consider is character collation: When using TEXT, VARCHAR and friends, we also have that ‘Ö’ and ‘OE’ are treated as the same character (when German character set is used, among others). Just an example.
The solution: Make MySQL treat your data as a binary. A BLOB column type, for example. You loose all those extra “features”, but gain stability over database versions and flavors. That means that you have to handle all the case-insensitivity issues yourself, as well cleaning up the strings properly. With good practices, that’s not an issue. It’s a matter of if you want to take responsibility, or let the database iron those small wrinkles for you.
And finally: What about uniqueness in tables? Here’s a short session, using MySQL 4.0.24:
mysql> CREATE TABLE try (mydata TEXT, UNIQUE INDEX (mydata(20)) ); Query OK, 0 rows affected (0.00 sec) mysql> INSERT INTO try(mydata) VALUES('Hello'); Query OK, 1 row affected (0.00 sec) mysql> SELECT * FROM try WHERE mydata='hELLO'; +--------+ | mydata | +--------+ | Hello | +--------+ 1 row in set (0.00 sec)
This was pretty much expected: As a TEXT column, the comparison was case-insensitive. And of course, the capital “H” was saved in the table, even though that doesn’t matter in string comparisons.
But what happens if we want to add an entry, which violates the uniqueness, when considering the strings in a case-insensitive manner?
mysql> INSERT INTO try(mydata) VALUES('HELLO'); ERROR 1062: Duplicate entry 'HELLO' for key 1
As expected, MySQL didn’t swallow this. “Hello” and “HELLO” are the same, so they can’t live together when “mydata” is restricted as UNIQUE.
So let’s drop the table, and try this again, this time with a BLOB column. Spoiler: Everything is case-sensitive now.
mysql> DROP TABLE try; Query OK, 0 rows affected (0.00 sec) mysql> CREATE TABLE try (mydata BLOB, UNIQUE INDEX (mydata(20)) ); Query OK, 0 rows affected (0.01 sec) mysql> INSERT INTO try(mydata) VALUES('Hello'); Query OK, 1 row affected (0.00 sec) mysql> SELECT * FROM try WHERE mydata='hELLO'; Empty set (0.00 sec)
(why should it find anything? ‘Hello’ and ‘hELLO’ are completely different!)
mysql> SELECT * FROM try; +--------+ | mydata | +--------+ | Hello | +--------+ 1 row in set (0.00 sec) mysql> INSERT INTO try(mydata) VALUES('HELLO'); Query OK, 1 row affected (0.00 sec) mysql> SELECT * FROM try; +--------+ | mydata | +--------+ | Hello | | HELLO | +--------+ 2 rows in set (0.01 sec)
(No problems with the uniqueness: ‘Hello’ and ‘HELLO’ are not the same in a BLOB)
To summarize all this: Before choosing between TEXT or BLOB, ask yourself if you want the database to treat the string exactly as it is, or if you want some forgiveness regarding case, whitespaces and natural language issues.
For example, are you sure that you want the user name and password as text? In particular, would you like the password case-insensitive? Do you want HTTP links as text? The address itself is indeed case-insensitive to the web, but CGI arguments (everything after the question mark, if present) is case-sensitive (YouTube video IDs, for example).
Usually, using a text column is OK. But it’s a choice one has to make.
Reader Comments
Thanks… simple solution for my problem :)
Thanks, I come to know the difference between Text and Blob, and it helps me to solve my searching.