MySQL: LIKE vs. REGEXP() performance testing
Introduction
At some point I needed to choose between using LIKE or REGEXP() for a not so simple string match. Without going into the details, the matching string contains a lot of wildcard segments, and while both would have done the job, I thought maybe REGEXP() would benefit from some extra information about the wildcard parts. It’s not that I cared that ‘%’ would match characters it shouldn’t, but I wanted so save some backtracking by telling the matching engine not to match just anything. Save some CPU, I thought. Spoiler: It was a nice thought, but no.
So I ran a few performance tests on a sample table:
CREATE TABLE product_info ( cpe_name BLOB NOT NULL, title BLOB NOT NULL, PRIMARY KEY (cpe_name(511)) ) engine = MyISAM;
Note that cpe_name is the primary key, and is hence a unique index.
I should also mention that the match patterns I use are for testing are practically useless for CPE name matching because they don’t handle escaped “:” characters properly. Just in case you know what a CPE is and you’re here for that. In short, these are just performance tests.
I did this on MySQL server version: 5.1.47, Source distribution. There are newer versions around, I know. Maybe they do better.
The art of silly queries
So there’s about more than half a million entries:
mysql> SELECT COUNT(*) FROM product_info; +----------+ | COUNT(*) | +----------+ | 588094 | +----------+ 1 row in set (0.00 sec)
Let’s ask it another way:
mysql> SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE '%'; +----------+ | COUNT(*) | +----------+ | 588094 | +----------+ 1 row in set (0.08 sec)
Say what? LIKE ‘%’ is always true for a non-NULL BLOB. MySQL didn’t optimize this simple thing, and actually checked every single entry?
mysql> EXPLAIN SELECT COUNT(*) FROM product_info; +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ | 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ 1 row in set (0.00 sec) mysql> EXPLAIN SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE '%'; +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | 1 | SIMPLE | product_info | ALL | NULL | NULL | NULL | NULL | 588094 | Using where | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ 1 row in set (0.00 sec)
Apparently it did. Other silly stuff:
mysql> SELECT COUNT(*) FROM product_info WHERE NOT cpe_name IS NULL; +----------+ | COUNT(*) | +----------+ | 588094 | +----------+ 1 row in set (0.08 sec) mysql> EXPLAIN SELECT COUNT(*) FROM product_info WHERE NOT cpe_name IS NULL; +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | 1 | SIMPLE | product_info | ALL | PRIMARY | NULL | NULL | NULL | 588094 | Using where | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ 1 row in set (0.00 sec)
Silly or what? The column is defined as “NOT NULL”. What is there to check? So maybe the idea is that if I make stupid queries, MySQL responds with stupid behavior. Well, not really:
mysql> SELECT COUNT(*) FROM product_info WHERE 1=1; +----------+ | COUNT(*) | +----------+ | 588094 | +----------+ 1 row in set (0.00 sec) mysql> EXPLAIN SELECT COUNT(*) FROM product_info WHERE 1=1; +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ | 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ 1 row in set (0.00 sec)
It’s more like MySQL makes optimizations only if they’re really obvious. MySQL != gcc.
LIKE vs REGEXP
So it turns out that LIKE is really fast, and it seems to take advantage of the index:
mysql> SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:hummingbird:cyberdocs:-%'; +-------------------------------------------------+ | cpe_name | +-------------------------------------------------+ | cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* | +-------------------------------------------------+ 1 row in set (0.00 sec)
The same query, only with a regex:
mysql> SELECT cpe_name FROM product_info WHERE cpe_name REGEXP '^cpe:2.3:a:hummingbird:cyberdocs:-.*'; +-------------------------------------------------+ | cpe_name | +-------------------------------------------------+ | cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* | +-------------------------------------------------+ 1 row in set (0.21 sec)
Recall that indexing means that the rows are sorted. Because the initial part of the string is fixed, it’s possible to narrow down the number of rows to match with an index lookup, and then work on those that match that initial part. This was taken advantage of with the LIKE match, but apparently not with REGEXP():
mysql> EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:hummingbird:cyberdocs:-%'; +----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+ | 1 | SIMPLE | product_info | range | PRIMARY | PRIMARY | 513 | NULL | 1 | Using where | +----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+ 1 row in set (0.00 sec) mysql> EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name REGEXP '^cpe:2.3:a:hummingbird:cyberdocs:-.*'; +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | 1 | SIMPLE | product_info | ALL | NULL | NULL | NULL | NULL | 588094 | Using where | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ 1 row in set (0.00 sec)
The conclusion is quite clear: For heavy duty matching, don’t use REGEXP() if LIKE can do the job. In particular if a lot of rows can be ruled out by virtue of the first characters in the string.
Making it harder for LIKE
Let’s warm up a bit:
mysql> SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:humming%:cyberdocs:-%';
+-------------------------------------------------+
| cpe_name |
+-------------------------------------------------+
| cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* |
+-------------------------------------------------+
1 row in set (0.00 sec)
It was quicker than measurable, mainly because there was little to do:
mysql> EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:humming%:cyberdocs:-%'; +----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+ | 1 | SIMPLE | product_info | range | PRIMARY | PRIMARY | 513 | NULL | 80 | Using where | +----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+ 1 row in set (0.00 sec)
As expected, the index was used for the initial part of the string, and that left MySQL with 80 rows to actually do the matching (in this specific case). So it was quick.
But what if the wildcard is at the beginning of the string?
mysql> SELECT cpe_name FROM product_info WHERE cpe_name LIKE '%cpe:2.3:a:hummingbird:cyberdocs:-%'; +-------------------------------------------------+ | cpe_name | +-------------------------------------------------+ | cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* | +-------------------------------------------------+ 1 row in set (0.10 sec)
So that took some considerable time, however still less than the REXEXP() case. The index didn’t help in this case, it seems:
mysql> EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name LIKE '%cpe:2.3:a:hummingbird:cyberdocs:-%';
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | product_info | ALL | NULL | NULL | NULL | NULL | 588094 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)
So apparently, the pattern was applied to all rows. It was still faster than REGEXP() making a very simple match. So the latter seems not to be optimized in MySQL.
And to wrap this up, an expression full with wildcards. Similar to the one I thought maybe REXEXP would do better:
mysql> SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:%:%:%:-:%:%:%:%:%:%'; +----------+ | COUNT(*) | +----------+ | 13205 | +----------+ 1 row in set (0.11 sec) mysql> EXPLAIN SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:%:%:%:-:%:%:%:%:%:%'; +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ | 1 | SIMPLE | product_info | ALL | PRIMARY | NULL | NULL | NULL | 588094 | Using where | +----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+ 1 row in set (0.00 sec)
So what happened here is that the initial part of the string didn’t help much, and the match seems to have been done on all rows. It took more or less the same time as the much simpler match pattern above.
Conclusion
There seems to be two main conclusions from this little set of experiments: The first one isn’t surprising: The most important factor is how many rows are being accessed, not so much what is done with them. And the second is that MySQL does some sensible optimizations when LIKE is used, in particular it narrows down the number of rows with the index, when possible. Something it won’t do with REGEXP(), even if with an “^” at the beginning of the regex.