This article covers common problems that arise when paginating over a large data set, and solutions to overcome them.
Scenario
When querying a large dataset, we often use
SELECT ... ORDER BY a, b, c LIMIT M, N
but do we know how this query actually performs?
Let’s look at the EXPLAIN output:
mysql> EXPLAIN SELECT * FROM pages ORDER BY likes LIMIT 10,10;
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
|  1 | SIMPLE      | pages | index | NULL          | likes | 5       | NULL |   20 |       |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM pages ORDER BY likes LIMIT 100,10;
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
|  1 | SIMPLE      | pages | index | NULL          | likes | 5       | NULL |  110 |       |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM pages ORDER BY likes LIMIT 1000,10;
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
|  1 | SIMPLE      | pages | index | NULL          | likes | 5       | NULL | 1010 |       |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM pages ORDER BY likes LIMIT 10000,10;
+----+-------------+-------+-------+---------------+-------+---------+------+-------+-------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows  | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+-------+-------+
|  1 | SIMPLE      | pages | index | NULL          | likes | 5       | NULL | 10010 |       |
+----+-------------+-------+-------+---------------+-------+---------+------+-------+-------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM pages ORDER BY likes LIMIT 100000,10;
+----+-------------+-------+-------+---------------+-------+---------+------+--------+-------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows   | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+-------+
|  1 | SIMPLE      | pages | index | NULL          | likes | 5       | NULL | 100010 |       |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+-------+
1 row in set (0.00 sec)
Notice that the rows column grows linearly with the offset (20, 110, 1010, 10010, 100010): MySQL walks the index from the start and reads all M + N rows before it can return the N rows we asked for. The result is similar when using LIMIT … OFFSET …. So, let’s remember some best practices when using ORDER BY with LIMIT, as follows (assume that we have a key key_a_b_c (a, b, c); a short illustration follows the list):
- ORDER BY a ASC, b DESC, c DESC will not use the index, since it mixes sort directions
- ORDER BY b, c will not use the index, since a is missing and an index matches its columns left-to-right (the leftmost-prefix rule)
- LIMIT N is much faster than LIMIT M, N (or LIMIT N OFFSET M) when M is large (for example, > 100k)
- LIMIT 10000, 20 means we read 10020 rows, throw away the first 10000 rows, and then return the next 20 rows
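To make the first two points concrete, here is a minimal sketch against a hypothetical table t (the table itself is an assumption for illustration; only the key key_a_b_c comes from the list above):

# Hypothetical table, created only to illustrate the rules above
CREATE TABLE t (
    a INT NOT NULL,
    b INT NOT NULL,
    c INT NOT NULL,
    KEY key_a_b_c (a, b, c)
);

# Uses key_a_b_c: leftmost prefix of the key, one sort direction
EXPLAIN SELECT * FROM t ORDER BY a, b, c LIMIT 20;

# Cannot sort via key_a_b_c: mixed ASC/DESC directions
EXPLAIN SELECT * FROM t ORDER BY a ASC, b DESC, c DESC LIMIT 20;

# Cannot sort via key_a_b_c: the leading column a is missing
EXPLAIN SELECT * FROM t ORDER BY b, c LIMIT 20;

For the last two queries, EXPLAIN will show Using filesort in the Extra column: MySQL has to sort the rows itself instead of reading them in index order.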
The solution(s)
- Principal concept: use an external, gapless sequential key that maps to the key of the target table.
- For example:
# Create a table to store sequences
CREATE TABLE seq (
    seq_no INT NOT NULL AUTO_INCREMENT,
    id INT NOT NULL,
    PRIMARY KEY (seq_no),
    UNIQUE (id)
);

# Create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM pages ORDER BY id;

# Now get 10 rows from offset 1000000
SELECT pages.*
FROM pages
INNER JOIN seq USING (id)
WHERE seq.seq_no >= 1000000 AND seq.seq_no < 1000010;
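This is fast because the WHERE condition on seq_no is a range scan over the primary key of seq: MySQL locates the ten requested sequence rows directly, no matter how deep the offset is, and the join then fetches the matching pages rows by id. Note that seq has to be rebuilt (or kept in sync, e.g. with triggers) whenever rows are added to or removed from pages. As a usage sketch, a 1-based page number maps onto a seq_no range like this (the @page and @size variables are assumptions for illustration):

# seq_no is gapless and starts at 1, so page P of size S covers
# seq_no in [(P - 1) * S + 1, P * S]
SET @page = 50001, @size = 20;

SELECT pages.*
FROM pages
INNER JOIN seq USING (id)
WHERE seq.seq_no BETWEEN (@page - 1) * @size + 1 AND @page * @size;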
- There are also some best practices for pagination from MySQLPerformanceBlog, as follows:
- On the first query, fetch and cache all the results. Now it’s easy to know how many results there are, and fetching subsequent pages is no extra work for the database. In this model, you get to keep your “found X rows, showing page N of M” display that many people cherish.
- Don’t show all results. Not even Google lets you see the millionth result. You get to see N results and after that, you’re done. Limit the results to 100 or 500 or something. Remember, the further you go into the list, the more rows you are scanning and discarding with that LIMIT. This technique works great in conjunction with the first one. If you want to show 500 results, maybe you can fetch 501 and if the 501st row exists, display “more than 500 results found.”
- Don’t show the total count or the intermediate links to other pages. Show only the “next” link. (If people want to see the “previous” results, they can use their browser’s back button.) You can do this by fetching one more result than you want to display — for example, fetch 21 rows and display only 20. If there’s a 21st row, render the “next” link; if not, you’re at the end of the list. This way you don’t have to calculate how many results there are, and if caching is difficult then this is a simple way to avoid some of the costs. A sketch of this trick follows the list.
- Estimate how many results there are. Again, Google does this and nobody complains. Use EXPLAIN and look at the “rows” column — that’s a fine estimate for some scenarios. (This tip doesn’t work in as many scenarios as others, but it’s still acceptable in many.)
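Here is a minimal sketch of the “fetch one extra row” trick from the third tip, reusing the pages table and likes ordering from the EXPLAIN examples above and assuming a page size of 20:

# Page size is 20: fetch 21 rows so we can tell whether a next page exists
SELECT * FROM pages ORDER BY likes LIMIT 0, 21;
# If 21 rows come back, display the first 20 and render the “next” link;
# if 20 or fewer come back, this is the last page, so omit the link.
# For page 2 the query becomes LIMIT 20, 21, and so on.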
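And a sketch of the last tip: instead of counting every matching row, read the optimizer’s estimate (the WHERE condition here is an assumption for illustration):

# An exact count scans all matching rows:
#   SELECT COUNT(*) FROM pages WHERE likes > 100;
# The estimate is nearly free: run EXPLAIN and read the rows column
# of its output as the approximate number of results.
EXPLAIN SELECT * FROM pages WHERE likes > 100;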