Serial key optimizations

Many database management systems provide some way to generate an integer sequence: SERIAL type in PostgreSQL, AUTO_INCREMENT modifier in MySQL, IDENTITY in MS SQL and so on. This is often used for generating surrogate primary keys for tables. But what about performance? In this post, I want to compare the performance of sequential vs random index inserts and talk about index structure.

Time comparison

I’m going to use my favourite PostgreSQL for demo purposes.

Start PostgreSQL using docker:

docker run \
	--name=postgres \
	--volume $(pwd)/data:/var/lib/postgresql/data \
	--env POSTGRES_HOST_AUTH_METHOD=trust \
	--detach \
	postgres:13.2

and then run psql shell:

docker exec \
	--interactive \
	--tty \
	postgres \
	psql --username postgres

I’m creating two tables with only one field:

CREATE TABLE nat (
	id BIGINT PRIMARY KEY
);

CREATE TABLE sur (
	id BIGSERIAL PRIMARY KEY
);

BIGINT and BIGSERIAL are similar data types - both 8-byte long integers, but BIGSERIAL automatically creates a sequence and defines a field equivalent to bigint NOT NULL DEFAULT nextval('sur_id_seq'::regclass).

Now with timing measures turned on (\timing on in psql shell) I am inserting the same amount of data to the tables:

postgres=# INSERT INTO nat(id)
	SELECT (random() * 10000000000000)::bigint FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 80216.521 ms (01:20.217)
postgres=# INSERT INTO sur SELECT * FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 68900.582 ms (01:08.901)

The difference is not huge, but noticeable. Now let’s take a look at tables and indexes size:

postgres=# select pg_size_pretty(pg_table_size('nat'));
 pg_size_pretty
----------------
 35 MB
(1 row)

postgres=# select pg_size_pretty(pg_table_size('sur'));
 pg_size_pretty
----------------
 35 MB
(1 row)

postgres=# select pg_size_pretty(pg_indexes_size('nat'));
 pg_size_pretty
----------------
 29 MB
(1 row)

postgres=# select pg_size_pretty(pg_indexes_size('sur'));
 pg_size_pretty
----------------
 21 MB
(1 row)

Tables sizes are equal, which was expected since I added the same count of rows of the same width, but indexes sizes are different. Why is that? To understand we need to explore B-tree index internal.

RDBMS indexes

The default type of index in many DBMS is a B+-tree (although there are exist other types of indexes) - self-balacing tree data structure.

Here’s how typical B+-tree would look like:

Let’s imagine we want to insert value 11 to our B-tree. DBMS should find the node to insert and in order to do so it should traverse from the root down to the leaf node:

The leaf node cannot fit an incoming tuple and it means that it should be split to two nodes. Page split must also insert a new downlink to the new page in the parent page, which causes the parent to split in turn. Page splits “cascade upwards” in a recursive fashion. (I’m not going to dive deep into how splits/merges work in B+-tree, but you can play with a great visualizing tool.

As a result of inserting one value several nodes had to be rewritten. The resulting tree looks following:

For the sake of simplicity in my example, every node can hold up to 2 values. In real DBMS every node occupies one page, which is 8kb for PostgreSQL for example, so every node potentially can hold hundreds of values (it really depends on value’s data type). In any case, split logic remains the same - if a page is full it gets split into two half-full pages.

Index size

Now let’s take a closer look at a node split. Here’s a fragment of B+tree with, each node can hold up to 4 items. We’re trying to add value 3 to the index. The node that should hold this value is highlighted in green:

The node is full, so it has to be split. RDBMSes split full nodes into two of relatively equal sizes and inserts the new value into one of them. See the result in the following image:

As a result, we get two half-empty pages (It doesn’t look half-empty on the picture, because 1 item is already 25% of the node, but if you remember the real page is 8k and can hold hundreds of items).

Now let’s see what happens when we’re trying to add a new item to the rightmost node and it’s full:

It’s a pretty common case so RDBMS treats it specially and violates the rule that any leaf B+tree node should be at least half-full: it simply allocates a new page on the right and doesn’t do a page split:

If we check PostgreSQL’s code it has following comment (source):

If the page is the rightmost page on its level, we instead try to arrange to leave the left split page fillfactor% full. In this way, when we are inserting successively increasing keys (consider sequences, timestamps, etc) we will end up with a tree whose pages are about fillfactor% full, instead of the 50% full result that we’d get without this special case.

That explains indexes sizes difference: when we always add increasing values (and they go to the rightmost leaf of a B+tree) we end up with densely packed pages and don’t have wasted space in the middle of the tree.

Insertion speed

So what about speed difference? As I mentioned earlier this always-increasing data pattern is super common in DBMS so developers added one more optimisation: the rightmost node is cached in memory and there’s no need to traverse the tree from the root node to find the actual leaf. We can go back to PostgreSQL source and check comments:

It’s very common to have an index on an auto-incremented or monotonically increasing value. In such cases, every insertion happens towards the end of the index. We try to optimize that case by caching the right-most leaf of the index. […] we simply search within the cached block and insert the key at the appropriate location. We call it a “fastpath”.

Some other DBMSes have this optimisation too: for example SQLite calls it quickbalance.

Conclusion

If you have the ability to choose your key (and index) it’s a good idea to opt for autoincremented value: it will save space and bust speed.