One of the lesser known features of modern SQL are so-called “Common Table Expressions” (CTE) or “WITH
queries”. I’ll explain the mental model that helped me make sense of them, and how to use them to execute recursive queries. Afterwards I’ll show how to apply these techniques in Django.
Syntactically a CTE consists of one or more statements marked with WITH
and placed before the statement they relate to, within the same SQL query. Conceptually these additional statements behave as if defining a view, or temporary table(s), that is valid only within this one SQL query.
The intended use is for simplifying complex or repeated operations and pulling them out of the main query. Let’s say you have normalized your database beyond all reason and have the following schema for storing names1:
CREATE TABLE name ( id bigint NOT NULL, name character varying(50) NOT NULL, CONSTRAINT name_pkey PRIMARY KEY (id) ); CREATE TABLE person ( id bigint NOT NULL, first_name_id bigint NOT NULL, last_name_id bigint NOT NULL, CONSTRAINT person_pkey PRIMARY KEY (id), CONSTRAINT first_name FOREIGN KEY (first_name_id) REFERENCES name (id), CONSTRAINT last_name FOREIGN KEY (last_name_id) REFERENCES name (id) );
Given this schema, you’d have to use something like SELECT CONCAT(first.name, ' ', last.name)
everywhere you wanted a full name, together with a join along the foreign keys.
Even this small example becomes tiresome pretty fast. It’s worse for more complex cases and gets even more complicated when you consider computed or aggregate functions.
The WITH
statement lets you extract the complications from your query and get them over with first. A query for full name substrings could look like this:
WITH full (id, name) AS ( SELECT p.id AS id, CONCAT(first.name, ' ', last.name) AS name FROM person p LEFT JOIN name first LEFT JOIN name last WHERE first.id = p.first_name_id, last.id = p.last_name_id ) SELECT id, name FROM full WHERE name LIKE '%om Ridd%';
This behaves as if a temporary table named full
with columns id
, name
is created and filled with the results from the first SELECT
statement (the CTE), just before executing the second, main, SELECT
statement. In the main SELECT
you do not need to worry about the JOIN
s or other details from the inside of the CTE. It will appear as if it were a normal table.
Multiple CTE in one query are possible, and neither the CTE nor the main query are limited to SELECT
. The PostgreSQL documentation has an example with DELETE … RETURNING
and INSERT INTO … SELECT
that moves rows between tables.
But so far, this is only syntactic sugar, an easier way to write something you can also express otherwise. You could achieve the same result with a (possibly materialized) view, though you’d have to modify the database schema. Or you could include everything from the CTE in your main query, but have to take great care not to trip up when using DISTINCT
or GROUP BY
. Or you have an ORM that will handle all the complicated queries for you and will allow reuse of complex expressions.
Then everything changes when you learn about WITH RECURSIVE
. The magic consists of two building blocks: First, a WITH RECURSIVE
statement may refer to itself. That is, in the SELECT
that fills data into your imaginary temporary table you can read data from the imaginary temporary table. This, in itself, is not very useful, since the table starts out empty. The second building block is extending the CTE SELECT
to include data from the start.
The canonical use case of a recursive query looks like this:
WITH RECURSIVE my_cte AS ( SELECT ... /* base case*/ FROM somewhere_else WHERE ... UNION SELECT ... /* recursive step */ FROM somewhere_else JOIN my_cte WHERE ... ) SELECT * FROM my_cte;
The two SELECT
statements in the CTE perform the functions of base case and recursive step in the definition of recursion. The semantics of WITH RECURSIVE
is such that it repeats the query, storing results in the imaginary temporary table, until no new rows are being added. In other words, the stop condition is implicitly fixed and cannot be altered. You have to make sure that it terminates eventually, for example with a LIMIT
clause.
An artificial example would be to create a sequence of numbers:
WITH RECURSIVE numbers (n) AS ( SELECT 1 AS n UNION SELECT numbers.n + 1 AS n FROM numbers ) SELECT * FROM numbers LIMIT 10;
This example shows that the base case needn’t necessarily come from another table. You can also see that in most simple cases the final, main, SELECT
is just a dummy SELECT *
(in this case with an added LIMIT
) because most of the heavy lifting is being done in the CTE.
Recursive CTEs allow SQL to do things not possible without. A prime example is operating on trees in adjacency list form. This is the simplest and most obvious way to represent hierarchical data, but without recursive CTEs it’s not possible to directly express some common queries2, such as retrieving all ancestors of a node.
To illustrate here’s an example with the required basics of a simple hierarchical tree structure:
CREATE TABLE node ( id bigint NOT NULL, parent_id bigint, name character varying (50) NOT NULL, CONSTRAINT node_pkey PRIMARY KEY (id), CONSTRAINT parent FOREIGN KEY (parent_id) REFERENCES node (id) );
Every node has a name and an optional foreign key reference to its parent node. It’s very easy to query for a node’s parent node, or all child nodes of a specific node. All other queries are more complicated, or impossible, without a recursive CTE. Let’s take this fictional table:
id | name | parent_id |
---|---|---|
1 | Enoch | NULL |
2 | Methuselah | 1 |
3 | Lamech | 2 |
4 | Noah | 3 |
5 | Shem | 4 |
6 | Ham | 4 |
7 | Japeth | 4 |
To get all children of Noah:
SELECT child.name FROM node child JOIN node parent ON child.parent_id = parent.id WHERE parent.name = 'Noah';
To get Lamech’s father:
SELECT parent.name FROM node child JOIN node parent ON child.parent_id = parent.id WHERE child.name = 'Lamech';
You could extend the queries with additional JOIN
s to also handle grandparents, great-grandparents and so on, but it’s impossible in normal SQL to handle arbitrarily long chains of ancestors or descendants. Now consider this recursive query for all of Japeth’s ancestors:
WITH RECURSIVE ancestors (id, name, parent_id) AS ( SELECT id, name, parent_id FROM node WHERE name = 'Japeth' UNION SELECT parent.id, parent.name, parent.parent_id FROM node parent JOIN ancestors child ON parent.id = child.parent_id ) SELECT name FROM ancestors;
The second SELECT
is a bit confusing. Here’s what happens: At first the imaginary temporary table contains only one row, corresponding to Japeth, the starting point for the following steps. The second SELECT
then, for every node already in ancestors
(now aliased to child
, because these are all children considered so far), finds its parent from node
and adds it to the ancestors
.
So at every step the ancestors
imaginary temporary table contains a set of all parents discovered so far, and all their parents (that is, all nodes that consider the existing nodes their child) are added until no more new parents are discovered.
A common variant is to include a path length variable to quantify the degree of relationship (and for example also be able to exactly query for all paths of a specific length). Another technique is to not pass through the entire row in the CTE, but only operate on the primary keys and then JOIN
for the remaining columns later. Let’s look at an example for all descendants of Methuselah:
WITH RECURSIVE descendants (id, n) AS ( SELECT id, 0 FROM node WHERE name = 'Methuselah' UNION SELECT child.id, parent.id+1 FROM node child JOIN descendants parent ON parent.id = child.parent_id ) SELECT descendants.n, node.name FROM descendants LEFT JOIN node ON descendants.id = node.id;
You should see a common pattern to keep in mind: In the second SELECT
the node
object (called child
) conceptually matches our CTE name (descendants
), while the CTE reference is the reverse (parent
)! We’re adding a node
child for every parent already in the CTE.
Bonus Round: CTEs in Django
The django-cte package allows using CTEs in the Django ORM with the normal query API. The equivalent Django code for the last example looks like this:
# model.py from django.db import models from django_cte import CTEManager class Node(models.Model): objects = CTEManager() id = AutoField(primary_key=True) name = models.CharField(max_length=50, null=False) parent = models.ForeignKey("self", null=True, on_delete=CASCADE)
from django.db.models import Value, IntegerField from django_cte import With from .models import Node descendants = With.recursive( lambda cte: Node.objects.filter( name="Methusaleh" ).values( "id", n=Value(0, output_field=IntegerField()) ).union( cte.join(Node, parent_id=cte.col.id).values( "id", n=cte.col.n + Value(1, output_field=IntegerField()) ) ) ) descendants_set = descendants.join( Node, id=descendants.col.id ).with_cte(descendants).annotate( n=descendants.col.n )
A few observations to keep in mind:
- In Django all database expressions need to start with a
QuerySet
. - The
django_cte.With
object wraps aQuerySet
to allow additional methods, includingwith_cte()
which adds the common table expression. Remember that in SQL the CTE goes before the main query (the QuerySet), which might be confusing here. - In order to map the self-referential nature of
WITH RECURSIVE
to Python syntax,django_cte.With.recursive()
takes a callable that produces a QuerySet. The callable receives a reference to the CTE. django_cte.With.recursive()
needs to beJOIN
ed to the underlying Model to be useful. You also need to.annotate()
any computed columns from your CTE to use them outside of the CTE.
Don’t. Instead, please read Falsehoods Programmers Believe About Names ↩
Workarounds include nested sets and materialized paths, but with additional requirements on the data structures and some maintenance cost on updates ↩