Understanding Common Table Expressions in SQL

One of the lesser known features of modern SQL are so-called “Common Table Expressions” (CTE) or “WITH queries”. I’ll explain the mental model that helped me make sense of them, and how to use them to execute recursive queries. Afterwards I’ll show how to apply these techniques in Django.

Syntactically a CTE consists of one or more statements marked with WITH and placed before the statement they relate to, within the same SQL query. Conceptually these additional statements behave as if defining a view, or temporary table(s), that is valid only within this one SQL query.

The intended use is for simplifying complex or repeated operations and pulling them out of the main query. Let’s say you have normalized your database beyond all reason and have the following schema for storing names1:

    id bigint NOT NULL,
    name character varying(50) NOT NULL,
    CONSTRAINT name_pkey PRIMARY KEY (id)
    id bigint NOT NULL,
    first_name_id bigint NOT NULL,
    last_name_id bigint NOT NULL,
    CONSTRAINT person_pkey PRIMARY KEY (id),
    CONSTRAINT first_name FOREIGN KEY (first_name_id)
        REFERENCES name (id),
    CONSTRAINT last_name FOREIGN KEY (last_name_id)
        REFERENCES name (id)

Given this schema, you’d have to use something like SELECT CONCAT(first.name, ' ', last.name) everywhere you wanted a full name, together with a join along the foreign keys.

Even this small example becomes tiresome pretty fast. It’s worse for more complex cases and gets even more complicated when you consider computed or aggregate functions.

The WITH statement lets you extract the complications from your query and get them over with first. A query for full name substrings could look like this:

WITH full (id, name) AS (
    p.id AS id,
    CONCAT(first.name, ' ', last.name) AS name
    person p
    LEFT JOIN name first
    LEFT JOIN name last
    first.id = p.first_name_id,
    last.id = p.last_name_id
SELECT id, name FROM full
WHERE name LIKE '%om Ridd%';

This behaves as if a temporary table named full with columns id, name is created and filled with the results from the first SELECT statement (the CTE), just before executing the second, main, SELECT statement. In the main SELECT you do not need to worry about the JOINs or other details from the inside of the CTE. It will appear as if it were a normal table.

Multiple CTE in one query are possible, and neither the CTE nor the main query are limited to SELECT. The PostgreSQL documentation has an example with DELETE … RETURNING and INSERT INTO … SELECT that moves rows between tables.

But so far, this is only syntactic sugar, an easier way to write something you can also express otherwise. You could achieve the same result with a (possibly materialized) view, though you’d have to modify the database schema. Or you could include everything from the CTE in your main query, but have to take great care not to trip up when using DISTINCT or GROUP BY. Or you have an ORM that will handle all the complicated queries for you and will allow reuse of complex expressions.

Then everything changes when you learn about WITH RECURSIVE. The magic consists of two building blocks: First, a WITH RECURSIVE statement may refer to itself. That is, in the SELECT that fills data into your imaginary temporary table you can read data from the imaginary temporary table. This, in itself, is not very useful, since the table starts out empty. The second building block is extending the CTE SELECT to include data from the start.

The canonical use case of a recursive query looks like this:

  SELECT ... /* base case*/
    FROM somewhere_else
    WHERE ...
  SELECT ... /* recursive step */
    FROM somewhere_else JOIN my_cte
    WHERE ...
SELECT * FROM my_cte;

The two SELECT statements in the CTE perform the functions of base case and recursive step in the definition of recursion. The semantics of WITH RECURSIVE is such that it repeats the query, storing results in the imaginary temporary table, until no new rows are being added. In other words, the stop condition is implicitly fixed and cannot be altered. You have to make sure that it terminates eventually, for example with a LIMIT clause.

An artificial example would be to create a sequence of numbers:

WITH RECURSIVE numbers (n) AS (
  SELECT numbers.n + 1 AS n
    FROM numbers
SELECT * FROM numbers LIMIT 10;

This example shows that the base case needn’t necessarily come from another table. You can also see that in most simple cases the final, main, SELECT is just a dummy SELECT * (in this case with an added LIMIT) because most of the heavy lifting is being done in the CTE.

Recursive CTEs allow SQL to do things not possible without. A prime example is operating on trees in adjacency list form. This is the simplest and most obvious way to represent hierarchical data, but without recursive CTEs it’s not possible to directly express some common queries2, such as retrieving all ancestors of a node.

To illustrate here’s an example with the required basics of a simple hierarchical tree structure:

    id bigint NOT NULL,
    parent_id bigint,
    name character varying (50) NOT NULL,
    CONSTRAINT node_pkey PRIMARY KEY (id),
    CONSTRAINT parent FOREIGN KEY (parent_id)
        REFERENCES node (id)

Every node has a name and an optional foreign key reference to its parent node. It’s very easy to query for a node’s parent node, or all child nodes of a specific node. All other queries are more complicated, or impossible, without a recursive CTE. Let’s take this fictional table:


To get all children of Noah:

SELECT child.name
  FROM node child
    JOIN node parent ON child.parent_id = parent.id
  WHERE parent.name = 'Noah';

To get Lamech’s father:

SELECT parent.name
  FROM node child
    JOIN node parent ON child.parent_id = parent.id
  WHERE child.name = 'Lamech';

You could extend the queries with additional JOINs to also handle grandparents, great-grandparents and so on, but it’s impossible in normal SQL to handle arbitrarily long chains of ancestors or descendants. Now consider this recursive query for all of Japeth’s ancestors:

WITH RECURSIVE ancestors (id, name, parent_id) AS (
  SELECT id, name, parent_id
    FROM node
    WHERE name = 'Japeth'
  SELECT parent.id, parent.name, parent.parent_id
    FROM node parent
      JOIN ancestors child ON parent.id = child.parent_id
SELECT name FROM ancestors;

The second SELECT is a bit confusing. Here’s what happens: At first the imaginary temporary table contains only one row, corresponding to Japeth, the starting point for the following steps. The second SELECT then, for every node already in ancestors (now aliased to child, because these are all children considered so far), finds its parent from node and adds it to the ancestors.

So at every step the ancestors imaginary temporary table contains a set of all parents discovered so far, and all their parents (that is, all nodes that consider the existing nodes their child) are added until no more new parents are discovered.

A common variant is to include a path length variable to quantify the degree of relationship (and for example also be able to exactly query for all paths of a specific length). Another technique is to not pass through the entire row in the CTE, but only operate on the primary keys and then JOIN for the remaining columns later. Let’s look at an example for all descendants of Methuselah:

WITH RECURSIVE descendants (id, n) AS (
  SELECT id, 0
    FROM node
    WHERE name = 'Methuselah'
  SELECT child.id, parent.id+1
    FROM node child
      JOIN descendants parent ON parent.id = child.parent_id
SELECT descendants.n, node.name
  FROM descendants
    LEFT JOIN node ON descendants.id = node.id;

You should see a common pattern to keep in mind: In the second SELECT the node object (called child) conceptually matches our CTE name (descendants), while the CTE reference is the reverse (parent)! We’re adding a node child for every parent already in the CTE.

Bonus Round: CTEs in Django

The django-cte package allows using CTEs in the Django ORM with the normal query API. The equivalent Django code for the last example looks like this:

# model.py
from django.db import models
from django_cte import CTEManager

class Node(models.Model):
  objects = CTEManager()
  id = AutoField(primary_key=True)
  name = models.CharField(max_length=50, null=False)
  parent = models.ForeignKey("self", null=True, on_delete=CASCADE)
from django.db.models import Value, IntegerField
from django_cte import With
from .models import Node

descendants = With.recursive(
  lambda cte: Node.objects.filter(
      n=Value(0, output_field=IntegerField())
      cte.join(Node, parent_id=cte.col.id).values(
        n=cte.col.n + Value(1, output_field=IntegerField())

descendants_set = descendants.join(
  Node, id=descendants.col.id

A few observations to keep in mind:

  • In Django all database expressions need to start with a QuerySet.
  • The django_cte.With object wraps a QuerySet to allow additional methods, including with_cte() which adds the common table expression. Remember that in SQL the CTE goes before the main query (the QuerySet), which might be confusing here.
  • In order to map the self-referential nature of WITH RECURSIVE to Python syntax, django_cte.With.recursive() takes a callable that produces a QuerySet. The callable receives a reference to the CTE.
  • django_cte.With.recursive() needs to be JOINed to the underlying Model to be useful. You also need to .annotate() any computed columns from your CTE to use them outside of the CTE.

  1. Don’t. Instead, please read Falsehoods Programmers Believe About Names 

  2. Workarounds include nested sets and materialized paths, but with additional requirements on the data structures and some maintenance cost on updates 

Showing SQL Queries with Pytest and Django

I have a Django based project, and am doing unit tests with py.test. To debug a test failure it’s sometimes useful to see the actual SQL queries that Django emitted, which is surprisingly hard. I assumed that that would be such an obvious and common need, that a simple switch (for pytest-django) or easy plugin would exist to simply output SQL queries as they are executed.

It is a common need alright (1, 2, 3), but the correct solution is surprisingly unwieldy1.

For one, there is no existing helper or plugin. There are helpers and plugins to count queries and assert a certain query count, which as a side effect track all queries and print the executed queries on query count assertion failure, but I’ve yet to find any case where that would be useful to me. More importantly it’s useless for the exact case here: The stored list of queries is only printed if the expected query count is not matched, not in any other case, such as, say, a failing unit test which you’d want to debug by inspecting the queries that were executed.

Therefore: Fuck it, let’s do it live. Django tracks all queries in the connection object, but in general only if DEBUG=True. For various reasons, tests are executed with DEBUG=False, which is a good thing, since you want to test close to production. Django does provide a context helper to temporarily enable query tracking on a connection which we’ll use instead.

Putting it together, we need to transform a humble test such as

def test_frobnicate_foo(foo):
    assert foo.frobnicate()


def test_frobnicate_foo(foo):
    from django.db import connection
    from django.test.utils import CaptureQueriesContext
    with CaptureQueriesContext(connection):
        assert foo.frobnicate(), connection.queries[0]['sql']

in order to see the value of the first SQL query in case of assertion failure.

At some point someone™ should write a generic plugin to do that.

  1. There are several incorrect solutions on StackOverflow, such as the one that starts with “First, subclass TestCase”, which doesn’t apply to py.test, or the ever helpful “try using django-debug-toolbar”, which doesn’t apply to unit tests in general.