Henryk Plötz

Docker Best Practices: Read-Only Containers

“It works on my machine.”
“OK, we’ll ship your machine.”
The origins of Docker 😉

The core value proposition of Linux containers (e.g. Docker) is to provide an isolated, repeatable execution environment for software. To this end, a container image provides a fully prepared root filesystem, together with a recommended execution command line, that are suitable for running one instance of the software, called a container.

The distinction is important: An image is an immutable artifact, fully checksummed, possibly signed, and used to create an arbitrary number of container instances, that all behave identically when executed. The glue in-between is the container runtime which sets up containers from images to resemble mostly normal-looking computing systems that can execute the software stored on their root filesystems.

Containers look like normal environments

One of the key ideas involved here is that the environment inside a container looks so much like a normal (Linux) environment that most software will run unaltered. In the general case, programs require no modification, and no knowledge of containers at all, to successfully run in a containerized environment. One concession the runtime makes to this execution model is that it allows writes to the container root filesystem (subject to normal access control permissions), since this is usually possible in normal environments. Obviously a container must not be allowed to write to the underlying image (which would affect other containers based on the same image), so these writes are private to each container.

How each container runtime achieves this is an implementation detail that’s not relevant to the general discussion. In the case of Docker on Linux, it will use union mounts to create a stack with the common read-only image below and an empty container-specific read-write directory at the top. Writes go to the container-specific directory, while reads come either from there or from the image.

In fact, this mechanism is applied recursively: The “image” is also just a stack of union mounts of different layers. The temporary directory of a container can be committed to become the top-most layer in a new image. This is how container images are originally built in the Docker model.
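The layering can be modeled in a few lines of Python. This is a toy sketch, not how Docker’s storage driver works internally: dictionaries stand in for directories, file deletion is ignored, and the names new_container, base_layer and config_layer are made up for illustration.

```python
from collections import ChainMap

# Read-only image layers, lowest first: path -> file content.
base_layer = {"/bin/app": "app binary", "/etc/app.conf": "defaults"}
config_layer = {"/etc/app.conf": "tuned config"}
image = [base_layer, config_layer]


def new_container(layers):
    """Stack an empty, container-private writable dict on top of the image layers."""
    # ChainMap looks keys up left to right and sends every write to the first
    # mapping, which mirrors "reads fall through, writes stay on top".
    return ChainMap({}, *reversed(layers))


a = new_container(image)
b = new_container(image)

a["/etc/app.conf"] = "edited in container a"  # shadows the image file, only in a
a["/var/lib/app/state"] = "some state"        # new file, only in a

print(a["/etc/app.conf"])  # edited in container a
print(b["/etc/app.conf"])  # tuned config
print(a["/bin/app"])       # app binary (read falls through to the lowest layer)

# "Committing" a container: its private top layer becomes the
# top-most layer of a new image.
new_image = image + [dict(a.maps[0])]
```

Both containers share the same image dictionaries, yet neither can change them through its own view, which is the property the union mount provides.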

This is, however, not the only conceivable implementation. It’s fully possible that a runtime would just unpack the container image into an empty directory, which then serves as the container root directory, with no union mount necessary.

Containers are updated by updating images

One container is always bound to exactly one image, constituting its root directory. There is no defined way to change (e.g. update) the image underlying a container. The solution to this is to just create a new container based on a new (updated) image and remove the old container. This works fine for software that is entirely stateless. The container is just an ephemeral artifact to provide the necessary environment for the code, and can be started and stopped and removed at will (usually under control of some container management system). This is true even if the software keeps state: All mutable state should be stored in volumes (usually implemented as bind mounts) outside of the container (and image). Usually the runtime even provides for temporary directories (on Linux: tmpfs) that are valid for only one boot and not persisted anywhere.

This convention provides for very clean separation of concerns:

Executable code is in the image: read-only, checksummed, updates under control of the container manager
Mutable state is on volumes: read-write, outside the container, can be backed up separately
Temporary state is in tmpfs, not persisted and handled automatically

Containers that modify themselves are problematic

Except for one problem: The container root filesystem itself is still read-write. Badly written, or misconfigured, software could still store state to the container itself. Worse, it might even modify code paths in the container. This has two possible consequences:

State stored in the container is lost on container re-creation (e.g. due to image update)
The execution environment is no longer repeatable, since it might depend on a previous execution of the container (different from other instances of the same image)

The solution is simple, but comes with a slew of consequences: Make the container root filesystem read-only. Docker has the --read-only command-line flag, Docker Compose has the read_only: true service configuration. In Docker, this has the effect of creating a container that contains just the image as root filesystem, and volume mounts, but cannot be written to. Self-evidently this is the right thing to do: All mutations should either go to mutable or temporary state, and the container should only contain code or immutable data and must never be changed anyway. Personally, one of my biggest gripes with Docker is that this is not the default mode of operation.

Strategies for handling read-only containers

Since read-only containers are not the Docker default, there are a lot of container images out there (many of them official Docker Hub images of their respective projects) that won’t work out of the box in read-only mode. And some won’t work at all.

Runtime directories

The most common (and legitimate) reason for a container image not working in read-only mode is that the software requires a writable /run or /var/run directory. Per the Filesystem Hierarchy Standard, this directory should contain boot-level temporary state, and it’s fully appropriate for software to need to write there. Unfortunately the current container image specification has no way to declare that a container image wants these directories, so they cannot be provided by the runtime automatically.
Solution: Find out which directories the application needs and use a tmpfs mount. For example for PostgreSQL in Docker Compose: tmpfs: ["/tmp", "/var/run/postgresql"]
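For plain docker run, the same setup can be expressed with the --read-only and --tmpfs flags. The helper below only assembles the command line and is a sketch: the image tag, the volume name pgdata, and the data directory are assumptions for illustration, so check them against the image you actually use.

```python
import shlex


def read_only_run_argv(image, tmpfs_dirs=(), volumes=None):
    """Build a `docker run` argument list for a container with a read-only root filesystem."""
    argv = ["docker", "run", "--read-only"]
    for path in tmpfs_dirs:
        # Temporary state: writable, but not persisted anywhere.
        argv += ["--tmpfs", path]
    for name, target in (volumes or {}).items():
        # Mutable state that must survive container re-creation.
        argv += ["--volume", f"{name}:{target}"]
    argv.append(image)
    return argv


argv = read_only_run_argv(
    "postgres:16",                                   # assumed image tag
    tmpfs_dirs=["/tmp", "/var/run/postgresql"],      # same paths as the Compose example
    volumes={"pgdata": "/var/lib/postgresql/data"},  # assumed volume name and data path
)
print(shlex.join(argv))
```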

Temporary state

tmpfs is also the solution to many other similar problems. Some software will want to write temporary state to weird directories and you’ll have to analyze the startup error message it gives and find out what it wants. For example, I have Keycloak running with a tmpfs entry of /opt/keycloak/data/tmp:mode=1777. At this stage you should also distinguish whether it really is temporary state or more likely mutable state to be kept across invocations and versions, which should be a normal data volume.

Commingled code

Some software, this is especially common with PHP projects, likes to commingle code and data, by writing to its own installation directory. Sometimes there are configuration directives you can set to point it to write somewhere else, into a data volume.
In some cases this problem cannot be solved at all, WordPress being among them. In that case, the container image is just used as a template for initializing a data volume with the code, and then code updates happen only within the data volume.

External configuration override

Some container images are prepared so that they apply external configuration to the installed software, by modifying configuration or startup files on container startup, before calling the target software itself. Preferably all such customization only happens in /tmp/ (see above), or can be configured to only happen in such temporary directories. This may be difficult in some cases, so sometimes you’ll want to apply the configuration by hand, externally to the container, and mount the resulting configuration file read-only into the container, disabling the built-in customization (e.g. when handling /etc/nginx/).

Lost causes

And finally, some container images are built in batshit reckless mode, running something like apt update && apt install foo && foo (or equivalent) at startup. That is, they install the software or some dependencies only at container runtime, sometimes every time. This of course throws out all benefits of reproducibility and reliability. The container image doesn’t even contain the software it’s supposed to run. Your only choice is to not use these container images at all and use the fallback below. And maybe give feedback to the image creator.

Benefits of read-only containers

Having the container root filesystem read-only fits into the general scheme of containerized deployment, and frankly should’ve been the default, and offers operational benefits:

Implementing write-xor-execute on the filesystem level (in combination with all data mounts being set to noexec, which is the default for Docker tmpfs mounts, but not for volumes or bind mounts). This is important for a consistent, repeatable, reproducible (in conjunction with all configuration data) execution flow. All code executed as part of the container is part of the container image. When analyzing the execution environment for software vulnerabilities (e.g. in an SBOM approach), only the image needs to be analyzed. No code can come to be executed except through code paths present in the image.
Discouraging expensive container modifications at startup (where installing software is the worst case), to reduce startup time. An ideal container image is fully set up and prepared and will directly execute the target application at startup.
Cleanly separating code from data allows for targeted backup policies. Immutable code images can be backed up separately (for example at the registry level) and do not need to be part of expensive data backups.

Fall-back strategy for read-only containers

Since read-only containers are not the Docker default, and many non-compliant images exist, you’ll need a strategy to handle these broken images. The general idea here is to create a new, local, image that is ready to run in read-only mode. We’ll have to bite one bullet, though, and in this solution we’ll have to do away with the separation between container build environment and run environment, at least in the simplest case. If you do have a fully set up container build environment and registry, you can use that instead of the integrated approach below.

Approach: Use Docker Compose to build a local image based on the original image that includes all container startup modifications in the image itself, and then execute this as a normal read-only container.

# compose.yml
services:
  app:
    pull_policy: build
    build:
      pull: true
      context: src/app
    read_only: true
    # Other configuration here

# src/app/Dockerfile
FROM original-app

# Do what the original container image would do on container start
RUN mkdir /foo/etc/pp

# Override the entrypoint to bypass the original container self-modification
ENTRYPOINT ["/usr/bin/app"]

Starting this Docker Compose project will build a custom image that includes the runtime modifications as part of the image creation, and then start this image in read-only mode. The exact modifications and command-line are specific to the individual case and may require some reverse engineering of the original image.

Or, of course, you can fully abandon the original image and just build a proper one yourself. Maybe send a pull request to the original maintainer then.

How To Use PyJWT With Django In A Resource Server And Still Keep Parts Of Your Sanity

The OAuth2 spec cleanly separates the role of Authorization Server (AS) from that of Resource Server (RS). The role of the AS, and the whole OAuth2 dance, is to get an access token that will be accepted by an RS.

[Figure: Figure 1 from the OAuth2 spec, RFC 6749, showing four parties. The Client communicates with the Resource Owner to get authorization, with the Authorization Server to get an access token, and then uses the access token with the Resource Server to get access to the service.]

It’s puzzling. It should be easy, nay, trivial, to implement the Resource Server side in Django, yet it’s not. There are several libraries whose description can be interpreted as “implementing OAuth2”, yet what they all do is implement the Authorization Server side. I want to consume access tokens, not issue them!

[Image: Donnie Darko “I made a new friend” meme template. Patient: “I made a new OAuth2 implementation for Python.” Therapist: “OAuth2 implementation, or authorization server?” Patient (defeated look): “Authorization server.”]

(Then of course there’s djangorestframework-simplejwt whose sole, primary, and exclusive functionality is to implement the most stupid JWT pattern known to humankind.)

Now, in theory the access token could be anything. But common Authorization Server implementations (Keycloak, authentik, various hosted services) have converged on issuing signed JSON Web Tokens. So what the resource server needs is to be configured with the key to verify the tokens. We could conceivably hard code the key in source or configuration, but that is a bit of a hassle, and anyway, this is the third decade of the third millennium; quite frankly we shouldn’t have to. All server implementations offer a JWKS endpoint where the currently valid keys can be queried (and even a full autodiscovery endpoint, to discover the JWKS endpoint). An implementation of a resource server should, in theory, only need to be pointed at the JWKS endpoint and everything should just work.
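As a sketch of that autodiscovery step: OpenID Connect providers publish a metadata document at /.well-known/openid-configuration below the issuer URL, and its jwks_uri field names the JWKS endpoint. The issuer URL and the sample document below are made up; a real client would fetch the document over HTTPS instead of using a literal.

```python
import json


def discovery_url(issuer: str) -> str:
    """Return the OpenID Connect discovery URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def jwks_uri_from_metadata(document: str) -> str:
    """Extract the JWKS endpoint from a discovery document given as JSON text."""
    metadata = json.loads(document)
    try:
        return metadata["jwks_uri"]
    except KeyError:
        raise ValueError("discovery document has no jwks_uri") from None


# Hypothetical issuer and a trimmed-down, made-up discovery document.
issuer = "https://as.example.com/realms/demo"
sample = json.dumps({
    "issuer": issuer,
    "jwks_uri": issuer + "/protocol/openid-connect/certs",
})

print(discovery_url(issuer))
print(jwks_uri_from_metadata(sample))
```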

The PyJWT documentation has something like this for the purpose:

import jwt

token = "..."
jwks_client = jwt.PyJWKClient(url)
signing_key = jwks_client.get_signing_key_from_jwt(token)
data = jwt.decode(token, signing_key.key, algorithms=["RS256"])

We want to authorize requests. Every time a new request comes in. That’s what a Resource Server does. It uses the provided token to check authorization. And the documentation seems to suggest that the correct way to do this is to fetch the keys from the JWKS endpoint on every request.

WTF?

We’ll need some caching. The documentation is mum on the topic. The implementation however is not. Turns out, they have implemented a cache. Only, they have implemented it on the PyJWKClient object itself. And there’s no easy way to hook up a real cache (such as Django’s).

The usual flow for normal Python web frameworks is that no object survives from request to request. Each request gets a clean slate. They may run in the same process sequentially. In different threads in the same process. In multiple processes. Or even async in the same thread in the same process. With the given example code we would be hitting the authorization server JWKS endpoint for every incoming request, adding huge latencies to processing.

In order to retain even a shred of sanity, we have no choice but to turn the PyJWKClient into a kind of singleton. It looks like this:

from typing import Optional

import jwt
from django.conf import settings


_jwks_client: Optional[jwt.PyJWKClient] = None


def get_jwks_client() -> jwt.PyJWKClient:
    # PyJWKClient caches responses from the JWKS endpoint *inside* the PyJWKClient object,
    # so keep one instance per process instead of creating one per request.
    global _jwks_client
    if _jwks_client is None:
        _jwks_client = jwt.PyJWKClient(settings.JWKS_URI)
    return _jwks_client

With this definition in place you can get the signing key as signing_key = get_jwks_client().get_signing_key_from_jwt(token) and will at least get some caching within a process, until the server decides to spawn a new process.
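The effect is easy to demonstrate without PyJWT or a network. In the sketch below, CountingClient is a made-up stand-in that, like PyJWKClient, caches on the instance, and functools.lru_cache is used as an alternative way to get one lazily created instance per process. It illustrates the pattern only and says nothing about PyJWT’s actual cache lifetime.

```python
import functools


class CountingClient:
    """Stand-in for a JWKS client: caches keys on the instance and counts fetches."""

    fetches = 0  # class-wide counter of simulated network round trips

    def __init__(self, uri):
        self.uri = uri
        self._key = None

    def get_signing_key(self):
        if self._key is None:
            CountingClient.fetches += 1  # this is where the HTTP request would happen
            self._key = "key-material"
        return self._key


@functools.lru_cache(maxsize=None)
def get_client(uri="https://as.example.com/jwks"):  # hypothetical URL
    """Create the client on first use, then hand out the same instance."""
    return CountingClient(uri)


# Naive: a fresh client per "request", so the instance cache never helps.
for _ in range(3):
    CountingClient("https://as.example.com/jwks").get_signing_key()
naive_fetches = CountingClient.fetches

# Shared: one client per process, one fetch for all three "requests".
CountingClient.fetches = 0
for _ in range(3):
    get_client().get_signing_key()
shared_fetches = CountingClient.fetches

print(naive_fetches, shared_fetches)  # 3 1
```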

Then, to hook up authentication into Django REST Framework you’ll do something like this (where User.from_token needs to be something that can turn a verified JWT dict into a User object):

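import jwt
from rest_framework.authentication import get_authorization_header

# User is your own class; it must provide from_token() as described above.


def authenticate_request(request):
    # Returns (user, claims) for a valid Bearer token, or None if there is no usable header.
    # jwt.decode() raises an exception if the token is invalid or expired.
    if header := get_authorization_header(request):
        match header.split():
            case b"Bearer", token_bytes:
                token = token_bytes.decode("us-ascii", errors="strict")
                signing_key = get_jwks_client().get_signing_key_from_jwt(token)
                data = jwt.decode(token, signing_key.key, algorithms=["RS256"])

                if data:
                    return User.from_token(data), data
    return None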

Making Good Bug Reports

Many, many years ago, this was with Bugzilla in the early 2000s, I got my first automated lecture on what constitutes a good bug report. I probably didn’t pay attention. Since then, I’ve seen this list countless times, in various levels of detail, across a broad array of systems:

What did you do?
What happened?
What did you expect to happen?

Over the last few years I’ve come to realize that this list is irreducible, if you’re losing one item you lose important context, and represents a kind of deep wisdom:

What Did You Do? — If we cannot see the steps that brought you into the situation, it’ll be hard to find the place in the program where it happens. It’ll also set up our mental model of the program in question to see what we think should happen.
Preferably this should be detailed and reliable enough to reproduce the problem on our side. Things that cannot be reliably reproduced are very hard to fix, because you’ll never truly know if they’re gone.
What happened? — This gives context on what happened for you, which might be different for us, indicating some other issue. In some cases this is what we thought should happen, so this also gives a clear statement to set up the next point.
What did you expect to happen? — Stating how your expectation differs from reality is what makes this a bug. You’re not reporting issues where the system does what you expected it should do. But this expectation might differ from what we were expecting. The issue need not be in the code or the implementation, but might be somewhere else. Maybe the documentation gave you a wrong idea on what should happen?

Sometimes a bug report can be succinct but still contain all three items: “I clicked on save. It did not save. I expected it to save.” Though in this case the first part really should be longer, because this is probably something that only happens under certain circumstances. And even if part 3 is only “I expected it to work”, that’s good to write out.

Bug reports consisting of a single screenshot, for example of an error message, are often not helpful. They, more or less, cover part 2, but leave out important context. It may not be obvious from the screenshot on how to get there. And it’s as likely as not that we think that this is the expected behavior. You should state why you think this error message is, as it were, in error.

The three parts of a good bug report are interlocking. Like describing the way to the train station to a stranger. You’re not going to describe it as “Turn left second street, go right first street, go right third street.” You’re giving context: “Go down this street and turn left at the second intersection, right behind the flower shop. You should see the church in front of you, turn immediately right and go into the small alley. If you then turn right at the third street you should see the train station in the distance.”

This is redundant. But redundancy is good. It allows for error checking and correction. It tolerates errors both in the environment and in your mental model or description of it.

So, repeat after me: What did you do? What happened? What did you expect to happen?

Docker Deployment Best Practices

Given: There’s a CI system that automatically builds docker images from your VCS (e.g. git). We use self-hosted gitlab.
Goal: Both initial and subsequent automated deployments to different environments (staging and production).

Rejected Approaches

Most existing blog articles and howtos for this use case, specifically in the context of gitlab, tend to be relatively simple, relatively easy, and very very wrong. The biggest issue is with root access to the production server. I believe that developers (and the CI/CD system) should not have full root access to the production system(s), to retain a semblance of separation in case of breaches. Yes, sure, a malicious developer could still check in bad code which might eventually get deployed to production, but there is (should be) a review process before that, and traces in the VCS.

And yet, most recommendations on how to do deployment with gitlab circle around one of two approaches:

Install a gitlab “runner” on your production server. That is, an agent which gets commands from the gitlab server and executes them. This runner needs full root access (or, equivalently, docker daemon access), thus giving the gitlab server (and anyone who has/gains control over it) full root access to the production system(s).
This approach also needs meticulous management of the different runners, since they are now being used not just for build purposes but also have a second, distinct, duty for deployment.
Use your normal gitlab runners that are running somewhere else, but explicitly give them root access to the target servers, e.g. with a remote SSH login.
Again, this gives everybody in control of the gitlab server full production access, as well as anybody in control of one of the affected runners. Usually this is made less obvious by “only” giving docker daemon access, but that’s still equivalent to full root access.

There’s variants on this theme, like using Ansible for some abstraction, but it always boils down to somehow making it so that the gitlab server is capable of executing arbitrary commands as root on the production system.

Our Approach

For container management we’re going to use docker compose, the new one, not docker-compose. A compose.yaml file (with extensions, see below) is going to fully describe the deployment, and compose will take care of container management for updates.

Ideally we want to divide the task into two parts:

Initial setup
Continuous delivery

For the initial setup there’s not a pressing reason for full automation. We’re not setting up new environments all the time. There’s still some best practices and room for automation, see below, but in general it’s a one-time process executed with high privileges.

The continuous updates on the other hand should be fast, automated, and, above all, restricted. An update to a deployed docker application does exactly one thing: pull new image(s) then restart container(s).

Restricted SSH keys for update deployment

Wouldn’t it be great if we had an agent on the production server that could do that, and only that? Turns out, we have! Using additional configuration on ~/.ssh/authorized_keys we can configure a public key authenticated login that will only execute a (set of) predefined command(s), and nothing else¹. And since sshd is already running and exposed to the internet anyway, we don’t get any new attack surface.

The options we need are:

restrict to disable, roughly, all other functionality
command="cd ...; docker compose pull && docker compose up -d" which will make any login with that key execute only this command (you’ll need to fill in the path to cd into).
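Put together, the two options form the prefix of a single authorized_keys line. The following sketch assembles such a line; restricted_key_line is a made-up helper, the key is a placeholder, and the only escaping it handles is the backslash-quoting of double quotes that sshd expects inside command="...".

```python
import shlex


def restricted_key_line(public_key: str, project_dir: str) -> str:
    """Build an authorized_keys line that can only pull and restart one compose project."""
    command = (
        f"cd {shlex.quote(project_dir)}; "
        "docker compose pull && docker compose up -d"
    )
    # Inside command="...", a double quote has to be written as \"
    escaped = command.replace('"', '\\"')
    return f'restrict,command="{escaped}" {public_key.strip()}'


line = restricted_key_line(
    "ssh-ed25519 AAAAexample pull-key\n",   # placeholder, not a real key
    "/home/deploy-user/frobnicator",        # hypothetical project directory
)
print(line)
```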

Using docker compose

In order for this to seamlessly work, there’s some best practices to follow when creating the environment:

All container configuration is handled by the docker compose framework
- Specifically: docker compose up just works.
  No weird docker compose -f compose.foo.yaml -f compose.bar.yaml -e WTF_AM_I_DOING=dunno up incantations.
The docker compose configuration should itself be version controlled
The containers come up by themselves in a usable configuration, and can handle container updates gracefully
- For example in Django, the django-admin migrate call must be part of the container startup
- In general, updates must not require manually executing commands in the containers or the compose environment. You’re allowed to require one necessary initialization command on first setup, under extenuating circumstances only.
There’s also good container design (topic of a different blog post) with regards to separation of code and data

Good docker compose setup

There’s two ways to handle the main compose configuration of a project: As part of the git repository of one of the components, or as a separate git repository by itself.

The first approach applies if it’s a very simple project, maybe just one component. If it contains only the code you wrote, and possibly some ancillary containers like the database, then you’ll put the compose.yaml into the root directory of the main git repository. This also applies if your project consists of multiple components maintained by you, but it’s obvious which one is the main one (usually the most complex one).
Like if you have a backend container (e.g. Python wsgi), a frontend container (statically compiled HTML/JS, hosted by an nginx), a db container (standard PostgreSQL), maybe a cache, and some helper daemon (another Python project). Three of these are maintained by you, but the main one is the backend, so that’s where the compose.yaml lives.
For complex projects it makes sense to create a dedicated git repository that only hosts the compose file and associated files. This specifically applies if the compose file needs to be accompanied by additional configuration files to be mounted into the containers. These usually do not belong in your application’s git repository.

The idea here is that the main compose.yaml file (using includes is allowed) handles all the basic configuration and setup of the project, independent of the environment. Doing a docker compose up -d should bring up the project in some default state configured for a default environment (e.g. staging). Additional environment specific configuration should be placed in a compose.override.yaml file, which is not checked into git and which contains all the modifications necessary for a specific environment. Usually this will only set environment variables such as URL paths and API keys.

Additional points of note:

All containers should be configured read_only: true, possibly assisted by tmpfs: ["/run","/var/run/someapp"] or similar. If that’s not possible, go yell at the container image creator².
All configuration that is mounted from the outside should be mounted read-only
Data paths are handled by volumes
The directory name is the compose project name. That’s how you get the ability to deploy more than one instance of a project on the same host. The directory name should be short and to the point (e.g. frobnicator or maybe frobnicator-staging).
Ports in the main compose.yaml file are a problem, since port numbers are a global resource. A useful pattern is to not specify a port binding in the compose.yaml file and instead rely on compose.override.yaml for each deployment to specify a unique port for this deployment. That’s one of the few cases where it’s acceptable to absolutely require a compose.override.yaml for correct operation, and it must be noted in the README.

Putting It All Together

This example shows how to set up deployment for project transmogrify/frobnicator, hosted on gitlab at git.example.com, with the registry accessible as registry.example.com, to host deploy-host, using non-root (but still docker daemon capable) user deploy-user.

Preliminaries

On deploy-host, we’ll create an SSH public/private key pair to be used as deploy key for the git repository containing the main compose.yaml and configure docker pull access. This probably only needs to be done once for each target host.

ssh-keygen -t ed25519

Just hit enter for default filename (~/.ssh/id_ed25519) and no passphrase. Take the resulting public key (in ~/.ssh/id_ed25519.pub) and configure it in gitlab as a read-only deploy key for the project containing the compose.yaml (under https://git.example.com/transmogrify/frobnicator/-/settings/repository).

We’ll also need a deploy token for docker registry access. This should be scoped to access all necessary projects. In general this means you’ll want to keep all related projects in a group and create a group access token under https://git.example.com/groups/transmogrify/-/settings/access_tokens. Create the access token with a name of deploy-user@deploy-host, role Reporter and Scope read_registry.

Caveat 1: docker can only manage one set of login credentials per registry host. Either use non-privileged/user-space docker daemons separated by project (e.g. with different users on the deploy host, each one only managing one project), which is a topic of a different blog post. Or use a “Personal” Access Token for a global technical user which has access to all the necessary projects instead.
(There’s a third option: Create multiple .docker/config.json files and set the DOCKER_CONFIG environment variable accordingly. This violates the “docker compose up should just work” requirement.)

Caveat 2: docker really doesn’t want to store login credentials at all. There’s a couple of layers of stupidity here. Just do the following (note: this will overwrite all previously saved docker logins on this host, but you shouldn’t have any):

mkdir -p ~/.docker
echo '{"auths": {"registry.example.com": {"auth": ""}}}' > ~/.docker/config.json

Then you can do a normal login with the deploy token and it’ll work:

docker login registry.example.com

First deployment / Setup

Clone the repository and configure any overrides necessary. Then start the application.

git clone ssh://git@git.example.com/transmogrify/frobnicator.git
cd frobnicator
vi compose.override.yaml  # Or whatever is necessary
docker compose up -d

Your project should now be running. Finish any remaining steps (set up reverse proxy etc.) and debug whatever mistakes you made.

Set up for autodeploy

1. On deploy-host (or a developer laptop) generate another SSH key. We’re not going to keep it for very long, so do it like this:

ssh-keygen -t ed25519 -f pull-key  # Hit enter a couple of times for no passphrase

2. Also retrieve the SSH host keys from deploy-host (possibly through another channel):

ssh-keyscan deploy-host

3. On deploy-host, add the following line to ~deploy-user/.ssh/authorized_keys (where ssh-ed25519 AAA... is the contents of pull-key.pub from step 1):

restrict,command="cd /home/deploy-user/frobnicator; docker compose pull && docker compose up -d" ssh-ed25519 AAA....

4. In gitlab, on group level, configure variables (https://git.example.com/groups/transmogrify/-/settings/ci_cd):

Name	Value	Settings
`SSH_AUTH`	`-----BEGIN OPENSSH PRIVATE ...` (contents of `pull-key` from step 1)	File, Protected
`SSH_KNOWN_HOSTS`	results of step 2
`SSH_DEPLOY_TARGET`	`ssh://deploy-user@deploy-host`

You may use the environments feature of gitlab here, which will generally mean a different set of values per environment (and then choosing the environment in the job in the next step). Afterwards, delete the temporary files pull-key and pull-key.pub from step 1.

5. In your project’s or projects’ .gitlab-ci.yml file (this is in the code projects, not necessarily the project containing compose.yaml), add this (the publish docker job is outside of the scope of this post):

stages:
  - build
  - deploy

# ...

deploy docker:
  image: docker:git
  stage: deploy
  cache: []
  needs:
    - publish docker
  before_script: |
    mkdir -p ~/.ssh
    echo "${SSH_KNOWN_HOSTS}" > ~/.ssh/known_hosts
    chmod -R go= ~/.ssh "${SSH_AUTH}"
  script: ssh -i "${SSH_AUTH}" -o StrictHostKeyChecking=yes "${SSH_DEPLOY_TARGET}"

In order to handle multiple environments, you can also add

deploy docker:
  # ...
  rules:
    - if: $CI_COMMIT_BRANCH == "dev"
      variables:
        ENVIRONMENT: staging
    - if: "$CI_COMMIT_BRANCH =~ /^(master|main)$/"
      variables:
        ENVIRONMENT: production
  environment: "${ENVIRONMENT}"

Voila. Every time after a docker image has been built, a gitlab runner will now trigger a docker compose pull/up, with minimal security impact since that’s the only thing it can do.

Addenda

The preliminaries and initial setup can be automated with Ansible.

You can put the gitlab-ci configuration into a common template file that can be referenced from all projects. For example, we have a common tools/ci repository, so the only thing necessary to get auto deployment is to put

stages:
  - publish
  - deploy

include:
  - project: tools/ci
    file: docker-ssh.yml

into a project and do the deploy-host setup and variable definitions (well, and add the other include files that handle the actual docker image building).

man authorized_keys, section “AUTHORIZED_KEYS FILE FORMAT” ↩︎
Their image is bad and they should feel bad. ↩︎

Should I Use JWTs For Authentication Tokens?

No.

Not satisfied? Fine, fine. I’ll write a longer answer.

Let’s talk about what we’re talking about. JWT stands for JSON Web Token, a reasonably well defined standard for authenticated tokens. Specifically, a JWT has a header with format information, a payload, and a signature or message authentication code. The core idea is that whoever has the corresponding verification key can verify that the payload is authentic and has not been altered. What they do with that information is up to them.

The JWT spec (RFC 7519) makes suggestions by providing a few well-known registered claim names: issuer, audience, subject, expiration time, etc. A common usage pattern is that, after verifying the authenticity against whatever trust relationship they have with the issuer, the recipient checks whether they are the intended audience (if any is specified) and whether the expiration time has not yet passed, and then takes the subject as the authenticated identity of the bearer of the token.
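To make that usage pattern concrete, here is a minimal sketch using only the Python standard library. It assumes a shared HMAC key (the HS256 algorithm) and a single-string audience claim. The function names issue and verify are made up for this illustration, and a real JWT library handles many more cases (asymmetric keys, audience lists, clock skew).

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url without padding, the encoding used for all three JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue(claims: dict, key: bytes) -> str:
    """Build header.payload.mac with an HMAC-SHA256 ("HS256") authentication code."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [b64url(json.dumps(part).encode("utf-8")) for part in (header, claims)]
    signing_input = ".".join(segments)
    mac = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(mac)


def verify(token: str, key: bytes, audience: str, now: Optional[float] = None) -> str:
    """Return the subject if the token is authentic, meant for us, and not expired."""
    header_b64, payload_b64, mac_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    # Authenticity first: nothing in the token is trusted before this check passes.
    if not hmac.compare_digest(expected, b64url_decode(mac_b64)):
        raise ValueError("signature mismatch")
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(b64url_decode(payload_b64))
    # Audience is only checked if the issuer specified one.
    if "aud" in claims and claims["aud"] != audience:
        raise ValueError("not the intended audience")
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims["sub"]


key = b"shared-secret"  # hypothetical key, for the demo only
token = issue({"iss": "auth", "aud": "frobnicator", "sub": "alice", "exp": 2000}, key)
print(verify(token, key, "frobnicator", now=1000))  # prints "alice"
```

Note that nothing in verify can tell whether alice has logged out or been blocked since the token was issued.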

It’s perfectly designed for bearer token authentication! Or is it? Let me be clear: JWTs as authentication tokens are constructed for Google/Facebook scale environments, and absolutely no one who is not Google/Facebook needs to put up with the ensuing tradeoffs. If you process fewer than 10k requests per second, you’re not Google nor are you Facebook.

The core benefit, proponents will tell you, is that the recipient of a JWT doesn’t need to connect to the user database to verify the token authenticity and render its service. In a large installation, like Google’s, that means that the JWT issuer, the authentication service, can be a dedicated service that is managed and scaled like other services, and is the only service that needs to access the centralized user database. All other services can act on the information stored in the JWT alone, and don’t need to go through the user database, which would represent a choke point.

What about logout/session invalidation? Well, in order for this model to work, the authentication token should have a fairly short lifetime. Maybe 5 minutes, max. The client is also issued a second token, the so-called refresh token, with which it can request a new authentication token from the authentication service. This gives the authentication service a chance to consult the user database to see whether the user or a specific session has been blocked in the meantime.

Here’s the twist that is rarely, if ever, spelled out: In this setup the refresh token, not the authentication token, is the real session token. The refresh token represents the session with the authentication service (which can be revoked), while the authentication tokens are just derived credentials to be used for a few requests at most. The beauty, from Google’s point of view, is that this delegates keeping the session alive to the client, i.e. not Google’s servers. Oh and by the way, the refresh token can be, and usually is, opaque, since it’s only ever consumed by the same service that creates it. That reduces a lot of complexity, by just using an opaque identifier stored in a database.

Now, let’s assume you are not Google. Check which of these apply to you:

You wanted to implement log-out, so now you’re keeping an allowlist of valid JWTs, or a denylist of revoked JWTs. To check this you hit the database on each request.
You need to be able to block users entirely, so you check a “user active” flag in the database. You hit the database on each request.
You need additional relationships between the user object and other objects in the database. You hit the database on each request.
Your service does anything at all with data in the database. You hit the database on each request.

Congratulations, if you confirmed any of the items above, you don’t need JWTs. You’re hitting the database anyway, and I’m pretty sure that you only have one database which stores both your user profiles and your application data. By just using a “normal” opaque session token and storing it in the database, the same way Google does with the refresh token, and dropping all JWT authentication token nonsense, you stand to reap these great benefits:

No weird workarounds (allow/denylist) for shortcomings of JWT as authentication token
Greatly reduced complexity. No need to manage a secure JWT signing/authentication key
You get to pass on some interesting bugs.

Just use the normal session mechanism that comes with your web framework and that you were using before someone told you that Google uses JWT. It has stood the test of time and is probably fine.
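If you’re curious what such an opaque session token boils down to, here is a rough sketch with SQLite from the Python standard library. The table layout and the function names (log_in, authenticate, log_out) are assumptions made for this illustration. Your framework’s session mechanism additionally takes care of expiry, cookies and the like, so use that and not this.

```python
import hashlib
import secrets
import sqlite3
from typing import Optional

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE session (token_hash TEXT PRIMARY KEY, user_id INTEGER NOT NULL)"
)


def _digest(token: str) -> str:
    # Only a hash is stored, so a leaked table does not contain usable tokens.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def log_in(user_id: int) -> str:
    """Create a session and return the opaque token to hand to the client."""
    token = secrets.token_urlsafe(32)
    db.execute("INSERT INTO session VALUES (?, ?)", (_digest(token), user_id))
    return token


def authenticate(token: str) -> Optional[int]:
    """One indexed lookup per request, in the database you are querying anyway."""
    row = db.execute(
        "SELECT user_id FROM session WHERE token_hash = ?", (_digest(token),)
    ).fetchone()
    return row[0] if row else None


def log_out(token: str) -> None:
    """Log-out is a DELETE. No allowlist, no denylist, no waiting for expiry."""
    db.execute("DELETE FROM session WHERE token_hash = ?", (_digest(token),))
```

Blocking a user entirely works the same way: delete all their rows from the session table.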

If you need something to do to make you feel like you’re running a big deployment, you can probably configure your session mechanism to use ~~redis~~valkey to store the session data. You’re still going to use the authenticated user id to query the database, but for unauthenticated requests it may be faster/use less resources. It might not be. You’ll have to tune and measure that.

On The End of Bitcoin

[This is a repost of a thread of mine from a dead social media platform. The original posting date was May 24th, 2022. I have not added new content.]

Is there some scientific literature/scenario planning on the End of Bitcoin? With the next two halvings mining income will shift significantly from block reward to transaction fees. Even though this has been going on for some time, we’re now closing in on the point where …

… the block reward is lower than the energy costs (exponential functions, yeah). This must lead to a significant rise in transaction fees (as in: factor 20). This will significantly discourage Bitcoin usage, which will negatively affect price.

Low price and low block reward means that miners will switch off mining capacity. Now, the difficulty will adapt, but this makes the network brittle. The reason for this: A mining operation consists of setup costs, access to electricity, and operational costs.

A switched-off mining rig already has the first two components available (and, likely, amortized), so it becomes standby. It is mining capacity that exists, and can be used at a moment’s notice, but is invisible to the difficulty regulation algorithm.

I see several different outcomes. An unstable equilibrium could arise, where capacity is switched on and off based on market rates. But, at some point between here and there, there could be a point …

… where a 51% attack (due to low difficulty and high standby capacity) is both feasible and lucrative. This would completely destroy trust in the system (since it can be repeated at any time) and crater the price, effectively ending Bitcoin.

In the meantime, high transaction fees making it unattractive to trade Bitcoin are also a problem for the miners, who are paid in Bitcoin. If no one wants to buy, they can’t pay their electricity bills. Now, the Lightning people will say this is no problem on Lightning.

But even so you’ll still need a couple of basic transactions to maintain the Lightning network. You know who doesn’t need to pay Bitcoin transaction fees? Miners. They would take their earnings (in Bitcoin) and convert them into Lightning and operate as Lightning hubs.

So another end result (though brittle, see above) has the current Bitcoin mining titans operating as quasi-centralized Lightning brokers.

This would theoretically be feasible indefinitely, but at some point, I hope, we will get the 5 companies that are running Bitcoin/Lightning by then to simply switch off Bitcoin and use a normal database, effectively turning them into regular (and: regulated) banks.

Understanding Common Table Expressions in SQL

One of the lesser known features of modern SQL is so-called “Common Table Expressions” (CTE) or “WITH queries”. I’ll explain the mental model that helped me make sense of them, and how to use them to execute recursive queries. Afterwards I’ll show how to apply these techniques in Django.

Syntactically a CTE consists of one or more statements marked with WITH and placed before the statement they relate to, within the same SQL query. Conceptually these additional statements behave as if defining a view, or temporary table(s), that is valid only within this one SQL query.

The intended use is for simplifying complex or repeated operations and pulling them out of the main query. Let’s say you have normalized your database beyond all reason and have the following schema for storing names¹:

CREATE TABLE name (
    id bigint NOT NULL,
    name character varying(50) NOT NULL,
    CONSTRAINT name_pkey PRIMARY KEY (id)
);
CREATE TABLE person (
    id bigint NOT NULL,
    first_name_id bigint NOT NULL,
    last_name_id bigint NOT NULL,
    CONSTRAINT person_pkey PRIMARY KEY (id),
    CONSTRAINT first_name FOREIGN KEY (first_name_id)
        REFERENCES name (id),
    CONSTRAINT last_name FOREIGN KEY (last_name_id)
        REFERENCES name (id)
);

Given this schema, you’d have to use something like SELECT CONCAT(first.name, ' ', last.name) everywhere you wanted a full name, together with a join along the foreign keys.

Even this small example becomes tiresome pretty fast. It’s worse for more complex cases and gets even more complicated when you consider computed or aggregate functions.

The WITH statement lets you extract the complications from your query and get them over with first. A query for full name substrings could look like this:

WITH full_name (id, name) AS (
  SELECT
    p.id AS id,
    CONCAT(first.name, ' ', last.name) AS name
  FROM
    person p
    JOIN name first ON first.id = p.first_name_id
    JOIN name last ON last.id = p.last_name_id
)
SELECT id, name FROM full_name
WHERE name LIKE '%om Ridd%';

This behaves as if a temporary table named full_name with columns id, name is created and filled with the results from the first SELECT statement (the CTE), just before executing the second, main, SELECT statement. In the main SELECT you do not need to worry about the JOINs or other details from the inside of the CTE. It will appear as if it were a normal table.

Multiple CTE in one query are possible, and neither the CTE nor the main query are limited to SELECT. The PostgreSQL documentation has an example with DELETE … RETURNING and INSERT INTO … SELECT that moves rows between tables.

But so far, this is only syntactic sugar, an easier way to write something you can also express otherwise. You could achieve the same result with a (possibly materialized) view, though you’d have to modify the database schema. Or you could include everything from the CTE in your main query, but have to take great care not to trip up when using DISTINCT or GROUP BY. Or you have an ORM that will handle all the complicated queries for you and will allow reuse of complex expressions.

Then everything changes when you learn about WITH RECURSIVE. The magic consists of two building blocks: First, a WITH RECURSIVE statement may refer to itself. That is, in the SELECT that fills data into your imaginary temporary table you can read data from the imaginary temporary table. This, in itself, is not very useful, since the table starts out empty. The second building block is extending the CTE SELECT to include data from the start.

The canonical use case of a recursive query looks like this:

WITH RECURSIVE my_cte AS (
  SELECT ... /* base case*/
    FROM somewhere_else
    WHERE ...
  UNION
  SELECT ... /* recursive step */
    FROM somewhere_else JOIN my_cte
    WHERE ...
)
SELECT * FROM my_cte;

The two SELECT statements in the CTE perform the functions of base case and recursive step in the definition of recursion. The semantics of WITH RECURSIVE is such that it repeats the query, storing results in the imaginary temporary table, until no new rows are being added. In other words, the stop condition is implicitly fixed and cannot be altered. You have to make sure that it terminates eventually, for example with a LIMIT clause.
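To make the evaluation loop explicit, here is a rough model of it in Python. This is a sketch of the semantics as I understand them from the PostgreSQL documentation, not how any database actually implements it. The name with_recursive and the representation of rows as tuples are my own choices for illustration.

```python
from typing import Callable, Iterable, Set, Tuple

Row = Tuple


def with_recursive(
    base: Iterable[Row], step: Callable[[Set[Row]], Iterable[Row]]
) -> Set[Row]:
    """Evaluate base UNION step, repeating step until it adds no new rows."""
    result = set(base)     # the imaginary temporary table, filled by the base case
    working = set(result)  # rows that were new in the previous round
    while working:
        # The recursive step only gets to see the rows added in the previous round ...
        candidates = set(step(working))
        # ... and UNION throws away everything that is already in the table.
        working = candidates - result
        result |= working
    return result


# Count from 1 to 5. This eager model has no LIMIT to cut the loop short,
# so the stop condition has to live in the recursive step.
numbers = with_recursive([(1,)], lambda rows: [(n + 1,) for (n,) in rows if n < 5])
print(sorted(numbers))  # [(1,), (2,), (3,), (4,), (5,)]
```

The loop ends as soon as a round produces nothing new, which is exactly the implicit stop condition described above.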

An artificial example would be to create a sequence of numbers:

WITH RECURSIVE numbers (n) AS (
  SELECT 1 AS n
  UNION
  SELECT numbers.n + 1 AS n
    FROM numbers
)
SELECT * FROM numbers LIMIT 10;

This example shows that the base case needn’t necessarily come from another table. You can also see that in most simple cases the final, main, SELECT is just a dummy SELECT * (in this case with an added LIMIT) because most of the heavy lifting is being done in the CTE.

Recursive CTEs allow SQL to do things not possible without. A prime example is operating on trees in adjacency list form. This is the simplest and most obvious way to represent hierarchical data, but without recursive CTEs it’s not possible to directly express some common queries², such as retrieving all ancestors of a node.

To illustrate here’s an example with the required basics of a simple hierarchical tree structure:

CREATE TABLE node (
    id bigint NOT NULL,
    parent_id bigint,
    name character varying (50) NOT NULL,
    CONSTRAINT node_pkey PRIMARY KEY (id),
    CONSTRAINT parent FOREIGN KEY (parent_id)
        REFERENCES node (id)
);

Every node has a name and an optional foreign key reference to its parent node. It’s very easy to query for a node’s parent node, or all child nodes of a specific node. All other queries are more complicated, or impossible, without a recursive CTE. Let’s take this fictional table:

id	name	parent_id
1	Enoch	NULL
2	Methuselah	1
3	Lamech	2
4	Noah	3
5	Shem	4
6	Ham	4
7	Japeth	4

To get all children of Noah:

SELECT child.name
  FROM node child
    JOIN node parent ON child.parent_id = parent.id
  WHERE parent.name = 'Noah';

To get Lamech’s father:

SELECT parent.name
  FROM node child
    JOIN node parent ON child.parent_id = parent.id
  WHERE child.name = 'Lamech';

You could extend the queries with additional JOINs to also handle grandparents, great-grandparents and so on, but it’s impossible in normal SQL to handle arbitrarily long chains of ancestors or descendants. Now consider this recursive query for all of Japeth’s ancestors:

WITH RECURSIVE ancestors (id, name, parent_id) AS (
  SELECT id, name, parent_id
    FROM node
    WHERE name = 'Japeth'
  UNION
  SELECT parent.id, parent.name, parent.parent_id
    FROM node parent
      JOIN ancestors child ON parent.id = child.parent_id
)
SELECT name FROM ancestors;

The second SELECT is a bit confusing. Here’s what happens: At first the imaginary temporary table contains only one row, corresponding to Japeth, the starting point for the following steps. The second SELECT then, for every node already in ancestors (now aliased to child, because these are all children considered so far), finds its parent from node and adds it to the ancestors.

So at every step the ancestors imaginary temporary table contains a set of all parents discovered so far, and all their parents (that is, all nodes that consider the existing nodes their child) are added until no more new parents are discovered.

A common variant is to include a path length variable to quantify the degree of relationship (and for example also be able to exactly query for all paths of a specific length). Another technique is to not pass through the entire row in the CTE, but only operate on the primary keys and then JOIN for the remaining columns later. Let’s look at an example for all descendants of Methuselah:

WITH RECURSIVE descendants (id, n) AS (
  SELECT id, 0
    FROM node
    WHERE name = 'Methuselah'
  UNION
  SELECT child.id, parent.n + 1
    FROM node child
      JOIN descendants parent ON parent.id = child.parent_id
)
SELECT descendants.n, node.name
  FROM descendants
    LEFT JOIN node ON descendants.id = node.id;

You should see a common pattern to keep in mind: In the second SELECT the node object (called child) conceptually matches our CTE name (descendants), while the CTE reference is the reverse (parent)! We’re adding a node child for every parent already in the CTE.
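You don’t need a PostgreSQL server to play with this: SQLite, which ships with Python, also understands WITH RECURSIVE. The following sketch loads the fictional table from above and runs the descendants query with a depth column. The helper name descendants_of is made up, and I’ve added an ORDER BY so that the output order is stable.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE node ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES node (id),"
    " name TEXT NOT NULL)"
)
db.executemany(
    "INSERT INTO node (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "Enoch", None),
        (2, "Methuselah", 1),
        (3, "Lamech", 2),
        (4, "Noah", 3),
        (5, "Shem", 4),
        (6, "Ham", 4),
        (7, "Japeth", 4),
    ],
)

QUERY = """
WITH RECURSIVE descendants (id, n) AS (
  SELECT id, 0
    FROM node
    WHERE name = ?
  UNION
  -- n counts the steps from the starting node, so it is derived from the parent's n
  SELECT child.id, parent.n + 1
    FROM node child
      JOIN descendants parent ON parent.id = child.parent_id
)
SELECT descendants.n, node.name
  FROM descendants
    JOIN node ON descendants.id = node.id
  ORDER BY descendants.n, node.id
"""


def descendants_of(name):
    """Return (depth, name) rows for the named node and everything below it."""
    return db.execute(QUERY, (name,)).fetchall()


print(descendants_of("Methuselah"))
# [(0, 'Methuselah'), (1, 'Lamech'), (2, 'Noah'), (3, 'Shem'), (3, 'Ham'), (3, 'Japeth')]
```

Filtering the outer SELECT on descendants.n then gives you, for example, only the great-grandchildren.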

Bonus Round: CTEs in Django

The django-cte package allows using CTEs in the Django ORM with the normal query API. The equivalent Django code for the last example looks like this:

# models.py
from django.db import models
from django_cte import CTEManager

class Node(models.Model):
  objects = CTEManager()
  id = models.AutoField(primary_key=True)
  name = models.CharField(max_length=50, null=False)
  parent = models.ForeignKey("self", null=True, on_delete=models.CASCADE)

from django.db.models import Value, IntegerField
from django_cte import With
from .models import Node

descendants = With.recursive(
  lambda cte: Node.objects.filter(
      name="Methuselah"
    ).values(
      "id",
      n=Value(0, output_field=IntegerField())
    ).union(
      cte.join(Node, parent_id=cte.col.id).values(
        "id",
        n=cte.col.n + Value(1, output_field=IntegerField())
      )
    )
)

descendants_set = descendants.join(
  Node, id=descendants.col.id
).with_cte(descendants).annotate(
  n=descendants.col.n
)

A few observations to keep in mind:

In Django all database expressions need to start with a QuerySet.
The django_cte.With object wraps a QuerySet to allow additional methods, including with_cte() which adds the common table expression. Remember that in SQL the CTE goes before the main query (the QuerySet), which might be confusing here.
In order to map the self-referential nature of WITH RECURSIVE to Python syntax, django_cte.With.recursive() takes a callable that produces a QuerySet. The callable receives a reference to the CTE.
django_cte.With.recursive() needs to be JOINed to the underlying Model to be useful. You also need to .annotate() any computed columns from your CTE to use them outside of the CTE.

Don’t. Instead, please read Falsehoods Programmers Believe About Names ↩
Workarounds include nested sets and materialized paths, but with additional requirements on the data structures and some maintenance cost on updates ↩

An Efficient Multi-Stage Build for Python Django in Docker

We’ve recently begun dockerizing our applications in an effort to make development and deployment easier. One of the challenges was establishing a good baseline Dockerfile which can maximize the benefits of Docker’s caching mechanism and at the same time provide minimal application images without any superfluous contents.

The basic installation flow for any Django project (let’s call it foo) is simple enough:

Preconditions

The foo project has a Django settings module, which contains suitable default settings, especially with regards to the database connection.
A requirements.txt file lists all project dependencies.
The pip command should be executed in a Python virtual environment (or it may be executed as root in a Docker container).

export DJANGO_SETTINGS_MODULE=foo.settings
pip install -r requirements.txt
python manage.py collectstatic
python manage.py compilemessages
python manage.py migrate

(Note: In this blog post we’ll mostly ignore the commands to actually get the Django project running within a web server. We’ll end up using gunicorn with WSGI, but won’t comment further on it.)

This sequence isn’t suitable for a Dockerfile as-is, because the final command in the sequence creates the database within the container image. Except for very specific circumstances this is likely not desired. In a normal deployment the database is located either on a persistent volume mounted from outside, or in another container completely.

First lesson: The Django migrate command needs to be part of the container start script, as opposed to the container build script. It’s harmless/idempotent if the database is already fully migrated, but necessary on the first container start, and on every subsequent update that includes database migrations.

Baseline Dockerfile

A naive Dockerfile and accompanying start script would look like this:

Preconditions

A requirements.txt with all required Python packages to install exists, as well as a foo.wsgi file to load the WSGI application.

# Dockerfile
FROM python:slim
ENV DJANGO_SETTINGS_MODULE foo.settings
RUN mkdir -p /app
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt gunicorn
RUN python manage.py collectstatic
RUN python manage.py compilemessages
ENTRYPOINT ["/app/docker-entrypoint.sh"]

#!/bin/sh
# docker-entrypoint.sh
# Abort instead of starting gunicorn if the migration fails
set -e
cd /app
python manage.py migrate
exec gunicorn --bind '[::]:80' --worker-tmp-dir /dev/shm --workers "${GUNICORN_WORKERS:-3}" foo.wsgi:application

(The --worker-tmp-dir bit is a workaround for the way Docker mounts /tmp. See Configuring Gunicorn for Docker.)

This approach does work, but has two drawbacks:

Large image size. The entire source checkout of our application will be in the final docker image. Also, depending on the package requirements we may need to apt-get install a compiler or development package before executing pip install. These will then also be in the final image (and on our production machine).
Long re-build time. Any change to the source directory will invalidate the Docker cache starting with line 6 in the Dockerfile. The pip install will be executed fully from scratch every time.

(Note: We’re using the slim Python docker image. The alpine image would be even smaller, but its use of the musl C library breaks some Python modules. Depending on your dependencies you might be able to swap in python:alpine instead of python:slim.)

Improved Caching

Docker caches all individual build steps, and can use the cache when the same step is applied to the same current state. In our naive Dockerfile all the expensive commands are dependent on the full state of the source checkout, so the cache cannot be used after even the tiniest code change.

The common solution looks like this:

# Dockerfile
FROM python:slim
ENV DJANGO_SETTINGS_MODULE foo.settings
RUN mkdir -p /app
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt gunicorn
COPY . .
RUN python manage.py collectstatic
RUN python manage.py compilemessages
ENTRYPOINT ["/app/docker-entrypoint.sh"]

In this version the pip install command on line 7 can be cached until the requirements.txt or the base image change. Re-build time is drastically reduced, but the image size is unaffected.
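If you want to convince yourself why the order of the COPY steps matters, here is a toy model of the cache in Python. It is a deliberate simplification and not Docker’s actual implementation: I assume that a step’s cache key depends on the previous step’s key, the instruction text, and the contents of any files the step copies from the build context. All names in it are made up.

```python
import hashlib
from typing import Dict, List, Tuple

# (instruction, files the instruction reads from the build context)
Step = Tuple[str, List[str]]


def cache_keys(steps: List[Step], context: Dict[str, str]) -> List[str]:
    """Chain the keys: every step's key includes the key of the step before it."""
    keys = []
    parent = "base-image"
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        for path in sorted(inputs):
            h.update(path.encode())
            h.update(context[path].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def rebuilt_steps(steps: List[Step], before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Instructions whose cache key changed between two states of the checkout."""
    old, new = cache_keys(steps, before), cache_keys(steps, after)
    return [instruction for (instruction, _), a, b in zip(steps, old, new) if a != b]


ALL_FILES = ["requirements.txt", "foo/views.py"]
NAIVE = [
    ("COPY . .", ALL_FILES),
    ("RUN pip install -r requirements.txt gunicorn", []),
    ("RUN python manage.py collectstatic", []),
]
IMPROVED = [
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt gunicorn", []),
    ("COPY . .", ALL_FILES),
    ("RUN python manage.py collectstatic", []),
]

before = {"requirements.txt": "Django", "foo/views.py": "version 1"}
after = {**before, "foo/views.py": "version 2"}  # a tiny code change

print(rebuilt_steps(NAIVE, before, after))     # all three steps, including pip install
print(rebuilt_steps(IMPROVED, before, after))  # only COPY . . and collectstatic
```

Because the keys are chained, a changed step drags everything after it along, which is why the expensive step should come before the volatile one.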

Building with setup.py

If we package up our Django project as a proper Python package with a setup.py, we can use pip to install it directly (and could also publish it to PyPI).

If the setup.py lists all project dependencies (including Django) in install_requires, then we’re able to execute (for example in a virtual environment):

pip install .

This will pre-compile and install all our dependencies, and then pre-compile and install all our code, and install everything into the Python path. The main difference to the previous versions is that our own code is pre-compiled too, instead of just executed from the source checkout. There is little immediate effect from this: The interpreter startup might be slightly faster, because it doesn’t need to compile our code every time. In a web-app environment this is likely not noticeable.

But because our dependencies and our own code are now properly installed in the same place, we can drop our source code from the final container.

(We’ve also likely introduced a problem with non-code files, such as templates and graphics assets, in our project. They will by default not be installed by setup.py. We’ll take care of this later.)

Due to the way Docker works, all changed files of every build step cumulatively determine the final container size. If we install 150MB of build dependencies, 2MB of source code and docs, generate 1MB of pre-compiled code, then delete the build dependencies and source code, our image has grown by 153MB.

This accumulation is per step: Files that aren’t present after a step don’t count towards the total space usage. A common workaround is to stuff the entire build into one step. This approach completely negates any caching: Any change in the source files (which are necessarily part of the step) also requires a complete redo of all dependencies.
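The arithmetic can be spelled out with a small Python model. It is only a sketch under the assumption that every step records the files that exist at its end, and that deleting a file in a later step frees nothing. The names image_growth and squash are invented for this example.

```python
from typing import Dict, List, Set, Tuple

# One build step: files it leaves behind (name -> size in MB) and files it deletes.
Step = Tuple[Dict[str, int], Set[str]]


def image_growth(steps: List[Step]) -> int:
    """Every step adds what it created; deletions in later steps free nothing."""
    return sum(sum(created.values()) for created, _ in steps)


def squash(steps: List[Step]) -> Step:
    """Run several steps as a single one: only files that survive to the end count."""
    files: Dict[str, int] = {}
    for created, deleted in steps:
        files.update(created)
        for name in deleted:
            files.pop(name, None)
    return files, set()


STEPS: List[Step] = [
    ({"build dependencies": 150}, set()),
    ({"source code and docs": 2}, set()),
    ({"pre-compiled code": 1}, set()),
    ({}, {"build dependencies", "source code and docs"}),
]

print(image_growth(STEPS))            # 153
print(image_growth([squash(STEPS)]))  # 1
```

The second number is what the everything-in-one-step workaround buys you, at the price of losing the cache.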

Enter multi-stage build: At any point in the Dockerfile we’re allowed to use a new FROM step to create a whole new image within the same file. Later steps can refer to previous images, but only the last image of the file will be considered the output of the image build process.

How do we get the compiled Python code from one image to the next? The Docker COPY command has an optional --from= argument to specify an image as source.

Which files do we copy over? By default, pip installs everything into /usr/local, so we could copy that. An even better approach is to use pip install --prefix=... to install into an isolated non-standard location. This allows us to grab all the files related to our project and no others.

Preconditions

The Django configuration must set STATIC_ROOT="/app/static.dist".
A setup.py must be present to properly install our project. All dependency information needs to be in setup.py, no requirements.txt is used.

# Dockerfile
FROM python:slim as common-base

ENV DJANGO_SETTINGS_MODULE foo.settings

# Intermediate image, all compilation takes place here
FROM common-base as builder

RUN pip install -U pip setuptools

RUN mkdir -p /app
WORKDIR /app

RUN apt-get update && apt-get install -y build-essential python3-dev
RUN mkdir -p /install

COPY . .

RUN sh -c 'pip install --no-warn-script-location --prefix=/install .'
RUN cp -r /install/* /usr/local
RUN sh -c 'python manage.py collectstatic --no-input'

# Final image, just copy over pre-compiled files
FROM common-base

RUN mkdir -p /app
COPY docker-entrypoint.sh /app/
COPY --from=builder /install /usr/local
COPY --from=builder /app/static.dist /app/static.dist

ENTRYPOINT ["/app/docker-entrypoint.sh"]

This will drastically reduce our final image size since neither the build-essential packages, nor any of the source dependencies are part of it. However, we’re back to our cache-invalidation problem: Any code change invalidates all caches starting at line 17, requiring Docker to redo the full Python dependency installation.

One possible solution is to re-use the previous trick of copying the requirements.txt first, in isolation, to only install the dependencies. But that would mean we need to manage dependencies in both requirements.txt and setup.py. Is there an easier way?

Multi-Stage, Cache-Friendly Build

The command setup.py egg_info will create a foo.egg-info directory with various bits of information about the package, including a requires.txt that lists the dependencies.

We’ll execute egg_info in an isolated image, copy the requires.txt to a new image (in order to be independent from changes in setup.py other than the list of requirements), then install dependencies using the generated requires.txt. Up to here these steps are fully cacheable unless the list of project dependencies changes. Afterwards we’ll proceed in the usual fashion by copying over the remaining source code and installing it.

(One snag: The generated requires.txt also contains all possible extras listed in setup.py, under bracket separated sections such as [dev]. pip cannot handle that, so we’ll use grep to cut the generated requires.txt at the first blank line.)
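To show what the cut does, here is the same operation as a short Python function applied to an invented example of the generated file (foo.egg-info/requires.txt in the Dockerfile below). The function name and the package names are hypothetical, and the Dockerfile sticks with grep.

```python
from typing import List


def base_requirements(requires_txt: str) -> List[str]:
    """Return the requirement lines that come before the first empty line."""
    requirements = []
    for line in requires_txt.splitlines():
        if line == "":
            # Everything after the first empty line belongs to extras like [dev].
            break
        requirements.append(line)
    return requirements


EXAMPLE = """Django>=4.2
requests

[dev]
pytest
pytest-django
"""

print(base_requirements(EXAMPLE))  # ['Django>=4.2', 'requests']
```

This also shows the caveat: if a production dependency is only listed under an extra, it ends up below the cut and will not be installed.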

Preconditions

The Django configuration must set STATIC_ROOT="/app/static.dist".
All dependencies for production use are in the normal setup.py dependencies, and not in extras.

# Dockerfile
FROM python:slim as common-base

ENV DJANGO_SETTINGS_MODULE foo.settings

FROM common-base as base-builder

RUN pip install -U pip setuptools

RUN mkdir -p /app
WORKDIR /app

# Stage 1: Extract dependency information from setup.py alone
#  Allows docker caching until setup.py changes
FROM base-builder as dependencies

COPY setup.py .
RUN python setup.py egg_info

# Stage 2: Install dependencies based on the information extracted from the previous step
#  Caveat: Expects an empty line between base dependencies and extras, doesn't install extras
# Also installs gunicorn in the same step
FROM base-builder as builder
RUN apt-get update && apt-get install -y build-essential python3-dev
RUN mkdir -p /install
COPY --from=dependencies /app/foo.egg-info/requires.txt /tmp/
RUN sh -c 'pip install --no-warn-script-location --prefix=/install $(grep -e ^$ -m 1 -B 9999 /tmp/requires.txt) gunicorn'

# Everything up to here should be fully cacheable unless dependencies change
# Now copy the application code

COPY . .

# Stage 3: Install application
RUN sh -c 'pip install --no-warn-script-location --prefix=/install .'

# Stage 4: Install application into a temporary container, in order to have both source and compiled files
#  Compile static assets
FROM builder as static-builder

RUN cp -r /install/* /usr/local

RUN sh -c 'python manage.py collectstatic --no-input'

# Stage 5: Install compiled static assets and support files into clean image
FROM common-base

RUN mkdir -p /app
COPY docker-entrypoint.sh /app/
COPY --from=builder /install /usr/local
COPY --from=static-builder /app/static.dist /app/static.dist

ENTRYPOINT ["/app/docker-entrypoint.sh"]

Addendum: Handling data files

When converting your project to be installable with setup.py, you should make sure that you’re not missing any files in the final build. Run setup.py egg_info and then check the generated foo.egg-info/SOURCES.txt for missing files.

A common trip-up is the distinction between Python packages and ordinary directories. By definition a Python package is a directory that contains an __init__.py file (can be empty). By default setup.py only installs Python packages. So make sure you’ve got __init__.py files also on all intermediate directory levels of your code (check in management/commands, for example).
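Hunting for missing __init__.py files by hand gets old quickly, so here is a small helper sketch. The function name missing_init is made up, and it assumes that any directory with .py files in it or somewhere below it is meant to be a package.

```python
import os
from typing import List


def missing_init(package_root: str) -> List[str]:
    """List directories (relative to package_root) that hold code but no __init__.py."""
    contains_code = set()  # directories with .py files in them or somewhere below
    missing = []
    # Bottom-up walk, so that children are classified before their parents.
    for dirpath, dirnames, filenames in os.walk(package_root, topdown=False):
        has_code = any(name.endswith(".py") for name in filenames) or any(
            os.path.join(dirpath, d) in contains_code for d in dirnames
        )
        if has_code:
            contains_code.add(dirpath)
            if "__init__.py" not in filenames:
                missing.append(os.path.relpath(dirpath, package_root))
    return sorted(missing)
```

Calling missing_init("foo") in the project root should return an empty list. An entry such as management means that this intermediate directory needs an (empty) __init__.py.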

If your project uses templates or other data files (not covered by collectstatic), you need to do two things to get setup.py to pick them up:

Set include_package_data=True in the call to setuptools.setup() in setup.py.
Add a MANIFEST.in file next to setup.py that contains instructions to include your data files.
The most straightforward way for a template directory is something like recursive-include foo/templates *

The section on Including Data Files in the setuptools documentation covers this in more detail.

Showing SQL Queries with Pytest and Django

I have a Django based project, and am doing unit tests with py.test. To debug a test failure it’s sometimes useful to see the actual SQL queries that Django emitted, which is surprisingly hard. I assumed that that would be such an obvious and common need, that a simple switch (for pytest-django) or easy plugin would exist to simply output SQL queries as they are executed.

It is a common need alright (1, 2, 3), but the correct solution is surprisingly unwieldy¹.

For one, there is no existing helper or plugin. There are helpers and plugins to count queries and assert a certain query count, which as a side effect track all queries and print the executed queries on query count assertion failure, but I’ve yet to find any case where that would be useful to me. More importantly it’s useless for the exact case here: The stored list of queries is only printed if the expected query count is not matched, not in any other case, such as, say, a failing unit test which you’d want to debug by inspecting the queries that were executed.

Therefore: Fuck it, let’s do it live. Django tracks all queries in the connection object, but in general only if DEBUG=True. For various reasons, tests are executed with DEBUG=False, which is a good thing, since you want to test close to production. Django does provide a context helper to temporarily enable query tracking on a connection which we’ll use instead.

Putting it together, we need to transform a humble test such as

@pytest.mark.django_db
def test_frobnicate_foo(foo):
    assert foo.frobnicate()

into

@pytest.mark.django_db
def test_frobnicate_foo(foo):
    from django.db import connection
    from django.test.utils import CaptureQueriesContext
    with CaptureQueriesContext(connection):
        assert foo.frobnicate(), connection.queries[0]['sql']

in order to see the value of the first SQL query in case of assertion failure.

At some point someone™ should write a generic plugin to do that.
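Short of a real plugin, the pattern can be wrapped in a small helper. The following is only a sketch: the names format_queries and dump_queries_on_failure are my own invention, not part of Django or pytest-django. It relies on CaptureQueriesContext exposing the recorded queries as captured_queries, and prints all of them (not just the first) to stderr, which pytest shows for failing tests.

```python
import contextlib
import sys


def format_queries(queries):
    """Render Django-style query dicts ({'sql': ..., 'time': ...}) as numbered lines."""
    return "\n".join(
        "[{}] ({}s) {}".format(i, q.get("time", "?"), q["sql"])
        for i, q in enumerate(queries, start=1)
    )


@contextlib.contextmanager
def dump_queries_on_failure(connection=None, out=None):
    """Capture SQL inside the block and print all of it if the block raises."""
    # Imported lazily so this module can be imported without Django configured.
    from django.db import connection as default_connection
    from django.test.utils import CaptureQueriesContext

    if connection is None:
        connection = default_connection
    with CaptureQueriesContext(connection) as captured:
        try:
            yield captured
        except Exception:
            # Still inside the capture context, so the query list is current.
            print(format_queries(captured.captured_queries), file=out or sys.stderr)
            raise
```

The test body then shrinks to a with dump_queries_on_failure(): block around the plain assert foo.frobnicate().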

There are several incorrect solutions on StackOverflow, such as the one that starts with “First, subclass TestCase”, which doesn’t apply to py.test, or the ever helpful “try using django-debug-toolbar”, which doesn’t apply to unit tests in general. ↩

Auditing User Intent in Closed Source IoT Applications

(Header image under CC-BY by Gregory Varnum)

Hardware-backed voice assistants like Amazon Alexa and Google Assistant have received some criticism for their handling of voice data behind the scenes. The companies had outsourced quality control/machine learning feedback to external contractors who received voice recordings of user commands and were tasked to improve the assistants’ recognition of voice and intent. This came as a surprise to many users, who only expected their voice commands to be processed by automated systems and not listened to by actual humans.

It is worth noting that the user intent for their voice to be recorded and sent to the cloud was generally not called into question: While the devices listen all the time, their recording and sending only starts after they detect a so-called wake word: “Alexa” or “Hey Google”. There are scattered reports of accidental activation with similar sounds (“Alec, say, what’s up?”), but on balance this part of the system appears to be reasonably robust. Accidental activation is always mitigated by the fact that the devices clearly indicate their current mode: Recognition of the wake word triggers a confirmation sound and LEDs to light up.

New reporting shows how a malicious actor can get around this user intent in a limited fashion: Several sets of bugs in the system design allowed the assistant to stay awake and send recordings to the attacker even when a user might reasonably expect it not to. It is important to note that these bugs are not remote-access vulnerabilities! User intent is still necessary to start the interaction, it’s just that the interaction lasts longer than the user expects. Also, none of the local safeguards against undetected listening are impacted: The LEDs still light up and indicate an attentive assistant.

It is in the companies’ best interest to not be found spying on their users, and the easiest way to achieve that is by not doing it. Amazon, specifically, tries very hard to be seen as privacy-preserving because that enables additional services for them. Their in-home delivery service is absolutely dependent on consumers trusting them to open their doors for the delivery driver (who in turn is instructed not to actually enter the home, but just drop the package right on the other side of the door, and is filmed doing so). Amazon demonstrates their willingness to at least appear privacy-respecting on other fronts too: The microphone-off button on the Amazon Echo devices cuts power to the microphone array and lights up a “mute” LED: it’s impossible to turn on the LED under software control. When the LED is on, the microphone is off.

The primary concern still is an issue of trust: Do I trust the device to only record and transmit audio when I intend for it to do so? In theory the device manufacturer could have the device surreptitiously record everything. There’s no easy way to audit either the device or its connections to the outside world. Some progress has been made to extract and analyse device firmware, but ultimately this cannot rule out a silent firmware update with listening capabilities at a later date.

An Auditing System to Confirm User Intent

This essay proposes a system in which users can gain confidence that they are not surreptitiously monitored, without requiring a device manufacturer to give up any of their proprietary secrets. It assumes cooperation on the part of the manufacturer and a certain technical expertise on the part of at least some of the users.

Step 1: The manufacturer augments their back-end systems to log device activity and TLS session keys, and keeps these logs for a certain number of days.

Step 2: The end user passively records all incoming and outgoing traffic from the device. Obviously only a small percentage of end users will be able to do that and only a fraction of those will actually record the traffic. But since the manufacturer cannot be sure which devices are being monitored, they risk detection if they tamper with any of them.

Step 3: The user requests a list of session keys from the manufacturer and uses it to decrypt the captured connections.

Step 4: The manufacturer provides a machine-readable list of activities of the device, both user-initiated (such as queries to a voice assistant) and automated (such as firmware updates, or normal device telemetry).

Step 5: Analysis software matches the list of device activities to the recorded connections and flags any suspicious activity. The software should be open source, initially provided by the manufacturer, and be extensible by the community at large.

Step 6: The user can cross-check the now-vetted list of device activities to confirm whether it matches their intent.
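To make the matching in step 5 a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the Activity and Connection records, their fields, the slack parameter and the example hostname are hypothetical and do not describe any existing manufacturer format. The idea is simply that every recorded connection must be explained by a declared activity to the same endpoint in the same time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical entry in the manufacturer's activity list (step 4)."""
    start: float  # Unix timestamp
    end: float
    endpoint: str
    kind: str  # e.g. "voice_query", "firmware_update", "telemetry"


@dataclass(frozen=True)
class Connection:
    """Hypothetical summary of one captured and decrypted connection (steps 2 and 3)."""
    start: float
    end: float
    endpoint: str
    bytes_up: int


def flag_unexplained(connections, activities, slack=2.0):
    """Return the connections that no declared activity accounts for."""
    def explains(activity, conn):
        return (
            activity.endpoint == conn.endpoint
            and activity.start - slack <= conn.start
            and conn.end <= activity.end + slack
        )

    return [c for c in connections if not any(explains(a, c) for a in activities)]


activities = [Activity(100.0, 110.0, "voice.example", "voice_query")]
connections = [
    Connection(100.5, 109.0, "voice.example", 40_000),   # matches the query
    Connection(500.0, 560.0, "voice.example", 900_000),  # nothing declared
]
for conn in flag_unexplained(connections, activities):
    print("suspicious:", conn)
```

A real tool would also have to look inside the decrypted payloads and check the reverse direction (declared activities without any traffic), which this sketch leaves out.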

The most impractical step is number 2: Only a few users would bother to configure their networks in a way that allows the device traffic to be monitored. However, I believe that even the possibility of monitoring should deter malicious behaviour. This step is also most easily supported by third-party tools: An OpenWRT extension for example would greatly simplify the recording for users of OpenWRT, and other CPE manufacturers could follow suit¹.

The IoT device manufacturer may want to keep some data — such as firmware update files, or received audio streams — proprietary and secret. They must do so in a manner that allows the analysis tool to confirm that only downstream data is withheld: Either by using a separate, at-rest, encryption layer inside the TLS connection, or by using a separate TLS connection to a special endpoint which carries only the absolutely minimal amount of information (one small HTTP request) in the upstream direction. The analysis tool is then able to ignore the contents of this proprietary data while still being able to flag anomalies in the meta data (“Three 150MB firmware updates in a day? Really?”).
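Such a metadata check does not need to see the payload at all. As a rough sketch (the input format, the function name flag_firmware_bursts and the threshold of one update per day are assumptions of mine), the tool could count opaque firmware transfers per day and report the days that look excessive:

```python
from collections import Counter
from datetime import datetime, timezone


def flag_firmware_bursts(downloads, max_per_day=1):
    """downloads: iterable of (unix_timestamp, size_in_bytes) of opaque transfers.

    Returns (day, count, total_bytes) for every UTC day above the threshold.
    """
    count = Counter()
    total = Counter()
    for timestamp, size in downloads:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        count[day] += 1
        total[day] += size
    return [
        (day, count[day], total[day])
        for day in sorted(count)
        if count[day] > max_per_day
    ]
```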

Rationale: A scheme that forcibly opens up firmware files or DRM’ed audio streams would be a non-starter for industry adoption. Decrypting this downstream content isn’t necessary for the goal of confirming user intent. Conversely all information carried on the upstream channel by definition belongs to the user, since they generated it. (If they didn’t generate it, it wouldn’t need to be transferred.)

Potential for abuse: When suggesting to store new kinds of data (step 1) it’s important to analyze the potential for abuse this data has, be it from law-enforcement agencies or from vengeful ex-partners. I believe no new threat is introduced here: The current backends of manufacturers with voice assistants already store voice recordings and generally give the option to look at the device history or download recordings (both to LEAs and to anyone with account access). The data recorded in step 1 should give no additional insight into the user behaviour beyond what is already recorded under the status quo, except that it makes it possible to confirm the completeness of the log.

This opens up the question of whether one trusts their CPE manufacturer to build correct logging and to not collude with the IoT device manufacturer. ↩