Henryk Plötz

Should I Use JWTs For Authentication Tokens?

No.


Not satisfied? Fine, fine. I’ll write a longer answer.

Let’s talk about what we’re talking about. JWT stands for JSON Web Tokens, a reasonably well defined standard for authenticated tokens. Specifically they have a header with format information, a payload, and a signature or message authentication code. The core idea is that whoever has the corresponding verification key can verify that the payload is authentic and has not been altered. What they do with that information is up to them.

The JWT spec (RFC 7519) makes suggestions by providing a few well-known registered claim names: issuer, audience, subject, expiration time, etc. A common usage pattern is that, after verifying the authenticity against whatever trust relationship they have with the issuer, the recipient checks whether they are the intended audience (if any is specified) and the expiration time has not yet passed, and then take the subject as an authenticated identity of the bearer of the token.

It’s perfectly designed for bearer token authentication! Or is it? Let me be clear: JWT as authentication tokens are constructed for Google/Facebook scale environments, and absolutely no one who is not Google/Facebook needs to put up with the ensuing tradeoffs. If you process less than 10k requests per second, you’re not Google nor are you Facebook.

The core benefit, proponents will tell you, is that the recipient of a JWT doesn’t need to connect to the user database to verify the token authenticity and render its service. In a large installation, like Google’s, that means that the JWT issuer, the authentication service, can be a dedicated service that is managed and scaled like other services, and is the only service that needs to access the centralized user database. All other services can act on the information stored in the JWT alone, and don’t need to go through the user database, which would represent a choke point.

What about logout/session invalidation? Well, in order for this model to work, the authentication token should have a fairly short lifetime. Maybe 5 minutes, max. The client is also issued a second token, the so-called refresh token, with which it can request a new authentication token from the authentication service. This gives the authentication service a chance to consult the user database to see whether the user or a specific session has been blocked in the meantime.

Here’s the twist that is rarely, if ever, spelled out: In this setup the refresh token, not the authentication token, is the real session token. The refresh token represents the session with the authentication service (which can be revoked), while the authentication tokens are just derived credentials to be used for a few requests at most. The beauty, from Google’s point of view, is that this delegates keeping the session alive to the client, i.e. not Google’s servers. Oh and by the way, the refresh token can be, and usually is, opaque, since it’s only ever consumed by the same service that creates it. That reduces a lot of complexity, by just using an opaque identifier stored in a database.

Now, let’s assume you are not Google. Check which of these apply to you:

  • You wanted to implement log-out, so now you’re keeping an allowlist of valid JWTs, or a denylist of revoked JWTs. To check this you hit the database on each request.
  • You need to be able to block users entirely, so you check a “user active” flag in the database. You hit the database on each request.
  • You need additional relationships between the user object and other objects in the database. You hit the database on each request.
  • Your service does anything at all with data in the database. You hit the database on each request.

Congratulations, if you confirmed any of the items above, you don’t need JWTs. You’re hitting the database anyway, and I’m pretty sure that you only have one database which stores both your user profiles and your application data. By just using a “normal” opaque session token and storing it in the database, the same way Google does with the refresh token, and dropping all JWT authentication token nonsense, you stand to reap these great benefits:

  • No weird workarounds (allow/denylist) for shortcomings of JWT as authentication token
  • Greatly reduced complexity. No need to manage a secure JWT signing/authentication key
  • You get to pass on some interesting bugs.

Just use the normal session mechanism that comes with your web framework and that you were using before someone told you that Google uses JWT. It has stood the test of time and is probably fine.

If you need something to do to make you feel like you’re running a big deployment, you can probably configure your session mechanism to use redisvalkey to store the session data. You’re still going to use the authenticated user id to query the database, but for unauthenticated requests it may be faster/use less resources. It might not be. You’ll have to tune and measure that.

On The End of Bitcoin

[This is a repost of a thread of mine from a dead social media platform. The original posting date was May 24th, 2022. I have not added new content.]

Is there some scientific literature/scenario planning on the End of Bitcoin? With the next two halvings mining income will shift significantly from block reward to transaction fees. Even though this has been going on for some time, we’re now closing in on the point where …

… the block reward is lower than the energy costs (exponential functions, yeah). This must lead to a significant rise in transaction fees (as in: factor 20). This will significantly discourage Bitcoin usage, which will negatively affect price.

Low price and low block reward means that miners will switch off mining capacity. Now, the difficulty will adapt, but this makes the network brittle. The reason for this: A mining operation consists of setup costs, access to electricity, and operational costs.

A switched-off mining rig already has the first two components available (and, likely, amortized), so it becomes standby. It is mining capacity that exists, and can be used at a moment’s notice, but is invisible to the difficulty regulation algorithm.

I see several different outcomes. An unstable equilibrium could arise, where capacity is switched on and off based on market rates. But, at some point between here and there, there could be a point …

… where a 51% attack (due to low difficulty and high standby capacity) is both feasible and lucrative. This would completely destroy trust in the system (since it can be repeated at any time) and crater the price, effectively ending Bitcoin.

In the meantime, high transaction fees making it unattractive to trade Bitcoin are also a problem for the miners, who are paid in Bitcoin. If no one wants to buy, they can’t pay their electricity bills. Now, the Lightning people will say this is no problem on Lightning.

But even so you’ll still need a couple of basic transactions to maintain the Lightning network. You know who doesn’t need to pay Bitcoin transaction fees? Miners. They would take their earnings (in Bitcoin) and convert them into Lightning and operate as Lightning hubs.

So another end result (though brittle, see above) has the current Bitcoin mining titans operating as quasi-centralized Lightning brokers.

This would theoretically be feasible indefinitely, but at some point, I hope, we will get the 5 companies that are running Bitcoin/Lightning by then to simply switch off Bitcoin and use a normal database, effectively turning them into regular (and: regulated) banks.

Understanding Common Table Expressions in SQL

One of the lesser known features of modern SQL are so-called “Common Table Expressions” (CTE) or “WITH queries”. I’ll explain the mental model that helped me make sense of them, and how to use them to execute recursive queries. Afterwards I’ll show how to apply these techniques in Django.

Syntactically a CTE consists of one or more statements marked with WITH and placed before the statement they relate to, within the same SQL query. Conceptually these additional statements behave as if defining a view, or temporary table(s), that is valid only within this one SQL query.

The intended use is for simplifying complex or repeated operations and pulling them out of the main query. Let’s say you have normalized your database beyond all reason and have the following schema for storing names1:

CREATE TABLE name (
    id bigint NOT NULL,
    name character varying(50) NOT NULL,
    CONSTRAINT name_pkey PRIMARY KEY (id)
);
CREATE TABLE person (
    id bigint NOT NULL,
    first_name_id bigint NOT NULL,
    last_name_id bigint NOT NULL,
    CONSTRAINT person_pkey PRIMARY KEY (id),
    CONSTRAINT first_name FOREIGN KEY (first_name_id)
        REFERENCES name (id),
    CONSTRAINT last_name FOREIGN KEY (last_name_id)
        REFERENCES name (id)
);

Given this schema, you’d have to use something like SELECT CONCAT(first.name, ' ', last.name) everywhere you wanted a full name, together with a join along the foreign keys.

Even this small example becomes tiresome pretty fast. It’s worse for more complex cases and gets even more complicated when you consider computed or aggregate functions.

The WITH statement lets you extract the complications from your query and get them over with first. A query for full name substrings could look like this:

WITH full (id, name) AS (
  SELECT 
    p.id AS id,
    CONCAT(first.name, ' ', last.name) AS name
  FROM
    person p
    LEFT JOIN name first
    LEFT JOIN name last
  WHERE
    first.id = p.first_name_id,
    last.id = p.last_name_id
)
SELECT id, name FROM full
WHERE name LIKE '%om Ridd%';

This behaves as if a temporary table named full with columns id, name is created and filled with the results from the first SELECT statement (the CTE), just before executing the second, main, SELECT statement. In the main SELECT you do not need to worry about the JOINs or other details from the inside of the CTE. It will appear as if it were a normal table.

Multiple CTE in one query are possible, and neither the CTE nor the main query are limited to SELECT. The PostgreSQL documentation has an example with DELETE … RETURNING and INSERT INTO … SELECT that moves rows between tables.

But so far, this is only syntactic sugar, an easier way to write something you can also express otherwise. You could achieve the same result with a (possibly materialized) view, though you’d have to modify the database schema. Or you could include everything from the CTE in your main query, but have to take great care not to trip up when using DISTINCT or GROUP BY. Or you have an ORM that will handle all the complicated queries for you and will allow reuse of complex expressions.

Then everything changes when you learn about WITH RECURSIVE. The magic consists of two building blocks: First, a WITH RECURSIVE statement may refer to itself. That is, in the SELECT that fills data into your imaginary temporary table you can read data from the imaginary temporary table. This, in itself, is not very useful, since the table starts out empty. The second building block is extending the CTE SELECT to include data from the start.

The canonical use case of a recursive query looks like this:

WITH RECURSIVE my_cte AS (
  SELECT ... /* base case*/
    FROM somewhere_else
    WHERE ...
  UNION
  SELECT ... /* recursive step */
    FROM somewhere_else JOIN my_cte
    WHERE ...
)
SELECT * FROM my_cte;

The two SELECT statements in the CTE perform the functions of base case and recursive step in the definition of recursion. The semantics of WITH RECURSIVE is such that it repeats the query, storing results in the imaginary temporary table, until no new rows are being added. In other words, the stop condition is implicitly fixed and cannot be altered. You have to make sure that it terminates eventually, for example with a LIMIT clause.

An artificial example would be to create a sequence of numbers:

WITH RECURSIVE numbers (n) AS (
  SELECT 1 AS n
  UNION
  SELECT numbers.n + 1 AS n
    FROM numbers
)
SELECT * FROM numbers LIMIT 10;

This example shows that the base case needn’t necessarily come from another table. You can also see that in most simple cases the final, main, SELECT is just a dummy SELECT * (in this case with an added LIMIT) because most of the heavy lifting is being done in the CTE.

Recursive CTEs allow SQL to do things not possible without. A prime example is operating on trees in adjacency list form. This is the simplest and most obvious way to represent hierarchical data, but without recursive CTEs it’s not possible to directly express some common queries2, such as retrieving all ancestors of a node.

To illustrate here’s an example with the required basics of a simple hierarchical tree structure:

CREATE TABLE node (
    id bigint NOT NULL,
    parent_id bigint,
    name character varying (50) NOT NULL,
    CONSTRAINT node_pkey PRIMARY KEY (id),
    CONSTRAINT parent FOREIGN KEY (parent_id)
        REFERENCES node (id)
);

Every node has a name and an optional foreign key reference to its parent node. It’s very easy to query for a node’s parent node, or all child nodes of a specific node. All other queries are more complicated, or impossible, without a recursive CTE. Let’s take this fictional table:

idnameparent_id
1EnochNULL
2Methuselah1
3Lamech2
4Noah3
5Shem4
6Ham4
7Japeth4

To get all children of Noah:

SELECT child.name
  FROM node child
    JOIN node parent ON child.parent_id = parent.id
  WHERE parent.name = 'Noah';

To get Lamech’s father:

SELECT parent.name
  FROM node child
    JOIN node parent ON child.parent_id = parent.id
  WHERE child.name = 'Lamech';

You could extend the queries with additional JOINs to also handle grandparents, great-grandparents and so on, but it’s impossible in normal SQL to handle arbitrarily long chains of ancestors or descendants. Now consider this recursive query for all of Japeth’s ancestors:

WITH RECURSIVE ancestors (id, name, parent_id) AS (
  SELECT id, name, parent_id
    FROM node
    WHERE name = 'Japeth'
  UNION
  SELECT parent.id, parent.name, parent.parent_id
    FROM node parent
      JOIN ancestors child ON parent.id = child.parent_id
)
SELECT name FROM ancestors;

The second SELECT is a bit confusing. Here’s what happens: At first the imaginary temporary table contains only one row, corresponding to Japeth, the starting point for the following steps. The second SELECT then, for every node already in ancestors (now aliased to child, because these are all children considered so far), finds its parent from node and adds it to the ancestors.

So at every step the ancestors imaginary temporary table contains a set of all parents discovered so far, and all their parents (that is, all nodes that consider the existing nodes their child) are added until no more new parents are discovered.

A common variant is to include a path length variable to quantify the degree of relationship (and for example also be able to exactly query for all paths of a specific length). Another technique is to not pass through the entire row in the CTE, but only operate on the primary keys and then JOIN for the remaining columns later. Let’s look at an example for all descendants of Methuselah:

WITH RECURSIVE descendants (id, n) AS (
  SELECT id, 0
    FROM node
    WHERE name = 'Methuselah'
  UNION
  SELECT child.id, parent.id+1
    FROM node child
      JOIN descendants parent ON parent.id = child.parent_id
)
SELECT descendants.n, node.name
  FROM descendants
    LEFT JOIN node ON descendants.id = node.id;

You should see a common pattern to keep in mind: In the second SELECT the node object (called child) conceptually matches our CTE name (descendants), while the CTE reference is the reverse (parent)! We’re adding a node child for every parent already in the CTE.

Bonus Round: CTEs in Django

The django-cte package allows using CTEs in the Django ORM with the normal query API. The equivalent Django code for the last example looks like this:

# model.py
from django.db import models
from django_cte import CTEManager

class Node(models.Model):
  objects = CTEManager()
  id = AutoField(primary_key=True)
  name = models.CharField(max_length=50, null=False)
  parent = models.ForeignKey("self", null=True, on_delete=CASCADE)
from django.db.models import Value, IntegerField
from django_cte import With
from .models import Node

descendants = With.recursive(
  lambda cte: Node.objects.filter(
      name="Methusaleh"
    ).values(
      "id",
      n=Value(0, output_field=IntegerField())
    ).union(
      cte.join(Node, parent_id=cte.col.id).values(
        "id",
        n=cte.col.n + Value(1, output_field=IntegerField())
      )
    )
)

descendants_set = descendants.join(
  Node, id=descendants.col.id
).with_cte(descendants).annotate(
  n=descendants.col.n
)

A few observations to keep in mind:

  • In Django all database expressions need to start with a QuerySet.
  • The django_cte.With object wraps a QuerySet to allow additional methods, including with_cte() which adds the common table expression. Remember that in SQL the CTE goes before the main query (the QuerySet), which might be confusing here.
  • In order to map the self-referential nature of WITH RECURSIVE to Python syntax, django_cte.With.recursive() takes a callable that produces a QuerySet. The callable receives a reference to the CTE.
  • django_cte.With.recursive() needs to be JOINed to the underlying Model to be useful. You also need to .annotate() any computed columns from your CTE to use them outside of the CTE.


  1. Don’t. Instead, please read Falsehoods Programmers Believe About Names 

  2. Workarounds include nested sets and materialized paths, but with additional requirements on the data structures and some maintenance cost on updates 

An Efficient Multi-Stage Build for Python Django in Docker

A Docker brand motorized tricycle, looks fragile and overladen

We’ve recently begun dockerizing our applications in an effort to make development and deployment easier. One of the challenges was establishing a good baseline Dockerfile which can maximize the benefits of Dockers caching mechanism and at the same time provide minimal application images without any superfluous contents.

The basic installation flow for any Django project (let’s call it foo) is simple enough:

export DJANGO_SETTINGS_MODULE=foo.settings
pip install -r requirements.txt
python manage.py collectstatic
python manage.py compilemessages
python manage.py migrate

(Note: In this blog post we’ll mostly ignore the commands to actually get the Django project running within a web server. We’ll end up using gunicorn with WSGI, but won’t comment further on it.)

This sequence isn’t suitable for a Dockerfile as-is, because the final command in the sequence creates the database within the container image. Except for very specific circumstances this is likely not desired. In a normal deployment the database is located either on a persistent volume mounted from outside, or in another container completely.

First lesson: The Django migrate command needs to be part of the container start script, as opposed to the container build script. It’s harmless/idempotent if the database is already fully migrated, but necessary on the first container start, and on every subsequent update that includes database migrations.

Baseline Dockerfile

A naive Dockerfile and accompanying start script would look like this:

# Dockerfile
FROM python:slim
ENV DJANGO_SETTINGS_MODULE foo.settings
RUN mkdir -p /app
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt gunicorn
RUN python manage.py collectstatic
RUN python manage.py compilemessages
ENTRYPOINT ["/app/docker-entrypoint.sh"]
# docker-entrypoint.sh
cd /app
python manage.py migrate
exec gunicorn --bind '[::]:80' --worker-tmp-dir /dev/shm --workers "${GUNICORN_WORKERS:-3}" foo.wsgi:application

(The --worker-tmp-dir bit is a workaround for the way Docker mounts /tmp. See Configuring Gunicorn for Docker.)

This approach does work, but has two drawbacks:

  • Large image size. The entire source checkout of our application will be in the final docker image. Also, depending on the package requirements we may need to apt-get install a compiler or development package before executing pip install. These will then also be in the final image (and on our production machine).
  • Long re-build time. Any change to the source directory will invalidate the Docker cache starting with line 6 in the Dockerfile. The pip install will be executed fully from scratch every time.

(Note: We’re using the slim Python docker image. The alpine image would be even smaller, but its use of the musl C library breaks some Python modules. Depending on your dependencies you might be able to swap in python:alpine instead of python:slim.)

Improved Caching

Docker caches all individual build steps, and can use the cache when the same step is applied to the same current state. In our naive Dockerfile all the expensive commands are dependent on the full state of the source checkout, so the cache cannot be used after even the tiniest code change.

The common solution looks like this:

# Dockerfile
FROM python:slim
ENV DJANGO_SETTINGS_MODULE foo.settings
RUN mkdir -p /app
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt gunicorn
COPY . .
RUN python manage.py collectstatic
RUN python manage.py compilemessages
ENTRYPOINT ["/app/docker-entrypoint.sh"]

In this version the pip install command on line 7 can be cached until the requirements.txt or the base image change. Re-build time is drastically reduced, but the image size is unaffected.

Building with setup.py

If we package up our Django project as a proper Python package with a setup.py, we can use pip to install it directly (and could also publish it to PyPI).

If the setup.py lists all project dependencies (including Django) in install_requires, then we’re able to execute (for example in a virtual environment):

pip install .

This will pre-compile and install all our dependencies, and then pre-compile and instal all our code, and install everything into the Python path. The main difference to the previous versions is that our own code is pre-compiled too, instead of just executed from the source checkout. There is little immediate effect from this: The interpreter startup might be slightly faster, because it doesn’t need to compile our code every time. In a web-app environment this is likely not noticeable.

But because our dependencies and our own code are now properly installed in the same place, we can drop our source code from the final container.

(We’ve also likely introduced a problem with non-code files, such as templates and graphics assets, in our project. They will by default not be installed by setup.py. We’ll take care of this later.)

Due to the way Docker works, all changed files of every build step cumulatively determine the final container size. If we install 150MB of build dependencies, 2MB of source code and docs, generate 1MB of pre-compiled code, then delete the build dependencies and source code, our image has grown by 153MB.

This accumulation is per step: Files that aren’t present after a step don’t count towards the total space usage. A common workaround is to stuff the entire build into one step. This approach completely negates any caching: Any change in the source files (which are necessarily part of the step) also requires a complete redo of all dependencies.

Enter multi-stage build: At any point in the Dockerfile we’re allowed to use a new FROM step to create a whole new image within the same file. Later steps can refer to previous images, but only the last image of the file will be considered the output of the image build process.

How do we get the compiled Python code from one image to the next? The Docker COPY command has an optional --from= argument to specify an image as source. 

Which files do we copy over? By default, pip installs everything into /usr/local, so we could copy that. An even better approach is to use pip install --prefix=... to install into an isolated non-standard location. This allows us to grab all the files related to our project and no others.

# Dockerfile
FROM python:slim as common-base

ENV DJANGO_SETTINGS_MODULE foo.settings

# Intermediate image, all compilation takes place here
FROM common-base as builder

RUN pip install -U pip setuptools

RUN mkdir -p /app
WORKDIR /app

RUN apt-get update && apt-get install -y build-essential python3-dev
RUN mkdir -p /install

COPY . .

RUN sh -c 'pip install --no-warn-script-location --prefix=/install .'
RUN cp -r /install/* /usr/local
RUN sh -c 'python manage.py collectstatic --no-input'

# Final image, just copy over pre-compiled files
FROM common-base

RUN mkdir -p /app
COPY docker-entrypoint.sh /app/
COPY --from=builder /install /usr/local
COPY --from=builder /app/static.dist /app/static.dist

ENTRYPOINT ["/app/docker-entrypoint.sh"]

This will drastically reduce our final image size since neither the build-essential packages, nor any of the source dependencies are part of it. However, we’re back to our cache-invalidation problem: Any code change invalidates all caches starting at line 17, requiring Docker to redo the full Python dependency installation.

One possible solution is to re-use the previous trick of copying the requirements.txt first, in isolation, to only install the dependencies. But that would mean we need to manage dependencies in both requirements.txt and setup.py. Is there an easier way?

Multi-Stage, Cache-Friendly Build

The command setup.py egg_info will create a foo.egg-info directory with various bits of information about the package, including a requirements.txt.

We’ll execute egg_info in an isolated image, copy the requirements.txt to a new image (in order to be independent from changes in setup.py other than the list of requirements), then install dependencies using the generated requirements.txt. Up to here these steps are fully cacheable unless the list of project dependencies changes. Afterwards we’ll proceed in the usual fashion by copying over the remaining source code and installing it.

(One snap: The generated requirements.txt also contains all possible extras listed in setup.py, under bracket separated sections such as [dev]. pip cannot handle that, so we’ll use grep to cut the generated requirements.txt at the first blank line.)

# Dockerfile
FROM python:slim as common-base

ENV DJANGO_SETTINGS_MODULE foo.settings

FROM common-base as base-builder

RUN pip install -U pip setuptools

RUN mkdir -p /app
WORKDIR /app

# Stage 1: Extract dependency information from setup.py alone
#  Allows docker caching until setup.py changes
FROM base-builder as dependencies

COPY setup.py .
RUN python setup.py egg_info

# Stage 2: Install dependencies based on the information extracted from the previous step
#  Caveat: Expects an empty line between base dependencies and extras, doesn't install extras
# Also installs gunicon in the same step
FROM base-builder as builder
RUN apt-get update && apt-get install -y build-essential python3-dev
RUN mkdir -p /install
COPY --from=dependencies /app/foo.egg-info/requires.txt /tmp/
RUN sh -c 'pip install --no-warn-script-location --prefix=/install $(grep -e ^$ -m 1 -B 9999 /tmp/requires.txt) gunicorn'

# Everything up to here should be fully cacheable unless dependencies change
# Now copy the application code

COPY . .

# Stage 3: Install application
RUN sh -c 'pip install --no-warn-script-location --prefix=/install .'

# Stage 4: Install application into a temporary container, in order to have both source and compiled files
#  Compile static assets
FROM builder as static-builder

RUN cp -r /install/* /usr/local

RUN sh -c 'python manage.py collectstatic --no-input'

# Stage 5: Install compiled static assets and support files into clean image
FROM common-base

RUN mkdir -p /app
COPY docker-entrypoint.sh /app/
COPY --from=builder /install /usr/local
COPY --from=static-builder /app/static.dist /app/static.dist

ENTRYPOINT ["/app/docker-entrypoint.sh"]

Addendum: Handling data files

When converting your project to be installable with setup.py, you should make sure that you’re not missing any files in the final build. Run setup.py egg_info and then check the generated foo.egg-info/SOURCES.txt for missing files.

A common trip-up is the distinction between Python packages and ordinary directories. By definition a Python package is a directory that contains an __init__.py file (can be empty). By default setup.py only installs Python packages. So make sure you’ve got __init__.py files also on all intermediate directory levels of your code (check in management/commands, for example).

If your project uses templates or other data files (not covered by collectstatic), you need to do two things to get setup.py to pick them up:

  • Set include_package_data=True in the call to setuptools.setup() in setup.py.
  • Add a MANIFEST.in file next to setup.py that contains instructions to include your data files.
    The most straightforward way for a template directory is something like recursive-include foo/templates *

The section on Including Data Files in the setuptools documentation covers this in more detail.

Showing SQL Queries with Pytest and Django

I have a Django based project, and am doing unit tests with py.test. To debug a test failure it’s sometimes useful to see the actual SQL queries that Django emitted, which is surprisingly hard. I assumed that that would be such an obvious and common need, that a simple switch (for pytest-django) or easy plugin would exist to simply output SQL queries as they are executed.

It is a common need alright (1, 2, 3), but the correct solution is surprisingly unwieldy1.

For one, there is no existing helper or plugin. There are helpers and plugins to count queries and assert a certain query count, which as a side effect track all queries and print the executed queries on query count assertion failure, but I’ve yet to find any case where that would be useful to me. More importantly it’s useless for the exact case here: The stored list of queries is only printed if the expected query count is not matched, not in any other case, such as, say, a failing unit test which you’d want to debug by inspecting the queries that were executed.

Therefore: Fuck it, let’s do it live. Django tracks all queries in the connection object, but in general only if DEBUG=True. For various reasons, tests are executed with DEBUG=False, which is a good thing, since you want to test close to production. Django does provide a context helper to temporarily enable query tracking on a connection which we’ll use instead.

Putting it together, we need to transform a humble test such as

@pytest.mark.django_db
def test_frobnicate_foo(foo):
    assert foo.frobnicate()

into

@pytest.mark.django_db
def test_frobnicate_foo(foo):
    from django.db import connection
    from django.test.utils import CaptureQueriesContext
    with CaptureQueriesContext(connection):
        assert foo.frobnicate(), connection.queries[0]['sql']

in order to see the value of the first SQL query in case of assertion failure.

At some point someone™ should write a generic plugin to do that.


  1. There are several incorrect solutions on StackOverflow, such as the one that starts with “First, subclass TestCase”, which doesn’t apply to py.test, or the ever helpful “try using django-debug-toolbar”, which doesn’t apply to unit tests in general. 

Auditing User Intent in Closed Source IoT Applications

Amazon Echo Dot

(Header image under CC-BY by Gregory Varnum)

Hardware-backed voice assistants like Amazon Alexa and Google Assistant have received some criticism for their handling of voice data behind the scenes. The companies had outsourced quality control/machine learning feedback to external contractors who received voice recordings of user commands and were tasked to improve the assistants’ recognition of voice and intent. This came as a surprise to many users, who only expected their voice commands to be processed by automated systems and not listened to by actual humans.

It is worth noting that the user intent for their voice to be recorded and sent to the cloud was generally not called into question: While the devices listen all the time, their recording and sending only starts after they detect a so-called wake word: “Alexa” or “Hey Google”. There are scattered reports of accidental activation with similar sounds (“Alec, say, what’s up?”), but on balance this part of the system appears to be reasonably robust. Accidental activation is always mitigated by the fact that the devices clearly indicate their current mode: Recognition of the wake word triggers a confirmation sound and LEDs to light up.

New reporting shows how a malicious actor can get around this user intent in a limited fashion: Several sets of bugs in the system design allowed the assistant to stay awake and send recordings to the attacker even when a user might reasonably expect them not to be. It is important to note that these bugs are not remote-access vulnerabilities! User intent is still necessary to start the interaction, it’s just that the interaction lasts longer than the user expects. Also, none of the local safeguards against undected listening are impacted: The LEDs still light up and indicate an attentive assistant.

It is in the companies’ best interest to not be found spying on their users, and the easiest way to achieve that is by not doing it. Amazon, specifically, tries very hard to be seen as privacy-preserving because that enables additional services for them. Their in-home delivery service is absolutely dependent on consumers trusting them to open their doors for the delivery driver (who in turn is instructed not to actually enter the home, but just drop the package right on the other side of the door, and is filmed doing so). Amazon demonstrates their willingness to at least appear privacy-respecting on other fronts too: The microphone-off button on the Amazon Echo devices cuts power to the microphone array and lights up a “mute” LED: it’s impossible to turn on the LED under software control. When the LED is on, the microphone is off.

The primary concern still is an issue of trust: Do I trust the device to only record and transmit audio when I intend for it to do so? In theory the device manufacturer could have the device surreptitiously record everything. There’s no easy way to audit either the device or its connections to the outside world. Some progress has been made to extract and analyse device firmware, but ultimately this cannot rule out a silent firmware update with listening capabilities at a later date.

An Auditing System to Confirm User Intent

This essay proposes a system in which users can gain confidence that they are not surreptitiously monitored, without requiring a device manufacturer to give up any of their proprietary secrets. It assumes cooperation on the part of the manufacturer and a certain technical expertise on the part of at least some of the users.

Step 1: The manufacturer augments their back-end systems to log device activity and TLS session keys, and keeps these logs for a certain number of days.

Step 2: The end user passively records all incoming and outgoing traffic from the device. Obviously only a small percentage of end users will be able to do that and only a fraction of those will actually record the traffic. But since the manufacturer cannot be sure which devices are being monitored, they risk detection if they tamper with any of them.

Step 3: The user requests a list of sessions keys from the manufacturer and uses it to decrypt the captured connections.

Step 4: The manufacturer provides a machine-readable list of activities of the device, both user-initiated (such as queries to a voice assistant) and automated (such as firmware updates, or normal device telemetry).

Step 5: Analysis software matches the list of device activities to the recorded connections and flags any suspicious activity. The software should be open source, initially provided by the manufacturer, and be extensible by the community at large.

Step 6: The user can cross-check the now-vetted list of device activities to confirm whether it matches their intent.

The most impractical step is number 2: Only few users would bother to configure their networks in a way that allows the device traffic to be monitored. However, I believe that even the possibility of monitoring should deter malicious behaviour. This step is also most easily supported by third-party tools: An OpenWRT extension for example would greatly simplify the recording for users of OpenWRT, and other CPE manufacturers could follow suit1.

The IoT device manufacturer may want to keep some data — such as firmware update files, or received audio streams — proprietary and secret. They must do so in a manner that allows the analysis tool to confirm that only downstream data is withheld: Either by using a separate, at-rest, encryption layer inside the TLS connection, or by using a separate TLS connection to a special endpoint which carries only the absolutely minimal amount of information (one small HTTP request) in the upstream direction. The analysis tool is then able to ignore the contents of this proprietary data while still being able to flag anomalies in the meta data (“Three 150MB firmware updates in a day? Really?”).

Rationale: A scheme that forcibly opens up firmware files or DRM’ed audio streams would be a non-starter for industry adoption. Decrypting this downstream content isn’t necessary for the goal of confirming user intent. Conversely all information carried on the upstream channel by definition belongs to the user, since they generated it. (If they didn’t generate it, it wouldn’t need to be transferred.)

Potential for abuse: When suggesting to store new kinds of data (step 1) it’s important to analyze the potential for abuse this data has, be it from law-enforcement agencies or from vengeful ex-partners. I believe no new threat is introduced here: The current backends of manufacturers with voice assistants already store voice recordings and generally give the option to look at the device history or download recordings (both to LEAs and to anyone with account access). The data recorded in step 1 should give no additional insight into the user behaviour beyond what is already recorded under the status quo — except that it allows to confirm the completeness of the log.


  1. This opens up the question on whether one trusts their CPE manufacturer to build correct logging and to not collude with the IoT device manufacturer. 

KY-040 Rotary Encoder with Linux on the Raspberry Pi

For a project I with a Raspberry Pi (Zero W) needed a simple and easy input device to change a numerical value. So I bought some rotary encoders off Amazon.

If you search the Internet for information/tutorials on how to use a “KY-040” rotary encoder with Linux and the Raspberry Pi, you’ll find a dozen people or so who’ve done that, and written about it. Naturally, I’ll need to do that too — with a twist.

The most often referenced code to operate the KY-040 seems to be from Martin O’Hanlons Raspberry Pi and KY040 Rotary Encoder blog post, which even ended up being a python module KY040. The down side of this approach is that it’s very unreliable: The code triggers on an edge of one of the GPIO inputs and then reads the other input, all in Python code. To make this sort of work it needs long debounce times. The net result is that the code misses many turn events if the shaft is turned too fast, and sometimes gives the wrong turn direction.

I’m also not positive that Martins explanations of the signal output of the encoder is correct. Keyes KY-040 Arduino Rotary Encoder User Manual has a different explanation for the working principle, and some Arduino code. The difference is that, although the pins on the module are marked “CLK” and “DT” (for clock and data), it’s more common for rotary encoders to simply have pin “A” and pin “B”.

This matches what I’ve seen on this module: With pins A and B the most important distinction is the order in which they generate edges. If you only look for edges on one pin (“clock”) and then sample the other pin (“data”) you’ll kind-of also get information about the turns, but depending on the edge rate and sample speed it’s going to be unreliable.

It’s possible to observe both edges in Python with RPi.GPIO, but again, there’s a lot of overhead for what should be mostly real-time processing and is not fully reliable.

Good thing we’re using Linux which has device drivers for all sorts of things. Including, of course, a rotary encoder connected to GPIOs: rotary-encoder.txt (includes nice ASCII art on the operational principle).

Good thing also we’re using the Raspberry Pi, which has a matching device tree overlay (README for precompiled firmware).

(Note: If you’re compiling your own kernel, you’ll need the Raspberry Pi kernel starting with 4.9.y, CONFIG_INPUT_GPIO_ROTARY_ENCODER, and you’ll probably also want CONFIG_INPUT_EVDEV. The rpi-firmware with compiled overlays needs to be recent-ish, ~mid January 2018, for these examples to work.)

To enable/configure the rotary-encoder device tree overlay, simply put something like the following into /boot/config.txt (with the encoder connected to pins 23 and 24 on the Raspberry Pi):

# enable rotary encoder
dtoverlay=rotary-encoder,pin_a=23,pin_b=24,relative_axis=1

After a reboot you’ll have a new device in /dev/input/ for the rotary encoder. You can use the evtest tool (as in evtest /dev/input/event0) to look at the events it generates and confirm that it reacts perfectly to every turn of the encoder, without missing a movement or confusing the direction.

While you’re at it, you might also want to add the middle button as a key (mine is connected to pin 22):

dtoverlay=gpio-key,gpio=22,keycode=28,label="ENTER"

In order to make use of this in your Python programs: Use python-evdev.

# -*- encoding: utf-8 -*-
from __future__ import print_function

import evdev
import select

devices = [evdev.InputDevice(fn) for fn in evdev.list_devices()]
devices = {dev.fd: dev for dev in devices}

value = 1
print("Value: {0}".format(value))

done = False
while not done:
  r, w, x = select.select(devices, [], [])
  for fd in r:
    for event in devices[fd].read():
      event = evdev.util.categorize(event)
      if isinstance(event, evdev.events.RelEvent):
        value = value + event.event.value
        print("Value: {0}".format(value))
      elif isinstance(event, evdev.events.KeyEvent):
        if event.keycode == "KEY_ENTER" and event.keystate == event.key_up:
          done = True
          break

Calculating an RSA Public Key from Two Signatures

Interesting problem: You have RSA signatures and the signed data, and want to know the RSA public key that can be used to verify the signatures. For older signature schemes this is possible, if you have at least two signatures (or an oracle that can provide signatures on request).

Math is not my strong suit, but I found the necessary formula in this Cryptography StackExchange post: RSA public key recovery from signatures. It has the general idea, but is light on details and actual code.

Tools:

  • OpenSSL to generate examples
  • SageMath for the actual calculations. It has an absolutely wonderful Jupyter notebook interface.

First, let’s generate an example key and two example files. We’ll use 512 bits RSA for this example, which is about the minimum key size we can use, just to keep the examples short (in both screen real estate and calculation size). Don’t worry: while the calculation is ~30 seconds for 512 bits RSA, it’ll only grow to ~2.5 minutes for real-world 2048 bits RSA.

$ echo "Hallo, Welt" > hallowelt.txt
$ echo "Hallo, Otto" > hallootto.txt
$ openssl genrsa 512 > privkey.pem

RSA signatures are complicated beasts. In theory, you only have to hash the input and apply the RSA operation with the private key (that is, ‘decrypt’ it), but for various reasons this is highly insecure and never done in practice.

Instead, we’ll let OpenSSL handle the generation of signatures for our examples:

$ openssl dgst < hallowelt.txt -out hallowelt.txt.sig -sign privkey.pem
$ openssl dgst < hallootto.txt -out hallootto.txt.sig -sign privkey.pem

The resultant *.sig files are 64 bytes each, matching the 512 bit RSA modulus.

To better understand the RSA signature generation process (and prepare the next step), let’s look ‘into’ the signatures:

$ openssl rsautl -encrypt -inkey privkey.pem -in hallowelt.txt.sig -raw | hd
00000000  00 01 ff ff ff ff ff ff  ff ff ff ff 00 30 31 30  |.............010|
00000010  0d 06 09 60 86 48 01 65  03 04 02 01 05 00 04 20  |...`.H.e....... |
00000020  3e 6f f8 06 a5 b4 e7 e6  d7 4d 26 7f e3 db 90 a2  |>o.......M&.....|
00000030  e2 bc a3 70 e3 db 9b 10  73 fd 55 e1 06 a1 0c 2a  |...p....s.U....*|
$ openssl rsautl -encrypt -inkey privkey.pem -in hallowelt.txt.sig -raw | openssl asn1parse -offset 13 -inform der
    0:d=0  hl=2 l=  49 cons: SEQUENCE
    2:d=1  hl=2 l=  13 cons: SEQUENCE
    4:d=2  hl=2 l=   9 prim: OBJECT            :sha256
   15:d=2  hl=2 l=   0 prim: NULL
   17:d=1  hl=2 l=  32 prim: OCTET STRING      [HEX DUMP]:3E6FF806A5B4E7E6D74D267FE3DB90A2E2BCA370E3DB9B1073FD55E106A10C2A
$ openssl dgst -sha256 hallowelt.txt
SHA256(hallowelt.txt)= 3e6ff806a5b4e7e6d74d267fe3db90a2e2bca370e3db9b1073fd55e106a10c2a

The first step ‘encrypts’ the signature (that is: applies the RSA operation with the public key) and prints a hexdump of the result. In the hexdump we see:

  • Some padding: 00 01 ff ff … ff 00
  • An ASN.1 structure, consisting of 
    • A sequence (tag 30, 49 bytes), of 
      • A sequence (tag 30, 13 bytes), of 
        • An object identifier (tag 06, 9 bytes) for sha256
        • A NULL value (tag 05, 0 bytes)
      • An octet string (tag 04, 32 bytes) with 
        • The SHA-256 hash (3e6ff8…a10c2a) of the signed data

The signature follows the PKCS#1 standard for RSA signatures. All the extra stuff serves to distinguish signatures with SHA-256 from signatures with other hashes, and to prevent some attacks on the padding. It’s also the reason why we can’t go much below 512 bits RSA if we want to demo with SHA-256. (It must be noted that PKCS#1 padding shouldn’t be used anymore. The new standard is RSASSA-PSS, which has a robust security proof, but also is randomized and completely foils the technique in this blog post.)

Let’s define the first set of functions to generate this sort of padding:

import hashlib
def pkcs1_padding(size_bytes, hexdigest, hashfn):
    oid = {hashlib.sha256: '608648016503040201'}[hashfn]
    result = '06' + ("%02X" % (len(oid)/2)) + oid + '05' + '00'
    result = '30' + ("%02X" % (len(result)/2)) + result
    
    result = result + '04' + ("%02X" % (len(hexdigest)/2)) + hexdigest
    result = '30' + ("%02X" % (len(result)/2)) + result
    
    result = '0001' + ('ff' * int(size_bytes - 3 - len(result)/2) ) + '00' + result
    return result

def hash_pad(size_bytes, data, hashfn):
    hexdigest = hashfn(data).hexdigest()
    return pkcs1_padding(size_bytes, hexdigest, hashfn)

A simple test:

hash_pad(64, "Hallo, Welt\n", hashlib.sha256)

‘0001ffffffffffffffffffff003031300D0609608648016503040201050004203e6ff806a5b4e7e6d74d267fe3db90a2e2bca370e3db9b1073fd55e106a10c2a’

To perform the gcd calculation, you need for each signature the corresponding signed data, the hash function used, and the public exponent of the RSA key pair. Both hash function and public exponent may need to be guessed, but the hash is usually SHA-256, and the exponent is usually 0x10001 (65537) or 3.

The full code is as follows:

import binascii, hashlib
def message_sig_pair(size_bytes, data, signature, hashfn=hashlib.sha256):
    return ( Integer('0x' + hash_pad(size_bytes, data, hashfn)), Integer('0x' + binascii.hexlify(signature)) )

def find_n(*filenames):
    data_raw = []
    signature_raw = []
    for fn in filenames:
        data_raw.append( open(fn, 'rb').read() )
        signature_raw.append( open(fn+'.sig', 'rb').read() )
    size_bytes = len(signature_raw[0])
    if any(len(s) != size_bytes for s in signature_raw):
        raise Exception("All signature sizes must be identical")
    
    for hashfn in [hashlib.sha256]:
        pairs = [message_sig_pair(size_bytes, m, s, hashfn) for (m,s) in zip(data_raw, signature_raw)]
        for e in [0x10001, 3, 17]:
            gcd_input = [ (s^e - m) for (m,s) in pairs ]
            result = gcd(*gcd_input)
            if result != 1:
                return (hashfn, e, result)

If we test it, we’ll find:

time hashfn, e, n = find_n('hallowelt.txt', 'hallootto.txt');

CPU times: user 27.3 s, sys: 609 ms, total: 27.9 s
Wall time: 28.4 s

print hex(n)

d9dac509621ed7f27b4868ab1874f649778c63f11000366e827cf18fd70db1e27f39902524e29aa2bfb3167627caaa408e17e907ee3c44e0321dc77fb8890075

And compare to the ground truth of our example:

$ openssl rsa -in privkey.pem -noout -text
Private-Key: (512 bit)
modulus:
    00:d9:da:c5:09:62:1e:d7:f2:7b:48:68:ab:18:74:
    f6:49:77:8c:63:f1:10:00:36:6e:82:7c:f1:8f:d7:
    0d:b1:e2:7f:39:90:25:24:e2:9a:a2:bf:b3:16:76:
    27:ca:aa:40:8e:17:e9:07:ee:3c:44:e0:32:1d:c7:
    7f:b8:89:00:75
publicExponent: 65537 (0x10001)
[...]

Finally, to create a standard format PEM public key from our n and e:

from Crypto.PublicKey import RSA
print RSA.construct( (long(n), long(e)) ).exportKey(format='PEM')

—–BEGIN PUBLIC KEY—–
MFwwDQYJKoZIhvcNAQEBBQADSwAwSAJBANnaxQliHtfye0hoqxh09kl3jGPxEAA2
boJ88Y/XDbHifzmQJSTimqK/sxZ2J8qqQI4X6QfuPETgMh3Hf7iJAHUCAwEAAQ==
—–END PUBLIC KEY—–

Which is exactly what we would get from OpenSSL:

$ openssl rsa -in privkey.pem -pubout
writing RSA key
-----BEGIN PUBLIC KEY-----
MFwwDQYJKoZIhvcNAQEBBQADSwAwSAJBANnaxQliHtfye0hoqxh09kl3jGPxEAA2
boJ88Y/XDbHifzmQJSTimqK/sxZ2J8qqQI4X6QfuPETgMh3Hf7iJAHUCAwEAAQ==
-----END PUBLIC KEY-----

Warning: The gcd method may sometimes return not n but a product k * n for a smallish value of k. You may need to check for small prime factors and remove them.

The code is available as a SageMath Jupyter notebook: rsa_find_n.ipynb (all example files).

The find_n function is written to accept arbitrary arguments (must be filenames where the file contains the data and the filename appended with .sig contains the signature) but will only work when given exactly two arguments. FactHacks: Batch gcd has a batchgcd_faster function that will work on an arbitrary number of arguments (but is slower in the 2‑argument case).

Power Sockets of India

I almost always have my trusty earthed world travel adapter with me when I’m traveling, which allows me to have a proper German Schutzkontakt power socket in most parts of the world:
Earthed World Travel AdapterThat’s why I was very confused when, on a recent trip to India, I wasn’t immediately able to plug in, and became fascinated with the multitude of different power sockets I saw over there.

The first type of socket I encountered were these, and they match none of the configurations on my adapter:Indian Power Sockets 1As it turns out, these are dual sockets for IS 1293 (which basically is a form of the old British standard BS 546). The old British system of sockets had different sockets and plugs for different amperages! A 5A plug would fit only in a 5A socket, and a 15A plug was different and would fit only in a 15A socket. Since this is a very stupid usability problem (what do you do if you want to connect a 5A appliance but only have a 15A socket?), apparently all newer sockets would always be rated at the higher amperage and include multiple holes to also connect lower amperage plugs.

I’ve never seen this type when visiting the UK, and obviously my adapter doesn’t support that. So, no power for me. Also note that even though the lowermost set of holes would in principle be able to receive a Europlug, this doesn’t work, because there’s a plastic shutter that is actuated by the grounding pin.

The next type of socket I saw had a slightly more intelligent layout:
Indian Power Socket 2This one also supports both IS 1293 pin configurations, but by slightly moving the lower amperage holes they also made this thing fit a ‘normal’ British BS 1363 (aka Type G) plug which I had available on my adapter.

The sockets in my hotel were obviously geared more towards non-domestic plugs:
Indian Power Socket 3This will fit: Europlug, British BS 1363, IS 1293 (not sure if it’s good for both amperage configurations or just one of them), American NEMA, and Australian.

The sockets at the airport were a slightly advanced version, adding a center pin to the Europlug (not sure if Brazilian, Italian, or Swiss):
Indian Power Sockets 4But the true king of sockets I found in the offices I visited:
Indian Power Socket 5This thing is compatible with any of the standards listed above, including all possible configurations of my travel adapter. Or, in the words of a wise man:

The nice thing about standards is that you have so many to choose from. Andrew S. Tanenbaum, Computer Networks

NGINX, FastCGI, HTTPS, and WordPress

I have a setup in which a frontend nginx server handles multiple names on one IP for HTTP port 80 and HTTPS port 443 (through SNI) and forwards the requests to distinct backend HTTP servers, based on name and/or path. One of these backends is a WordPress installation, and that one is problematic: WordPress tends to insert absolute URLs into everything. And since it doesn’t know that the frontend was accessed through HTTPS, it inserts HTTP URLs. Worse, when using the WordPress HTTPS plugin and choosing to force administrative logins through HTTPS only, you can end up with an endless redirect loop.

A quasi standard signal from frontend to backend in this case is the X‑Forwarded-Ssl HTTP header, which, when set to on, should indicate that the frontend used HTTPS. Unfortunately, WordPress ignores the header. Instead, is_ssl() only checks the HTTPS server variable (and alternatively checks the server port to be 443).

The WordPress documentation referenced above offers a simple fix by overwriting the HTTPS variable in the wp-config.php file. Instead, for nginx with FastCGI, I propose this simpler, more generic, and, in my humble opinion, more elegant fix:

The default nginx FastCGI configuration includes a file fastcgi_params, which, among others, has this line:

fastcgi_param   HTTPS                   $https if_not_empty;

Below this simply add

fastcgi_param   HTTPS                   $http_x_forwarded_ssl if_not_empty;

(and of course have something along the lines of

proxy_set_header X-Forwarded-Ssl on;

in the corresponding proxy configuration on the frontend nginx)

This will set the HTTPS server variable when the X‑Forwarded-Ssl HTTP header is set, allowing all kinds of PHP programs to transparently discover that they’re used to through HTTPS.