Docker Best Practices: Read-Only Containers

“It works on my machine.”
“OK, we’ll ship your machine.”
The origins of Docker 😉

The core value proposition of Linux containers (e.g. Docker) is to provide an isolated, repeatable execution environment for software. To this end, a container image provides a fully prepared root filesystem, together with a recommended execution command line, that are suitable for running one instance of the software, called a container.

The distinction is important: An image is an immutable artifact, fully checksummed, possibly signed, and used to create an arbitrary number of container instances, that all behave identically when executed. The glue in-between is the container runtime which sets up containers from images to resemble mostly normal-looking computing systems that can execute the software stored on their root filesystems.

Containers look like normal environments

One of the key ideas here is that the environment inside a container looks so much like a normal (Linux) environment that most software will run unaltered. In the general case, programs require no modification, and no knowledge of containers at all, to run successfully in a containerized environment. One concession the runtime makes to this execution model is that it allows writes to the container root filesystem (subject to normal access control permissions), since this is usually possible in normal environments. Obviously a container must not be allowed to write to the underlying image (which would affect other containers based on the same image), so these writes are private to each container.

How each container runtime achieves this is an implementation detail that’s not relevant to the general discussion. In the case of Docker on Linux, it will use union mounts to create a stack with the common read-only image below and an empty container-specific read-write directory at the top. Writes go to the container-specific directory, while reads come either from there or from the image.

In fact, this mechanism is applied recursively: The “image” is also just a stack of union mounts of different layers. The temporary directory of a container can be committed to become the top-most layer in a new image. This is how container images are originally built in the Docker model.

This is, however, not the only conceivable implementation. It’s fully possible that a runtime would just unpack the container image into an empty directory, which then serves as the container root directory, with no union mount necessary.
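
To make the read/write rule concrete, here is a toy model in plain shell. It is only a sketch of the lookup order, not how Docker implements it: two ordinary temporary directories stand in for the image and the container-specific layer, the function names are made up for illustration, and deletions (which real union filesystems also have to track) are ignored.

```shell
#!/bin/sh
# Toy model of a union mount using two plain directories (no root needed).
# LOWER stands in for the read-only image, UPPER for the per-container layer.
LOWER=$(mktemp -d)
UPPER=$(mktemp -d)

# Reads prefer the container layer and fall back to the image.
union_read() {
    if [ -e "$UPPER/$1" ]; then
        cat "$UPPER/$1"
    else
        cat "$LOWER/$1"
    fi
}

# Writes always land in the container layer; the image is never touched.
union_write() {
    mkdir -p "$(dirname "$UPPER/$1")"
    printf '%s\n' "$2" > "$UPPER/$1"
}

echo "from image" > "$LOWER/motd"
union_read motd                          # prints: from image
union_write motd "changed in container"
union_read motd                          # prints: changed in container
cat "$LOWER/motd"                        # prints: from image
```

Throwing away UPPER is all it takes to get back to the pristine image, which is exactly what happens when a container is removed.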

Containers are updated by updating images

One container is always bound to exactly one image, constituting its root directory. There is no defined way to change (e.g. update) the image underlying a container. The solution to this is to just create a new container based on a new (updated) image and remove the old container. This works fine for software that is entirely stateless. The container is just an ephemeral artifact to provide the necessary environment for the code, and can be started and stopped and removed at will (usually under control of some container management system). This is true even if the software keeps state: All mutable state should be stored in volumes (usually implemented as bind mounts) outside of the container (and image). Usually the runtime even provides for temporary directories (on Linux: tmpfs) that are valid for only one boot and not persisted anywhere.

This convention provides for very clean separation of concerns:

Executable code is in the image: read-only, checksummed, updates under control of the container manager
Mutable state is on volumes: read-write, outside the container, can be backed up separately
Temporary state is in tmpfs, not persisted and handled automatically
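
The update-by-replacement convention can be sketched as a small shell function. This is an illustration under assumptions, not a recommendation to manage containers by hand: the function name, the /data mount point and the example names are hypothetical, and a container manager would normally do this for you.

```shell
#!/bin/sh
# Replace a container with a fresh one based on an updated image.
# $1 = container name, $2 = image reference, $3 = named volume holding state.
# The /data mount point is an assumption; use whatever path the image expects.
recreate_container() {
    docker pull "$2" || return 1
    # Remove the old container if there is one; its state lives in the volume.
    docker stop "$1" >/dev/null 2>&1 && docker rm "$1" >/dev/null 2>&1
    docker run --detach --name "$1" --volume "$3:/data" "$2"
}

# Example: recreate_container app example/app:2.0 app-data
```

Nothing from the old container is carried over except what was written to the volume.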

Containers that modify themselves are problematic

Except for one problem: The container root filesystem itself is still read-write. Badly written, or misconfigured, software could still store state to the container itself. Worse, it might even modify code paths in the container. This has two possible consequences:

State stored in the container is lost on container re-creation (e.g. due to image update)
The execution environment is no longer repeatable, since it might depend on a previous execution of the container (different from other instances of the same image)

The solution is simple, but comes with a slew of consequences: Make the container root filesystem read-only. Docker has the --read-only command-line flag, Docker Compose has the read_only: true service configuration. In Docker, this has the effect of creating a container that contains just the image as root filesystem, and volume mounts, but cannot be written to. Self-evidently this is the right thing to do: All mutations should either go to mutable or temporary state, and the container should only contain code or immutable data and must never be changed anyway. Personally, one of my biggest gripes with Docker is that this is not the default mode of operation.
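
On the plain Docker command line this might look like the following sketch. The image name, volume name and the /data mount point are placeholders, and the function wrapper only exists to keep the example self-contained.

```shell
#!/bin/sh
# Start a container whose root filesystem cannot be written to.
# $1 = image reference, $2 = named volume for mutable state.
# /tmp gets a tmpfs for temporary state; /data is an assumed mount point.
run_read_only() {
    docker run --rm \
        --read-only \
        --tmpfs /tmp \
        --volume "$2:/data" \
        "$1"
}

# Example: run_read_only example/app:1.0 app-data
```

Inside such a container, a write anywhere outside /tmp and /data should fail with a "Read-only file system" error, which is the whole point.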

Strategies for handling read-only containers

Since read-only containers are not the Docker default, there are a lot of container images out there (many of them official Docker Hub images of their respective projects) that won’t work out of the box in read-only mode. And some won’t work at all.

Runtime directories

The most common (and legitimate) reason for a container image not working in read-only mode is that the software requires access to a writable /run or /var/run directory. Per the Filesystem Hierarchy Standard, this directory should contain boot-level temporary state, and it’s fully appropriate for software to need to write there. Unfortunately the current container image specification has no way to declare that a container image wants these directories, so they cannot be provided by the runtime automatically.
Solution: Find out which directories the application needs and use a tmpfs mount. For example for PostgreSQL in Docker Compose: tmpfs: ["/tmp", "/var/run/postgresql"]

Temporary state

tmpfs is also the solution to many other similar problems. Some software will want to write temporary state to weird directories and you’ll have to analyze the startup error message it gives and find out what it wants. For example I have Keycloak running with /opt/keycloak/data/tmp:mode=1777. At this stage you should also distinguish whether it really is temporary state or more likely mutable state to be kept across invocations and versions, which should be a normal data volume.
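
As an alternative to reading error messages one at a time, you can let the software run once in a normal, writable container and then ask Docker what it touched. This is a different technique from the one described above: it relies on the docker diff command, which lists paths that were added (A), changed (C) or deleted (D) relative to the image. The function name is made up, and the output is only a list of candidates that you still have to sort into temporary and mutable state.

```shell
#!/bin/sh
# Print the most specific paths a container has written to, one per line.
# $1 = name or ID of a container that was started WITHOUT --read-only.
# Parent directories that only show up because a child changed are dropped.
written_paths() {
    docker diff "$1" | awk '
        { sub(/^[ACD] /, ""); paths[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                leaf = 1
                for (j = 1; j <= NR; j++)
                    if (index(paths[j], paths[i] "/") == 1) leaf = 0
                if (leaf) print paths[i]
            }
        }'
}

# Example: written_paths my-test-container
```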

Commingled code

Some software, this is especially common with PHP projects, likes to commingle code and data, by writing to its own installation directory. Sometimes there are configuration directives you can set to point it to write somewhere else, into a data volume.
In some cases this problem cannot be solved at all, WordPress being among them. In that case, the container image is just used as a template for initializing a data volume with the code, and then code updates happen only within the data volume.

External configuration override

Some container images are prepared so that they apply external configuration to the installed software, by modifying configuration or startup files on container startup, before calling the target software itself. Preferably all such customization only happens in /tmp/ (see above), or can be configured to only happen in such temporary directories. This may be difficult in some cases, so sometimes you’ll want to apply the configuration by hand, externally to the container, and mount the resulting configuration file read-only into the container, disabling the built-in customization (e.g. when handling /etc/nginx/).

Lost causes

And finally, some container images are built in batshit reckless mode, running something like apt update && apt install foo && foo (or equivalent) at startup. That is, they install the software or some dependencies only at container runtime, sometimes every time. This of course throws out all benefits of reproducibility and reliability. The container image doesn’t even contain the software it’s supposed to run. Your only choice is to not use these container images at all and use the fallback below. And maybe give feedback to the image creator.

Benefits of read-only containers

Having the container root filesystem read-only fits into the general scheme of containerized deployment, and frankly should’ve been the default, and offers operational benefits:

Implementing write-xor-execute on the filesystem level (in combination with all data mounts being set to noexec, which Docker applies by default to tmpfs mounts but not to volumes, where it has to be arranged separately). This is important for a consistent, repeatable, reproducible (in conjunction with all configuration data) execution flow. All code executed as part of the container is part of the container image. When analyzing the execution environment for software vulnerabilities (e.g. in an SBOM approach), only the image needs to be analyzed. No code can come to be executed except through code paths present in the image.
Discouraging expensive container modifications at startup (where installing software is the worst case), to reduce startup time. An ideal container image is fully set up and prepared and will directly execute the target application at startup.
Cleanly separating code from data allows for targeted backup policies. Immutable code images can be backed up separately (for example at the registry level) and do not need to be part of expensive data backups.

Fall-back strategy for read-only containers

Since read-only containers are not the Docker default, and many non-compliant images exist, you’ll need a strategy to handle these broken images. The general idea here is to create a new, local, image that is ready to run in read-only mode. We’ll have to bite one bullet, though: in this solution we do away with the separation between container build environment and run environment, at least in the simplest case. If you do have a fully set up container build environment and registry, you can use that instead of the integrated approach below.

Approach: Use Docker Compose to build a local image based on the original image that includes all container startup modifications in the image itself, and then execute this as a normal read-only container.

# compose.yml
services:
  app:
    pull_policy: build
    build:
      pull: true
      context: src/app
    read_only: true
    # Other configuration here

# src/app/Dockerfile
FROM original-app

# Do what the original container image would do on container start
RUN mkdir /foo/etc/pp

# Override the entrypoint to bypass the original container self-modification
ENTRYPOINT ["/usr/bin/app"]

Starting this Docker Compose project will build a custom image that includes the runtime modifications as part of the image creation, and then start this image in read-only mode. The exact modifications and command-line are specific to the individual case and may require some reverse engineering of the original image.

Or, of course, you can fully abandon the original image and just build a proper one yourself. Maybe send a pull request to the original maintainer then.

Making Good Bug Reports

Many, many years ago, this was with Bugzilla in the early 2000s, I got my first automated lecture on what constitutes a good bug report. I probably didn’t pay attention. Since then, I’ve seen this list countless times, in various levels of detail, across a broad array of systems:

What did you do?
What happened?
What did you expect to happen?

Over the last few years I’ve come to realize that this list is irreducible (drop one item and you lose important context) and represents a kind of deep wisdom:

What did you do? If we cannot see the steps that brought you into the situation, it’ll be hard to find the place in the program where it happens. It’ll also set up our mental model of the program in question to see what we think should happen.
Preferably this should be detailed and reliable enough to reproduce the problem on our side. Things that cannot be reliably reproduced are very hard to fix, because you’ll never truly know if they’re gone.
What happened? This gives context on what happened for you, which might be different for us, indicating some other issue. In some cases this is what we thought should happen, so this also gives a clear statement to set up the next point.
What did you expect to happen? Stating how your expectation differs from reality is what makes this a bug. You’re not reporting issues where the system does what you expected it should do. But this expectation might differ from what we were expecting. The issue need not be in the code or the implementation, but might be somewhere else. Maybe the documentation gave you a wrong idea of what should happen?

Sometimes a bug report can be succinct but still contain all three items: “I clicked on save. It did not save. I expected it to save.” Though in this case the first part really should be longer, because this is probably something that only happens under certain circumstances. And even if part 3 is only “I expected it to work”, that’s good to write out.

Bug reports consisting of a single screenshot, for example of an error message, are often not helpful. They, more or less, cover part 2, but leave out important context. It may not be obvious from the screenshot how to get there. And it’s as likely as not that we think that this is the expected behavior. You should state why you think this error message is, as it were, in error.

The three parts of a good bug report are interlocking. Like describing the way to the train station to a stranger. You’re not going to describe it as “Turn left second street, go right first street, go right third street.” You’re giving context: “Go down this street and turn left at the second intersection, right behind the flower shop. You should see the church in front of you, turn immediately right and go into the small alley. If you then turn right at the third street you should see the train station in the distance.”

This is redundant. But redundancy is good. It allows for error checking and correction. It allows for errors both in the environment and in the mental model or description of it.

So, repeat after me: What did you do? What happened? What did you expect to happen?

Docker Deployment Best Practices

Given: There’s a CI system that automatically builds docker images from your VCS (e.g. git). We use self-hosted gitlab.
Goal: Both initial and subsequent automated deployments to different environments (staging and production).

Rejected Approaches

Most existing blog articles and howtos for this use case, specifically in the context of gitlab, tend to be relatively simple, relatively easy, and very very wrong. The biggest issue is with root access to the production server. I believe that developers (and the CI/CD system) should not have full root access to the production system(s), to retain a semblance of separation in case of breaches. Yes, sure, a malicious developer could still check in bad code which might eventually get deployed to production, but there is (should be) a review process before that, and traces in the VCS.

And yet, most recommendations on how to do deployment with gitlab circle around one of two approaches:

Install a gitlab “runner” on your production server. That is, an agent which gets commands from the gitlab server and executes them. This runner needs full root access (or, equivalently, docker daemon access), thus giving the gitlab server (and anyone who has/gains control over it) full root access to the production system(s).
This approach also needs meticulous management of the different runners, since they are now being used not just for build purposes but also have a second, distinct, duty for deployment.
Use your normal gitlab runners that are running somewhere else, but explicitly give them root access to the target servers, e.g. with a remote SSH login.
Again, this gives everybody in control of the gitlab server full production access, as well as anybody in control of one of the affected runners. Usually this is made less obvious by “only” giving docker daemon access, but that’s still equivalent to full root access.

There’s variants on this theme, like using Ansible for some abstraction, but it always boils down to somehow making it so that the gitlab server is capable of executing arbitrary commands as root on the production system.

Our Approach

For container management we’re going to use docker compose, the new one, not docker-compose. A compose.yaml file (with extensions, see below) is going to fully describe the deployment, and compose will take care of container management for updates.

Ideally we want to divide the task into two parts:

Initial setup
Continuous delivery

For the initial setup there’s not a pressing reason for full automation. We’re not setting up new environments all the time. There’s still some best practices and room for automation, see below, but in general it’s a one-time process executed with high privileges.

The continuous updates on the other hand should be fast, automated, and, above all, restricted. An update to a deployed docker application does exactly one thing: pull new image(s) then restart container(s).

Restricted SSH keys for update deployment

Wouldn’t it be great if we had an agent on the production server that could do that, and only that? Turns out, we do! Using additional configuration in ~/.ssh/authorized_keys we can configure a public key authenticated login that will only execute a (set of) predefined command(s), and nothing else¹. And since sshd is already running and exposed to the internet anyway, we don’t get any new attack surface.

The options we need are:

restrict to disable, roughly, all other functionality
command="cd ...; docker compose pull && docker compose up -d" which will make any login with that key execute only this command (you’ll need to fill in the path to cd into).
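
Combined, the two options form a single authorized_keys entry in front of the public key. The helper below is a hypothetical convenience for assembling that line; it assumes the project path contains no spaces or double quotes, since it does no escaping.

```shell
#!/bin/sh
# Print an authorized_keys entry that only allows pull-and-restart.
# $1 = absolute path of the compose project, $2 = public key file.
restricted_key_line() {
    printf 'restrict,command="cd %s; docker compose pull && docker compose up -d" %s\n' \
        "$1" "$(cat "$2")"
}

# Example (run as the deploy user on the target host):
#   restricted_key_line /home/deploy-user/frobnicator pull-key.pub >> ~/.ssh/authorized_keys
```

Whatever command the client asks for when logging in with that key, sshd runs the configured one instead.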

Using docker compose

In order for this to seamlessly work, there’s some best practices to follow when creating the environment:

All container configuration is handled by the docker compose framework
- Specifically: docker compose up just works.
  No weird docker compose -f compose.foo.yaml -f compose.bar.yaml -e WTF_AM_I_DOING=dunno up incantations.
The docker compose configuration should itself be version controlled
The containers come up by themselves in a usable configuration, and can handle container updates gracefully
- For example in Django, the django-admin migrate call must be part of the container startup
- In general, updates must not require manually executing commands in the containers or the compose environment. You’re allowed to require one necessary initialization command on first setup, under extenuating circumstances only.
There’s also good container design (topic of a different blog post) with regards to separation of code and data

Good docker compose setup

There’s two ways to handle the main compose configuration of a project: As part of the git repository of one of the components, or as a separate git repository by itself.

The first approach applies if it’s a very simple project, maybe just one component. If it contains only the code you wrote, and possibly some ancillary containers like the database, then you’ll put the compose.yaml into the root directory of the main git repository. This also applies if your project consists of multiple components maintained by you, but it’s obvious which one is the main one (usually the most complex one).
Like if you have a backend container (e.g. Python wsgi), a frontend container (statically compiled HTML/JS, hosted by an nginx), a db container (standard PostgreSQL), maybe a cache, and some helper daemon (another Python project). Three of these are maintained by you, but the main one is the backend, so that’s where the compose.yaml lives.
For complex projects it makes sense to create a dedicated git repository that only hosts the compose file and associated files. This specifically applies if the compose file needs to be accompanied by additional configuration files to be mounted into the containers. These usually do not belong in your application’s git repository.

The idea here is that the main compose.yaml file (using includes is allowed) handles all the basic configuration and setup of the project, independent of the environment. Doing a docker compose up -d should bring up the project in some default state configured for a default environment (e.g. staging). Additional environment specific configuration should be placed in a compose.override.yaml file, which is not checked into git and which contains all the modifications necessary for a specific environment. Usually this will only set environment variables such as URL paths and API keys.
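
Since compose merges compose.yaml and compose.override.yaml on its own, a quick sanity check before the first docker compose up -d is to let it validate the merged result. A minimal sketch, with a made-up function name, assuming you run it from the project directory:

```shell
#!/bin/sh
# Validate the merged compose configuration of the current directory.
# compose.override.yaml is optional here; a note is printed if it is missing.
check_compose() {
    if [ ! -f compose.override.yaml ]; then
        echo "note: no compose.override.yaml, using defaults only" >&2
    fi
    # --quiet only validates and prints nothing on success
    docker compose config --quiet
}

# Example: cd frobnicator && check_compose && docker compose up -d
```

Dropping --quiet prints the fully merged configuration, which is handy for checking what an override actually changed.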

Additional points of note:

All containers should be configured read_only: true, possibly assisted by tmpfs: ["/run","/var/run/someapp"] or similar. If that’s not possible, go yell at the container image creator².
All configuration that is mounted from the outside should be mounted read-only
Data paths are handled by volumes
The directory name is the compose project name. That’s how you get the ability to deploy more than one instance of a project on the same host. The directory name should be short and to the point (e.g. frobnicator or maybe frobnicator-staging).
Ports in the main compose.yaml file are a problem, since port numbers are a global resource. A useful pattern is to not specify a port binding in the compose.yaml file and instead rely on compose.override.yaml for each deployment to specify a unique port for this deployment. That’s one of the few cases where it’s acceptable to absolutely require a compose.override.yaml for correct operation, and it must be noted in the README.
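
To check whether the read_only rule from the list above is actually being followed on a host, you can ask the docker daemon. This sketch relies on the HostConfig.ReadonlyRootfs field reported by docker inspect; the function name is made up.

```shell
#!/bin/sh
# List running containers whose root filesystem is still writable.
writable_containers() {
    docker ps --quiet | while IFS= read -r id; do
        docker inspect --format '{{.Name}} {{.HostConfig.ReadonlyRootfs}}' "$id" < /dev/null
    done | awk '$2 == "false" { print $1 }'
}

# Example: writable_containers
```

An empty result means every running container on the host has a read-only root filesystem.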

Putting It All Together

This example shows how to set up deployment for project transmogrify/frobnicator, hosted on gitlab at git.example.com, with the registry accessible as registry.example.com, to host deploy-host, using non-root (but still docker daemon capable) user deploy-user.

Preliminaries

On deploy-host, we’ll create an SSH public/private key pair to be used as deploy key for the git repository containing the main compose.yaml and configure docker pull access. This probably only needs to be done once for each target host.

ssh-keygen -t ed25519

Just hit enter for default filename (~/.ssh/id_ed25519) and no passphrase. Take the resulting public key (in ~/.ssh/id_ed25519.pub) and configure it in gitlab as a read-only deploy key for the project containing the compose.yaml (under https://git.example.com/transmogrify/frobnicator/-/settings/repository).

We’ll also need a deploy token for docker registry access. This should be scoped to access all necessary projects. In general this means you’ll want to keep all related projects in a group and create a group access token under https://git.example.com/groups/transmogrify/-/settings/access_tokens. Create the access token with a name of deploy-user@deploy-host, role Reporter and Scope read_registry.

Caveat 1: docker can only manage one set of login credentials per registry host. Either use non-privileged/user-space docker daemons separated by project (e.g. with different users on the deploy host, each one only managing one project), which is a topic of a different blog post. Or use a “Personal” Access Token for a global technical user which has access to all the necessary projects instead.
(There’s a third option: Create multiple .docker/config.json files and set the DOCKER_CONFIG environment variable accordingly. This violates the “docker compose up should just work” requirement.)

Caveat 2: docker really doesn’t want to store login credentials at all. There’s a couple of layers of stupidity here. Just do the following (note: this will overwrite all previously saved docker logins on this host, but you shouldn’t have any):

mkdir -p ~/.docker
echo '{"auths": {"registry.example.com": {"auth": ""}}}' > ~/.docker/config.json

Then you can do a normal login with the deploy token and it’ll work:

docker login registry.example.com
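
If you automate the preliminaries, the login can also be done without a prompt by feeding the token on standard input. This is a sketch under assumptions: the function name is made up, the token is assumed to sit in a file readable only by the deploy user, and which user name your registry expects alongside the token depends on the token type.

```shell
#!/bin/sh
# Log in to a registry without an interactive prompt.
# $1 = registry host, $2 = token user name, $3 = file containing the token.
registry_login() {
    docker login --username "$2" --password-stdin "$1" < "$3"
}

# Example: registry_login registry.example.com deploy-user ~/.deploy-token
```

Reading the token from a file keeps it out of the shell history and the process list.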

First deployment / Setup

Clone the repository and configure any overrides necessary. Then start the application.

git clone ssh://git@git.example.com/transmogrify/frobnicator.git
cd frobnicator
vi compose.override.yaml  # Or whatever is necessary
docker compose up -d

Your project should now be running. Finish any remaining steps (set up reverse proxy etc.) and debug whatever mistakes you made.

Set up for autodeploy

On deploy-host (or a developer laptop) generate another SSH key. We’re not going to keep it for very long, so do it like this:

ssh-keygen -t ed25519 -f pull-key  # Hit enter a couple of times for no passphrase

Also retrieve the SSH host keys from deploy-host (possibly through some other channel):

ssh-keyscan deploy-host

On deploy-host, add the following line to ~deploy-user/.ssh/authorized_keys (where ssh-ed25519 AAA... is the contents of pull-key.pub from step 1):

restrict,command="cd /home/deploy-user/frobnicator; docker compose pull && docker compose up -d" ssh-ed25519 AAA....

In gitlab, on group level, configure variables (https://git.example.com/groups/transmogrify/-/settings/ci_cd):

| Name | Value | Settings |
|---|---|---|
| `SSH_AUTH` | `-----BEGIN OPENSSH PRIVATE ...` (contents of `pull-key` from step 1) | File, Protected |
| `SSH_KNOWN_HOSTS` | results of step 2 | |
| `SSH_DEPLOY_TARGET` | `ssh://deploy-user@deploy-host` | |

You may use the environments feature of gitlab here, which will generally mean a different set of values per environment (and then choosing the environment in the job in the next step). Afterwards, delete the temporary files pull-key and pull-key.pub from step 1.

In your project’s or projects’ .gitlab-ci.yml file (this is in the code projects, not necessarily the project containing compose.yaml), add this (the publish docker job is outside of the scope of this post):

stages:
  - build
  - deploy

# ...

deploy docker:
  image: docker:git
  stage: deploy
  cache: []
  needs:
    - publish docker
  before_script: |
    mkdir -p ~/.ssh
    echo "${SSH_KNOWN_HOSTS}" > ~/.ssh/known_hosts
    chmod -R go= ~/.ssh "${SSH_AUTH}"
  script: ssh -i "${SSH_AUTH}" -o StrictHostKeyChecking=yes "${SSH_DEPLOY_TARGET}"

In order to handle multiple environments, you can also add

deploy docker:
  # ...
  rules:
    - if: $CI_COMMIT_BRANCH == "dev"
      variables:
        ENVIRONMENT: staging
    - if: "$CI_COMMIT_BRANCH =~ /^(master|main)$/"
      variables:
        ENVIRONMENT: production
  environment: "${ENVIRONMENT}"

Voila. Every time after a docker image has been built, a gitlab runner will now trigger a docker compose pull/up, with minimal security impact since that’s the only thing it can do.

Addenda

The preliminaries and initial setup can be automated with Ansible.

You can put the gitlab-ci configuration into a common template file that can be referenced from all projects. For example, we have a common tools/ci repository, so the only thing necessary to get auto deployment is to put

stages:
  - publish
  - deploy

include:
  - project: tools/ci
    file: docker-ssh.yml

into a project and do the deploy-host setup and variable definitions (well, and add the other include files that handle the actual docker image building).

¹ man authorized_keys (or man sshd where that alias is not installed), section “AUTHORIZED_KEYS FILE FORMAT”
² Their image is bad and they should feel bad.