“It works on my machine.”
The origins of Docker 😉
“OK, we’ll ship your machine.”
The core value proposition of Linux containers (e.g. Docker) is to provide an isolated, repeatable execution environment for software. To this end, a container image provides a fully prepared root filesystem, together with a recommended execution command line, that are suitable for running one instance of the software, called a container.
The distinction is important: an image is an immutable artifact, fully checksummed, possibly signed, and used to create an arbitrary number of container instances that all behave identically when executed. The glue in between is the container runtime, which sets up containers from images so that they resemble mostly normal-looking computing systems capable of executing the software stored on their root filesystems.
Containers look like normal environments
One of the key ideas here is that the environment inside a container looks so much like a normal (Linux) environment that most software will run unaltered. In the general case, programs require no modification, and no knowledge of containers at all, to run successfully in a containerized environment. One concession the runtime makes to this execution model is that it allows writes to the container root filesystem (subject to normal access control permissions), since this is usually possible in normal environments. Obviously a container must not be allowed to write to the underlying image (affecting other containers based on the same image), so these writes are private to each container.
How each container runtime achieves this is an implementation detail that’s not relevant to the general discussion. In the case of Docker on Linux, it will use union mounts to create a stack with the common read-only image below and an empty container-specific read-write directory at the top. Writes go to the container-specific directory, while reads come either from there or from the image.
In fact, this mechanism is applied recursively: The “image” is also just a stack of union mounts of different layers. The temporary directory of a container can be committed to become the top-most layer in a new image. This is how container images are originally built in the Docker model.
This is, however, not the only conceivable implementation. It’s fully possible that a runtime would just unpack the container image into an empty directory, which then serves as the container root directory, with no union mount necessary.
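In the Docker-on-Linux case, the stacking can be sketched with a plain overlayfs mount. This is only an illustration of the mechanism, not Docker's actual invocation; all directory names here are made up, and the mount requires root privileges:

```shell
# Create the pieces of the stack (illustrative paths)
mkdir -p image-layer container-upper overlay-work merged

# Mount the union: reads fall through to the read-only lower layer,
# writes land in the container-specific upper directory
mount -t overlay overlay \
    -o lowerdir=image-layer,upperdir=container-upper,workdir=overlay-work \
    merged
```

Committing a container to a new image then amounts to turning the contents of the upper directory into a new read-only layer.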
Containers are updated by updating images
One container is always bound to exactly one image, which constitutes its root directory. There is no defined way to change (e.g. update) the image underlying a container. The solution is to just create a new container based on a new (updated) image and remove the old container. This works fine for software that is entirely stateless: the container is just an ephemeral artifact providing the necessary environment for the code, and can be started, stopped, and removed at will (usually under control of some container management system). This remains true even if the software keeps state: all mutable state should be stored in volumes (usually implemented as bind mounts) outside of the container (and image). Usually the runtime even provides temporary directories (on Linux: tmpfs) that are valid for only one container run and not persisted anywhere.
This convention provides for very clean separation of concerns:
- Executable code is in the image: read-only, checksummed, updates under control of the container manager
- Mutable state is on volumes: read-write, outside the container, can be backed up separately
- Temporary state is in tmpfs, not persisted and handled automatically
Containers that modify themselves are problematic
Except for one problem: the container root filesystem itself is still read-write. Badly written, or misconfigured, software could still store state in the container itself. Worse, it might even modify code paths in the container. This has two possible consequences:
- State stored in the container is lost on container re-creation (e.g. due to image update)
- The execution environment is no longer repeatable, since it might depend on a previous execution of the container (different from other instances of the same image)
The solution is simple, but comes with a slew of consequences: make the container root filesystem read-only. Docker has the `--read-only` command-line flag; Docker Compose has the `read_only: true` service configuration. In Docker, this has the effect of creating a container whose root filesystem consists of just the image, plus volume mounts, but cannot be written to. Self-evidently this is the right thing to do: all mutations should go either to mutable or temporary state, and the container itself should contain only code and immutable data and never needs to change. Personally, one of my biggest gripes with Docker is that this is not the default mode of operation.
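In Compose, enabling this is a one-line addition to the service definition (a minimal sketch; the service and image names are placeholders):

```yaml
# compose.yml
services:
  app:
    image: example-app
    read_only: true
```

The equivalent for a one-off container is `docker run --read-only example-app`.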
Strategies for handling read-only containers
Since read-only containers are not the Docker default, there are a lot of container images out there (many of them official Docker Hub images of their respective projects) that won’t work out of the box in read-only mode. And some won’t work at all.
Runtime directories
The most common (and legitimate) reason for a container image not working in read-only mode is that the software requires access to a writable `/run` or `/var/run` directory. Per the Filesystem Hierarchy Standard, this should contain boot-level temporary state, and it's fully appropriate for software to need to write there. Unfortunately the current container image specification has no way to declare that a container image wants these directories, so they cannot be provided by the runtime automatically.
Solution: find out which directories the application needs and use a `tmpfs` mount. For example, for PostgreSQL in Docker Compose: `tmpfs: ["/tmp", "/var/run/postgresql"]`.
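Put together, a read-only PostgreSQL service might look like this sketch (the image tag and volume name are illustrative; the tmpfs paths are the ones from above):

```yaml
services:
  db:
    image: postgres:16
    read_only: true
    tmpfs:
      - /tmp
      - /var/run/postgresql   # unix socket and pid file live here
    volumes:
      - db-data:/var/lib/postgresql/data   # mutable state stays on a volume
volumes:
  db-data:
```

Note how this matches the separation of concerns above: code in the image, mutable state on a volume, temporary state on tmpfs.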
Temporary state
`tmpfs` is also the solution to many other similar problems. Some software will want to write temporary state to weird directories, and you'll have to analyze the startup error message it gives to find out what it wants. For example, I have Keycloak running with `/opt/keycloak/data/tmp:mode=1777`. At this stage you should also distinguish whether it really is temporary state, or more likely mutable state to be kept across invocations and versions, which should go on a normal data volume.
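Compose accepts tmpfs mount options in the short syntax, so the Keycloak case above can be declared like this (a sketch; all other required Keycloak configuration is omitted):

```yaml
services:
  keycloak:
    image: quay.io/keycloak/keycloak
    read_only: true
    tmpfs:
      - /opt/keycloak/data/tmp:mode=1777   # world-writable scratch directory, as above
```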
Commingled code
Some software (this is especially common with PHP projects) likes to commingle code and data by writing to its own installation directory. Sometimes there are configuration directives you can set to make it write somewhere else, into a data volume.
In some cases this problem cannot be solved at all, WordPress being among them. In that case, the container image is just used as a template for initializing a data volume with the code, and then code updates happen only within the data volume.
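For the WordPress case, this pattern relies on Docker populating a named volume from the image content on first use; afterwards the code lives (and updates itself) on the volume. A sketch, with illustrative volume names:

```yaml
services:
  wordpress:
    image: wordpress
    read_only: true
    volumes:
      - wp-code:/var/www/html   # seeded from the image on first start, then self-updating
    tmpfs:
      - /tmp
volumes:
  wp-code:
```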
External configuration override
Some container images are prepared so that they apply external configuration to the installed software, by modifying configuration or startup files on container startup, before calling the target software itself. Preferably all such customization happens only in `/tmp/` (see above), or can be configured to happen only in such temporary directories. This may be difficult in some cases, so sometimes you'll want to apply the configuration by hand, externally to the container, and mount the resulting configuration file read-only into the container, disabling the built-in customization (e.g. when handling `/etc/nginx/`).
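For nginx, for example, one can render the configuration outside the container and mount it read-only, plus tmpfs for nginx's runtime directories (a sketch under these assumptions; the paths match the official Debian-based image):

```yaml
services:
  web:
    image: nginx
    read_only: true
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # externally managed configuration
    tmpfs:
      - /var/cache/nginx   # writable cache and temp directories
      - /run               # pid file
```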
Lost causes
And finally, some container images are built in batshit reckless mode, running something like `apt update && apt install foo && foo` (or equivalent) at startup. That is, they install the software or some of its dependencies only at container runtime, sometimes on every start. This of course throws out all the benefits of reproducibility and reliability: the container image doesn't even contain the software it's supposed to run. Your only choice is to not use these container images at all and use the fallback below. And maybe give feedback to the image creator.
Benefits of read-only containers
Having the container root filesystem read-only fits the general scheme of containerized deployment (and frankly should have been the default), and offers operational benefits:
- Implementing write-xor-execute at the filesystem level (in combination with all data mounts being set to `noexec`, which is the default). This is important for a consistent, repeatable, reproducible (in conjunction with all configuration data) execution flow. All code executed as part of the container is part of the container image. When analyzing the execution environment for software vulnerabilities (e.g. in an SBOM approach), only the image needs to be analyzed. No code can come to be executed except through code paths present in the image.
- Discouraging expensive container modifications at startup (where installing software is the worst case), to reduce startup time. An ideal container image is fully set up and prepared and will directly execute the target application at startup.
- Cleanly separating code from data allows for targeted backup policies. Immutable code images can be backed up separately (for example at the registry level) and do not need to be part of expensive data backups.
Fall-back strategy for read-only containers
Since read-only images are not the Docker default, and many non-compliant images exist, you’ll need a strategy to handle these broken images. The general idea here is to create a new, local, image that is ready to run in read-only mode. We’ll have to bite one bullet, though, and in this solution we’ll have to do away with the separation between container build environment and run environment, at least in the simplest case. If you do have a fully set up container build environment and registry, you can use that instead of the integrated approach below.
Approach: Use Docker Compose to build a local image based on the original image that includes all container startup modifications in the image itself, and then execute this as a normal read-only container.
```yaml
# compose.yml
services:
  app:
    pull_policy: build
    build:
      pull: true
      context: src/app
    read_only: true
    # Other configuration here
```
```dockerfile
# src/app/Dockerfile
FROM original-app
# Do what the original container image would do on container start
RUN mkdir /foo/etc/pp
# Override the entrypoint to bypass the original container self-modification
ENTRYPOINT /usr/bin/app
```
Starting this Docker Compose project will build a custom image that includes the runtime modifications as part of the image creation, and then start this image in read-only mode. The exact modifications and command-line are specific to the individual case and may require some reverse engineering of the original image.
Or, of course, you can fully abandon the original image and just build a proper one yourself. Maybe send a pull request to the original maintainer then.