We’ve recently begun dockerizing our applications in an effort to make development and deployment easier. One of the challenges was establishing a good baseline Dockerfile which can maximize the benefits of Docker’s caching mechanism and at the same time provide minimal application images without any superfluous contents.
The basic installation flow for any Django project (let’s call it `foo`) is simple enough:

```sh
export DJANGO_SETTINGS_MODULE=foo.settings
pip install -r requirements.txt
python manage.py collectstatic
python manage.py compilemessages
python manage.py migrate
```
(Note: In this blog post we’ll mostly ignore the commands to actually get the Django project running within a web server. We’ll end up using `gunicorn` with WSGI, but won’t comment further on it.)
This sequence isn’t suitable for a Dockerfile as-is, because the final command in the sequence creates the database within the container image. Except for very specific circumstances this is likely not desired. In a normal deployment the database is located either on a persistent volume mounted from outside, or in another container completely.
First lesson: The Django `migrate` command needs to be part of the container start script, as opposed to the container build script. It’s harmless/idempotent if the database is already fully migrated, but necessary on the first container start, and on every subsequent update that includes database migrations.
Baseline Dockerfile
A naive Dockerfile and accompanying start script would look like this:
```dockerfile
# Dockerfile
FROM python:slim
ENV DJANGO_SETTINGS_MODULE foo.settings
RUN mkdir -p /app
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt gunicorn
RUN python manage.py collectstatic
RUN python manage.py compilemessages
ENTRYPOINT ["/app/docker-entrypoint.sh"]
```
```sh
#!/bin/sh
# docker-entrypoint.sh
cd /app
python manage.py migrate
exec gunicorn --bind '[::]:80' --worker-tmp-dir /dev/shm --workers "${GUNICORN_WORKERS:-3}" foo.wsgi:application
```
(The `--worker-tmp-dir` bit is a workaround for the way Docker mounts `/tmp`. See Configuring Gunicorn for Docker.)
This approach does work, but has two drawbacks:
- Large image size. The entire source checkout of our application will be in the final Docker image. Also, depending on the package requirements we may need to `apt-get install` a compiler or development package before executing `pip install`. These will then also be in the final image (and on our production machine).
- Long re-build time. Any change to the source directory will invalidate the Docker cache starting with line 6 in the Dockerfile. The `pip install` will be executed fully from scratch every time.
(Note: We’re using the slim Python Docker image. The alpine image would be even smaller, but its use of the `musl` C library breaks some Python modules. Depending on your dependencies you might be able to swap in `python:alpine` instead of `python:slim`.)
Improved Caching
Docker caches all individual build steps, and can use the cache when the same step is applied to the same current state. In our naive Dockerfile all the expensive commands are dependent on the full state of the source checkout, so the cache cannot be used after even the tiniest code change.
The common solution looks like this:
```dockerfile
# Dockerfile
FROM python:slim
ENV DJANGO_SETTINGS_MODULE foo.settings
RUN mkdir -p /app
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt gunicorn
COPY . .
RUN python manage.py collectstatic
RUN python manage.py compilemessages
ENTRYPOINT ["/app/docker-entrypoint.sh"]
```
In this version the `pip install` command on line 7 can be cached until the `requirements.txt` or the base image change. Re-build time is drastically reduced, but the image size is unaffected.
Building with setup.py
If we package up our Django project as a proper Python package with a `setup.py`, we can use pip to install it directly (and could also publish it to PyPI).
If the `setup.py` lists all project dependencies (including Django) in `install_requires`, then we’re able to execute (for example in a virtual environment):

```sh
pip install .
```
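For reference, a minimal `setup.py` for the `foo` project might look something like this (a sketch; the version number and the `dev` extra are placeholder assumptions, not part of the original project):

```python
# setup.py -- minimal sketch; the version number and the "dev" extra
# are placeholders, not part of the original project.
from setuptools import setup, find_packages

setup(
    name="foo",
    version="1.0.0",
    packages=find_packages(),
    include_package_data=True,  # needed for templates etc., see the addendum
    install_requires=[
        "Django",
        # ...all other runtime dependencies...
    ],
    extras_require={
        # optional extras end up in bracketed sections of the
        # generated requires.txt, which becomes relevant later
        "dev": ["pytest"],
    },
)
```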
This will pre-compile and install all our dependencies, and then pre-compile and install all our code, and install everything into the Python path. The main difference from the previous versions is that our own code is pre-compiled too, instead of just executed from the source checkout. There is little immediate effect from this: the interpreter startup might be slightly faster, because it doesn’t need to compile our code every time. In a web-app environment this is likely not noticeable.
But because our dependencies and our own code are now properly installed in the same place, we can drop our source code from the final container.
(We’ve also likely introduced a problem with non-code files, such as templates and graphics assets, in our project. They will by default not be installed by `setup.py`. We’ll take care of this later.)
Due to the way Docker works, all changed files of every build step cumulatively determine the final container size. If we install 150MB of build dependencies, 2MB of source code and docs, generate 1MB of pre-compiled code, then delete the build dependencies and source code, our image has grown by 153MB.
This accumulation is per step: Files that aren’t present after a step don’t count towards the total space usage. A common workaround is to stuff the entire build into one step. This approach completely negates any caching: Any change in the source files (which are necessarily part of the step) also requires a complete redo of all dependencies.
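For illustration, that single-step workaround typically looks something like this (a sketch, not a recommendation; it trades all caching for a smaller image):

```dockerfile
# Sketch of the single-step workaround: install build dependencies,
# build, and clean up within one RUN step, so the intermediate files
# never end up in a committed layer.
RUN apt-get update && apt-get install -y build-essential python3-dev \
    && pip install . \
    && apt-get purge -y build-essential python3-dev \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*
```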
Enter multi-stage builds: At any point in the Dockerfile we’re allowed to use a new `FROM` step to create a whole new image within the same file. Later steps can refer to previous images, but only the last image of the file will be considered the output of the image build process.
How do we get the compiled Python code from one image to the next? The Docker `COPY` command has an optional `--from=` argument to specify an image as source.
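In its simplest form the pattern looks like this (a sketch; stage names and paths are placeholders):

```dockerfile
# First stage: all the expensive, messy work happens here
FROM python:slim as builder
# ...install build tools, compile, install everything into /install...

# Second stage: a clean base image; only files copied in below
# become part of the final image
FROM python:slim
COPY --from=builder /install /usr/local
```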
Which files do we copy over? By default, `pip` installs everything into `/usr/local`, so we could copy that. An even better approach is to use `pip install --prefix=...` to install into an isolated non-standard location. This allows us to grab all the files related to our project and no others.
```dockerfile
# Dockerfile
FROM python:slim as common-base
ENV DJANGO_SETTINGS_MODULE foo.settings

# Intermediate image, all compilation takes place here
FROM common-base as builder
RUN pip install -U pip setuptools
RUN mkdir -p /app
WORKDIR /app
RUN apt-get update && apt-get install -y build-essential python3-dev
RUN mkdir -p /install
COPY . .
RUN sh -c 'pip install --no-warn-script-location --prefix=/install .'
RUN cp -r /install/* /usr/local
RUN sh -c 'python manage.py collectstatic --no-input'

# Final image, just copy over pre-compiled files
FROM common-base
RUN mkdir -p /app
COPY docker-entrypoint.sh /app/
COPY --from=builder /install /usr/local
COPY --from=builder /app/static.dist /app/static.dist
ENTRYPOINT ["/app/docker-entrypoint.sh"]
```
This will drastically reduce our final image size, since neither the `build-essential` packages nor any of the source dependencies are part of it. However, we’re back to our cache-invalidation problem: any code change invalidates all caches starting at the `COPY . .` step, requiring Docker to redo the full Python dependency installation.
One possible solution is to re-use the previous trick of copying the `requirements.txt` first, in isolation, to only install the dependencies. But that would mean we need to manage dependencies in both `requirements.txt` and `setup.py`. Is there an easier way?
Multi-Stage, Cache-Friendly Build
The command `setup.py egg_info` will create a `foo.egg-info` directory with various bits of information about the package, including a `requires.txt` listing all the dependencies.
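Running it locally looks something like this (the exact set of generated files can vary with your setuptools version):

```sh
# Generate the package metadata; among other files this produces
# foo.egg-info/requires.txt containing the dependency list
python setup.py egg_info
cat foo.egg-info/requires.txt
```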
We’ll execute `egg_info` in an isolated image, copy the generated `requires.txt` to a new image (in order to be independent from changes in `setup.py` other than the list of requirements), then install dependencies using that file. Up to here these steps are fully cacheable unless the list of project dependencies changes. Afterwards we’ll proceed in the usual fashion by copying over the remaining source code and installing it.
(One snag: The generated `requires.txt` also contains all possible extras listed in `setup.py`, under bracket-separated sections such as `[dev]`. `pip` cannot handle that, so we’ll use `grep` to cut the generated `requires.txt` at the first blank line.)
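For example, a generated `requires.txt` for a project with a `dev` extra might look like this (hypothetical contents):

```
Django
psycopg2

[dev]
pytest
```

The invocation `grep -e ^$ -m 1 -B 9999` matches the first blank line and prints up to 9999 lines of context before it, i.e. exactly the base dependencies without the extras sections.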
```dockerfile
# Dockerfile
FROM python:slim as common-base
ENV DJANGO_SETTINGS_MODULE foo.settings

FROM common-base as base-builder
RUN pip install -U pip setuptools
RUN mkdir -p /app
WORKDIR /app

# Stage 1: Extract dependency information from setup.py alone
# Allows docker caching until setup.py changes
FROM base-builder as dependencies
COPY setup.py .
RUN python setup.py egg_info

# Stage 2: Install dependencies based on the information extracted from the previous step
# Caveat: Expects an empty line between base dependencies and extras, doesn't install extras
# Also installs gunicorn in the same step
FROM base-builder as builder
RUN apt-get update && apt-get install -y build-essential python3-dev
RUN mkdir -p /install
COPY --from=dependencies /app/foo.egg-info/requires.txt /tmp/
RUN sh -c 'pip install --no-warn-script-location --prefix=/install $(grep -e ^$ -m 1 -B 9999 /tmp/requires.txt) gunicorn'

# Everything up to here should be fully cacheable unless dependencies change
# Now copy the application code
COPY . .

# Stage 3: Install application
RUN sh -c 'pip install --no-warn-script-location --prefix=/install .'

# Stage 4: Install application into a temporary container, in order to have both source and compiled files
# Compile static assets
FROM builder as static-builder
RUN cp -r /install/* /usr/local
RUN sh -c 'python manage.py collectstatic --no-input'

# Stage 5: Install compiled static assets and support files into clean image
FROM common-base
RUN mkdir -p /app
COPY docker-entrypoint.sh /app/
COPY --from=builder /install /usr/local
COPY --from=static-builder /app/static.dist /app/static.dist
ENTRYPOINT ["/app/docker-entrypoint.sh"]
```
Addendum: Handling data files
When converting your project to be installable with `setup.py`, you should make sure that you’re not missing any files in the final build. Run `setup.py egg_info` and then check the generated `foo.egg-info/SOURCES.txt` for missing files.
A common trip-up is the distinction between Python packages and ordinary directories. By definition a Python package is a directory that contains an `__init__.py` file (which can be empty). By default `setup.py` only installs Python packages. So make sure you’ve got `__init__.py` files also on all intermediate directory levels of your code (check in `management/commands`, for example).
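With GNU find, a quick way to list candidate directories that lack an `__init__.py` is something like this (a sketch; assumes your code lives under `foo/`, and will also list data directories that legitimately don’t need one):

```sh
# List directories under foo/ that do not contain an __init__.py
find foo -type d ! -exec test -e '{}/__init__.py' \; -print
```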
If your project uses templates or other data files (not covered by `collectstatic`), you need to do two things to get `setup.py` to pick them up:
- Set `include_package_data=True` in the call to `setuptools.setup()` in `setup.py`.
- Add a `MANIFEST.in` file next to `setup.py` that contains instructions to include your data files. The most straightforward way for a template directory is something like `recursive-include foo/templates *`.
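A complete `MANIFEST.in` might look something like this (a sketch; which directories to include, such as `locale` for translation files, depends on your project layout):

```
# MANIFEST.in -- sketch; adjust the paths to your own layout
recursive-include foo/templates *
recursive-include foo/locale *
```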
The section on Including Data Files in the setuptools documentation covers this in more detail.