We just started using anaconda, rather than pip/virtualenv, to manage dependencies in the codebase for our data warehouse. The combination of pip and virtualenvs with requirements.txt files has served us well, but we switched because conda is more standard for analytics work, and because it’s by far the easiest way to make numpy and other data science libraries available without going through a bunch of compiler pain. pip/PyPI is making great progress with binary distribution via wheels, but it’s not quite there yet.

We use docker to deploy our data warehouse, via Elastic Beanstalk. This meant that in order to switch to anaconda, we had to figure out how to make it play nice with docker. Lucky for us, Continuum has official docker images for anaconda and miniconda. However, there’s an important use case that Continuum doesn’t fully cover: using conda environments.

In many ways, conda and virtualenv/pip solve similar problems, but the way they think about environments and dependencies has some subtle but important consequences.

With pip, a requirements.txt file is simply a list of dependencies for pip to install, regardless of context. In other words, pip will happily install dependencies from requirements.txt into the global python install. Separately, one may create a virtualenv, which pip treats the same as it would a global python install. The upshot of all this is that you can use your requirements.txt files with docker on a stock python image without bothering with virtualenvs. This makes docker with pip really easy!
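
For comparison, a bare-bones pip Dockerfile can look something like this (a sketch with made-up file names, not our actual setup):

FROM python:3

# pip installs straight into the image's global python;
# no virtualenv required
ADD requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt

ADD . /code
WORKDIR /code
CMD [ "python", "application.py" ]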

With conda, environments act a bit differently. Environments and dependency installs are interlinked, meaning that you can’t use an environment.yml file to manage global dependencies. The mental model here is that you export an environment.yml file from the environment you used to run some analysis, then share it with other scientists who want to reproduce your results. This is actually very similar to a common pip workflow centered around pip freeze. What this means is that, unlike with pip, using an environment.yml file necessitates using conda environments, even in docker.
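
Concretely, the roundtrip looks something like this (the environment the file describes comes from wherever you run the export):

# on the machine where the analysis was developed
conda env export > environment.yml

# anywhere else, recreate the same environment
conda env create -f environment.yml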

As far as I can tell, Continuum didn’t create their docker container with this workflow in mind. Their writeup suggests calling “conda install” directly, which works but doesn’t take advantage of the environment.yml file that we’re already using to manage dependencies locally. Luckily, with a little bit of work the official images can be used to run conda environments inside docker containers.

Like virtualenv, conda environments ultimately work by creating a directory tree for your environment and setting some environment variables to make everything work. Though we could probably recreate the proper environment variables with the Dockerfile ENV directive, the most straightforward way of using a conda env is to use bash to source the activation script and then exec the python process in the created environment:

/bin/bash -c "source activate your-environment && exec python application.py"

“your-environment”, in this case, and in other code snippets, is the name of the environment being used by conda. Your environment will very likely have a different name (ours does!).

Note that this is being run with bash, specifically, rather than sh. Conda explicitly supports bash (among other shells), and explicitly doesn’t support sh.

This approach generally works inside the docker image, though there are a few important issues that need to be addressed:

  • Docker by default runs shell commands with /bin/sh, which conda’s activate command doesn’t support. You can sidestep this either by setting the SHELL directive in your Dockerfile, or, if your version of docker doesn’t support SHELL (it’s relatively new, and the version used by Elastic Beanstalk doesn’t), by using the exec form of RUN and explicitly running /bin/bash -c. There’s a sketch of the SHELL approach just after this list.
  • The anaconda images set an ENTRYPOINT that uses tini. This is great if you want to run python in your CMD directly, but less great if you want to run a bash command instead. Luckily, this is easy to work around: redefine ENTRYPOINT as /bin/bash -c and combine it with an exec-form CMD whose first and only element is the bash command.
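
If your version of docker does support SHELL, the first workaround is a one-liner. This is just a sketch (we can’t use it on Elastic Beanstalk ourselves), with the same placeholder environment name as the other snippets:

SHELL [ "/bin/bash", "-c" ]

# shell-form RUN lines now go through bash, so activate works
RUN source activate your-environment && python setup.py develop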

Putting all of this together, the Dockerfile we use with Elastic Beanstalk looks something like this:

FROM continuumio/miniconda3

# Set the ENTRYPOINT to use bash
# (this is also where you’d set SHELL,
# if your version of docker supports this)
ENTRYPOINT [ "/bin/bash", "-c" ]

EXPOSE 5000

# Conda supports delegating to pip to install dependencies
# that aren’t available in anaconda or need to be compiled
# for other reasons. In our case, we need psycopg compiled
# with SSL support. These commands install prereqs necessary
# to build psycopg.
RUN apt-get update && apt-get install -y \
    libpq-dev \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Use the environment.yml to create the conda environment.
ADD environment.yml /tmp/environment.yml
WORKDIR /tmp
RUN [ "conda", "env", "create" ]

ADD . /code

# Use bash to source our new environment for setting up
# private dependencies—note that /bin/bash is called in
# exec mode directly
WORKDIR /code/shared
RUN [ "/bin/bash", "-c", "source activate your-environment && python setup.py develop" ]

WORKDIR /code
RUN [ "/bin/bash", "-c", "source activate your-environment && python setup.py develop" ]

# We set ENTRYPOINT, so while we still use exec mode, we don’t
# explicitly call /bin/bash
CMD [ "source activate your-environment && exec python application.py" ]
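
To sanity-check the image locally, a build-and-run along these lines works (the image name is made up; the port matches the EXPOSE above):

docker build -t your-image .
docker run -p 5000:5000 your-image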


That’s it! Now, I’m no docker wizard, so it’s possible I missed a nice trick or two. If you notice something, drop me a line in the comments!