LibGuides: Code Sharing: Reproducible environment sharing

"Dependency hell"

Let's say you wrote some code a few years ago, and when you open it up again, it doesn't run. You wrote this code! Why isn't it running?

If you've updated the language or any of the packages used, things may have broken. Your code likely depends on packages, which depend on other packages, and so on down to base system functions and core language functions. A script written on one operating system, for an earlier version of the language and with now-outdated package versions, may be quite difficult to run successfully on another machine. Even more perniciously, differences in operating systems and package versions may mean a script runs but produces different output!

In attempting to repair this problem, you may have to roll back to previous versions of these dependencies, but in doing so, other conflicts may be created. You are now in what is known as "dependency hell".

How can we avoid this when we write and share code, for the sake of both our future selves and others who reuse our code?

We can minimize dependencies, use conflict checkers to check our installed packages for dependency conflicts, and use tools like constraints files and careful documentation of version information. There are also ways to avoid the issue entirely by freezing everything needed to run your code in its completed state.
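As a rough illustration of documenting version information, the sketch below uses Python's standard `importlib.metadata` module to list every installed package with its exact version, in the `name==version` form that pip accepts in requirements and constraints files. The function name `freeze_installed_packages` is made up for this example, and the approach is only one simple way to record an environment.

```python
from importlib import metadata


def freeze_installed_packages():
    """Return sorted 'name==version' lines for the installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        # Skip damaged installs that have no recorded name or version.
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Redirect this output to a file and share it alongside the code.
    for line in freeze_installed_packages():
        print(line)
```

Running the script in the environment where the analysis works, and saving its output next to the code, gives future users a record of which versions were in place.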

Some of the ways this can be accomplished in a research setting include:

Docker
Code Ocean
Binder
ReproZip

Virtual environment sharing

Virtual environments are a common way to isolate the packages needed for a specific piece of software from the "base" installation. Different virtual environments can hold different versions of the base Python or R software and different packages, and an environment can be "locked" in its current state so that package updates do not break dependencies. Environments can also usually be shared alongside the code in some way.
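To make the idea concrete, here is a minimal sketch that uses the `venv` module from Python's standard library to create an isolated environment inside a project folder. The helper name `create_project_env` and the `.venv` folder name are assumptions for this example; the dedicated tools listed in this section add locking and sharing features on top of this basic mechanism.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir, env_name=".venv"):
    """Create an isolated environment inside project_dir and return its path."""
    env_path = Path(project_dir) / env_name
    # with_pip=False keeps this sketch fast and offline; set it to True
    # to bootstrap pip into the new environment.
    venv.create(env_path, with_pip=False)
    return env_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = create_project_env(tmp)
        # pyvenv.cfg records which base interpreter the environment uses.
        print((env / "pyvenv.cfg").read_text())
```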

Common virtual environment management systems for Python and R include:

pyenv (Python)
poetry (Python)
uv (Python)
renv (R)
groundhog (R)

Containerization

Containerization is one of the most important concepts in modern software development and cloud computing. Essentially, containerization brings a separate and independent computing environment along with a piece of software, allowing it to run on almost any platform. The "container" is a very small, self-contained system where the software runs. The underlying parts of the operating system that the application draws on are included in the container, so they will be the same for any user.

Multiple pieces of software running in containers on the same machine each use their own environment. In the context of reproducible research and code sharing, containerization packages dependencies with the code and removes the friction that arises because individual users may have very different machines and pre-installed software.

Docker
ReproZip
Binder

Docker

Docker is one of the most popular software containerization tools, allowing you to package an application and its dependencies to be run almost anywhere. Docker has a friendly desktop interface as well as the traditional console-based interface. Docker images are usually stored on Docker Hub, but can be uploaded to data and code repositories as well.
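A Docker image is built from a text recipe called a Dockerfile. As a hedged sketch, the Python function below writes out a minimal Dockerfile for a project consisting of one script and a requirements file. The function name `render_dockerfile`, the file names `analysis.py` and `requirements.txt`, and the choice of the `python:3.11-slim` base image are all assumptions for illustration; a real project will likely need more steps.

```python
def render_dockerfile(python_version="3.11", script="analysis.py",
                      requirements="requirements.txt"):
    """Return the text of a minimal Dockerfile for a single-script project."""
    lines = [
        # Start from an official Python base image with a pinned version.
        f"FROM python:{python_version}-slim",
        "WORKDIR /app",
        # Install dependencies first so this layer is cached between builds.
        f"COPY {requirements} .",
        f"RUN pip install --no-cache-dir -r {requirements}",
        # Copy the rest of the project and set the default command.
        "COPY . .",
        f'CMD ["python", "{script}"]',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile())
```

Saving the output as a file named `Dockerfile` in the project folder would allow the image to be built with `docker build` and started with `docker run`.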

Code Ocean

Code Ocean is a platform designed to enable easy computational reproducibility by giving researchers a place to deposit containerized "code capsules" and data for other users to view, run on the Code Ocean site, and download to modify or peer-review in depth. The platform gives free users a limited amount of compute per month, but several publishers (including IEEE, Elsevier, and Springer Nature's computational journals) have partnerships that give authors the option of depositing a reproducibility package as a Code Ocean "capsule." Capsules can also be exported as Dockerfiles. The Open Science Library (OSL) is one part of Code Ocean's broader cloud-based platform for reproducible software development, and anyone can deposit to the OSL for free.

Code Ocean will also mint a Digital Object Identifier (DOI) for permanent linking to the code and has a partnership with CLOCKSS, an established research backup organization.

Code Ocean

Binder

Binder is a tool that allows Jupyter notebooks and additional configuration files stored on GitHub to be opened in a Docker container on Binder's servers and interacted with in a live environment. This simplifies the process of environment sharing and allows for quick distribution of code for demos or review. However, while this is convenient for sharing working code in the short term, it still depends on the continued existence of both the GitHub and Binder services.
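Sharing a Binder environment usually comes down to sharing a launch link. The sketch below builds such a link for the public mybinder.org service from a GitHub owner, repository, and branch or tag. The function name `binder_launch_url` and the example owner and repository are hypothetical, and the sketch assumes the repository already contains the notebooks and configuration files Binder needs.

```python
from urllib.parse import quote


def binder_launch_url(owner, repo, ref="HEAD"):
    """Build a mybinder.org launch link for a public GitHub repository."""
    base = "https://mybinder.org/v2/gh"
    # Each part is percent-encoded in case it contains unusual characters.
    return f"{base}/{quote(owner)}/{quote(repo)}/{quote(ref)}"


if __name__ == "__main__":
    # "example-user" and "example-repo" are placeholders, not a real project.
    print(binder_launch_url("example-user", "example-repo"))
```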

ReproZip

ReproZip is an open-source tool created specifically for packaging and sharing reproducible computational research. Once an experiment has been packed with ReproZip, the package can be shared and then unpacked with ReproZip's unpackers for Docker and other popular platforms, or run in the cloud with ReproServer.

Embedded code

Some journals and authors are experimenting with embedded functional code within journal articles.

eLife: Welcome to a new ERA of reproducible publishing

Makefiles

Make is a long-established tool found in many Unix operating systems that builds software from a text document (a makefile) specifying how different pieces of source code should be executed. Makefiles are very lightweight and capable of doing things like downloading data from the internet, but the syntax is complex, and the concept of writing a workflow backwards from the complete product to its first dependencies can be confusing. They also lack the fully system-independent replicability of a container.
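The "backwards" logic can be easier to follow in a small model. The Python sketch below is not Make itself but a simplified imitation of its core idea: start from the final product, bring each dependency up to date first, and rerun a step only if its output is missing or older than one of its inputs. The file names, the `rules` layout, and the use of a dictionary of timestamps in place of real files are all assumptions for illustration.

```python
def build(target, rules, mtimes, log):
    """Bring `target` up to date, updating its dependencies first.

    rules:  {target: (list of dependencies, description of the step)}
    mtimes: {file: modification time}; files missing from it do not exist
    log:    list collecting the steps that were run, in order
    """
    if target not in rules:
        if target not in mtimes:
            raise FileNotFoundError(f"no rule to make {target!r}")
        return  # a plain source file: nothing to build
    deps, step = rules[target]
    for dep in deps:
        build(dep, rules, mtimes, log)
    # Rebuild when the target is missing or any dependency is newer.
    stale = target not in mtimes or any(
        mtimes[dep] > mtimes[target] for dep in deps
    )
    if stale:
        log.append(step)
        mtimes[target] = max(mtimes.values(), default=0) + 1


# A hypothetical workflow, written from the final product backwards.
rules = {
    "report.pdf": (["figure.png", "report.tex"], "render report"),
    "figure.png": (["data.csv", "plot.py"], "draw figure"),
    "data.csv": ([], "download data"),
}

if __name__ == "__main__":
    mtimes = {"report.tex": 1, "plot.py": 1}
    log = []
    build("report.pdf", rules, mtimes, log)
    print(log)
```

In this model, a second call with nothing changed runs no steps, and changing only `plot.py` reruns the figure and the report but not the download.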

Makefiles
Derivatives (Snakemake, etc.)

Why Use Make - Mike Bostock: Make is a dependency-tracking and software-building tool found in most Unix operating systems. This blog post explains how makefiles can be created during the research process and used to share a replicable workflow.