sudo apt-get install packages for me, and so forth. Setting up projects took days. Since then, the NLP community has converged around the more unified transformers ecosystem, and the broader AI community is now mostly PyTorch-centric and Docker-free.
These are great steps forward, but we are still a long way from "frustration-free packaging". Even recently, I tried cloning someone's project and found myself unsuccessfully debugging their (now out-of-date) conda installation for several hours. Eventually, I relented and rewrote the installation using the methods that I will describe in this post. I was successful after just 10 more minutes. By using these approaches, I've also encountered far fewer issues in my own projects over the last two years.
In general: use a separate virtual environment for every project, stored inside the project directory ([my_project]/.venv), which is easier to maintain than a faraway ~/envs/[my_project].
(a) Don't rely on sudo or software already installed on your server. Your machine may have python3.12 and CUDA 12.3; other machines may not. List these in your project dependencies and rely entirely on the virtual environment instead. (Refer to the appendices later for instructions about including non-Python dependencies, like CUDA.)
(b) Declare your dependencies in a pyproject.toml. This is the current standard for Python project metadata (so don't use requirements.txt, setup.py, or setup.cfg). To start, list all dependencies that you plan to directly import (+ anything else not implicitly covered) and pin their exact versions (e.g. torch==2.5.1). These are the dependencies you are guaranteeing your codebase will work with.
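Note that each of these direct dependencies pulls in transitive dependencies of its own. Once the project is set up with uv (introduced below), uv tree prints the fully resolved dependency tree; for the pins above, it looks something like: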
my-project v0.1.0
├── numpy v2.2.2
├── torch v2.5.1
│ ├── filelock v3.16.1
│ ├── fsspec v2024.12.0
│ ├── jinja2 v3.1.5
│ │ └── markupsafe v3.0.2
│ ├── networkx v3.4.2
│ ├── nvidia-cublas-cu12 v12.4.5.8
│ ├── nvidia-cuda-cupti-cu12 v12.4.127
│ ├── nvidia-cuda-nvrtc-cu12 v12.4.127
│ ├── nvidia-cuda-runtime-cu12 v12.4.127
│ ├── nvidia-cudnn-cu12 v9.1.0.70
│ │ └── nvidia-cublas-cu12 v12.4.5.8
│ ├── nvidia-cufft-cu12 v11.2.1.3
│ │ └── nvidia-nvjitlink-cu12 v12.4.127
│ ├── nvidia-curand-cu12 v10.3.5.147
│ ├── nvidia-cusolver-cu12 v11.6.1.9
│ │ ├── nvidia-cublas-cu12 v12.4.5.8
│ │ ├── nvidia-cusparse-cu12 v12.3.1.170
│ │ │ └── nvidia-nvjitlink-cu12 v12.4.127
│ │ └── nvidia-nvjitlink-cu12 v12.4.127
│ ├── nvidia-cusparse-cu12 v12.3.1.170 (*)
│ ├── nvidia-nccl-cu12 v2.21.5
│ ├── nvidia-nvjitlink-cu12 v12.4.127
│ ├── nvidia-nvtx-cu12 v12.4.127
│ ├── setuptools v75.8.0
│ ├── sympy v1.13.1
│ │ └── mpmath v1.3.0
│ ├── triton v3.1.0
│ │ └── filelock v3.16.1
│ └── typing-extensions v4.12.2
└── torchvision v0.20.1
├── numpy v2.2.2
├── pillow v11.1.0
└── torch v2.5.1 (*)
(*) Package tree already displayed
(c) Keep project-specific environment variables in a .env file.
Modern tools like uv (as opposed to de facto tools like pip) are actually more holistic, like project managers, and can automate the above steps for us. I'll just discuss a small set of features. One thing you'll notice is that uv is "10-100x faster than pip". uv is also becoming extremely popular.
Install uv (and restart your shell):
curl -LsSf https://astral.sh/uv/install.sh | sh
uv init my-project --package --python "3.12"
cd my-project
uv add "torch==2.5.1" "torchvision==0.20.1" "numpy==2.2.2" # or "uv sync" if not adding deps
Also create a .env file: it's empty for now, but you can add any useful environment variables here, e.g. export TORCH_HOME=.cache/torch. We will load these every time we activate our project environment.
touch .env
Our package code lives in src/my_project, and Python files that you plan to run from the terminal go in scripts:
mkdir scripts
You can then import my_project into your scripts. For example, we can create a scripts/hello_world.py that calls main from src/my_project/__init__.py. We'll run this later.
import my_project
if __name__ == "__main__":
my_project.main()
my-project
├── .env
├── .git
├── .gitignore
├── pyproject.toml
├── .python-version
├── README.md
├── scripts
│ └── hello_world.py
├── src
│ └── my_project
│ └── __init__.py
├── uv.lock
└── .venv
src/my_project/__init__.py
def main() -> None:
print("Hello from my-project!")
.python-version
3.12
pyproject.toml
[project]
name = "my-project"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
authors = [ ... ]
requires-python = ">=3.12"
dependencies = [
"numpy==2.2.2",
"torch==2.5.1",
"torchvision==0.20.1",
]
[project.scripts]
my-project = "my_project:main"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
Lockfile (uv.lock)
uv resolves every transitive dependency and records its exact version in uv.lock. Commit this file: it is what lets others reproduce your environment precisely.
Virtual Environment
.venv
├── bin
│ ├── activate
│ ├── activate.bat
│ ├── activate.csh
│ ├── activate.fish
│ ├── activate.nu
│ ├── activate.ps1
│ ├── activate_this.py
│ ├── convert-caffe2-to-onnx
│ ├── convert-onnx-to-caffe2
│ ├── deactivate.bat
│ ├── f2py
│ ├── isympy
│ ├── numpy-config
│ ├── proton
│ ├── proton-viewer
│ ├── pydoc.bat
│ ├── python -> ~/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/bin/python3.12
│ ├── python3 -> python
│ ├── python3.12 -> python
│ ├── torchfrtrace
│ └── torchrun
├── CACHEDIR.TAG
├── lib
│ └── python3.12
│ └── site-packages
│ ├── _distutils_hack
│ ├── distutils-precedence.pth
│ ├── filelock
│ ├── filelock-3.16.1.dist-info
│ ├── fsspec
│ ├── fsspec-2024.12.0.dist-info
│ ├── functorch
│ ├── isympy.py
│ ├── jinja2
│ ├── jinja2-3.1.5.dist-info
│ ├── markupsafe
│ ├── MarkupSafe-3.0.2.dist-info
│ ├── mpmath
│ ├── mpmath-1.3.0.dist-info
│ ├── networkx
│ ├── networkx-3.4.2.dist-info
│ ├── numpy
│ ├── numpy-2.2.2.dist-info
│ ├── numpy.libs
│ ├── nvidia
│ ├── nvidia_cublas_cu12-12.4.5.8.dist-info
│ ├── nvidia_cuda_cupti_cu12-12.4.127.dist-info
│ ├── nvidia_cuda_nvrtc_cu12-12.4.127.dist-info
│ ├── nvidia_cuda_runtime_cu12-12.4.127.dist-info
│ ├── nvidia_cudnn_cu12-9.1.0.70.dist-info
│ ├── nvidia_cufft_cu12-11.2.1.3.dist-info
│ ├── nvidia_curand_cu12-10.3.5.147.dist-info
│ ├── nvidia_cusolver_cu12-11.6.1.9.dist-info
│ ├── nvidia_cusparse_cu12-12.3.1.170.dist-info
│ ├── nvidia_nccl_cu12-2.21.5.dist-info
│ ├── nvidia_nvjitlink_cu12-12.4.127.dist-info
│ ├── nvidia_nvtx_cu12-12.4.127.dist-info
│ ├── PIL
│ ├── pillow-11.1.0.dist-info
│ ├── pillow.libs
│ ├── pkg_resources
│ ├── setuptools
│ ├── setuptools-75.8.0.dist-info
│ ├── sympy
│ ├── sympy-1.13.1.dist-info
│ ├── torch
│ ├── torch-2.5.1.dist-info
│ ├── torchgen
│ ├── torchvision
│ ├── torchvision-0.20.1.dist-info
│ ├── torchvision.libs
│ ├── triton
│ ├── triton-3.1.0.dist-info
│ ├── typing_extensions-4.12.2.dist-info
│ ├── typing_extensions.py
│ ├── _virtualenv.pth
│ └── _virtualenv.py
├── lib64 -> lib
├── pyvenv.cfg
└── share
└── man
└── man1
└── isympy.1
How should you pick versions? I think it's usually safe to pin dependencies to their latest available version. And, you should choose a new-ish Python version supported by all your dependencies. For example, the latest PyTorch (torch==2.5.1) supports up to Python 3.12 (but not the very latest 3.13). So I usually go with --python "3.12".
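As an aside, if you later want to switch Python versions, uv can re-pin the project interpreter (this rewrites the .python-version file from earlier):
uv python pin 3.12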
Now, activate the environment:
source .venv/bin/activate && source .env
You need to run this in every new terminal. The python command will then exclusively refer to our virtual environment. For example:
python -c "import torch; print(torch.__file__)"
# [...] .venv/lib/python3.12/site-packages/torch/__init__.py
And try running our script:
python scripts/hello_world.py
# Hello from my-project!
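As another aside: uv run can execute commands inside the project environment without activating it first. uv also accepts an --env-file flag, so a rough equivalent of the activate-then-run workflow (treat this exact invocation as a sketch, not a prescription) is:
uv run --env-file .env python scripts/hello_world.py
# Hello from my-project!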
Finally, you can use uv add or uv remove to update dependencies. Or edit pyproject.toml and run uv sync.
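For example (tqdm here is purely illustrative, not a dependency this project needs):
uv add "tqdm==4.67.1"   # pins tqdm in pyproject.toml and updates uv.lock
uv remove tqdm          # drops the pin and re-syncs the environment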
Most packages that you pip install are pre-built and packaged as "wheels". These are usually easy to download and extract. Packages without wheels, however, must be compiled on your machine at install time, which can fail in environment-specific ways. For example, I once wanted easynmt for machine translation. It depends on fasttext, which builds from C++ code at install time. The build failed on my system, because my system's compiler was too old to support C++17. Instead, I installed the pre-built fasttext-wheel package (uv add easynmt fasttext-wheel) while excluding fasttext via my pyproject.toml:
[tool.uv]
override-dependencies = [
"fasttext ; sys_platform == 'never'",
]
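(The environment marker sys_platform == 'never' can never be true, so this override tells uv to skip installing fasttext everywhere; the pre-built fasttext-wheel package provides the same fasttext module and satisfies easynmt at runtime.)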
Some packages are even trickier. flash-attn is a good example: its standard installation (uv pip install flash-attn --no-build-isolation) is supposed to install the corresponding pre-built wheel (if one exists) or build the package otherwise. However, if the CUDA toolkit is not installed, the setup fails even when building is not necessary. This is not ideal, because installing the pre-built wheel does not actually need this toolkit (and neither does PyTorch). Instead, you can determine the CUDA version of your PyTorch installation (torch.version.cuda):
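For example, this one-liner prints the CUDA version that your torch build ships with:
python -c "import torch; print(torch.version.cuda)"
# e.g. 12.4
Then pick the matching pre-built wheel from the flash-attention GitHub releases and simply run: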
# observe keywords in the file name: cu12 ... torch2.5 ... cp312
uv add "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"
Now, say a colleague also has uv and wants to use your codebase. They'll simply clone your repo and run:
uv sync
uv will reproduce your virtual environment on their machine. After they activate the environment (source .venv/bin/activate && source .env), they will be able to run your python scripts.
That's it! Please reach out (mail [at] apoorvkh.com) if you have any questions or suggestions!
Appendix: CUDA and other non-Python dependencies
PyTorch wheels bundle the CUDA libraries they need, so a system-wide toolkit is usually unnecessary. You can check which CUDA toolkit version (if any) is installed on your system with nvcc -V, and the maximum CUDA version your driver supports with nvidia-smi.
One library, deepspeed, actually does need the whole CUDA toolkit and also the gcc & gxx compilers. In this case, I recommend using pixi instead. Like uv, it can install Python packages from PyPI; but it can also install non-Python dependencies from conda repositories into the virtual environment.
Install pixi with:
curl -fsSL https://pixi.sh/install.sh | bash
Like uv, pixi stores the environment inside your project directory ([my_project]/.pixi).
Since PyTorch 2.5.1 from PyPI is built with CUDA 12.4, we will install that version of the toolkit.
pixi init my-project --format pyproject
cd my-project
pixi project channel add "conda-forge" "nvidia/label/cuda-12.4.0"
pixi add "python=3.12.7" "cuda=12.4.0" "gcc=11.4.0" "gxx=11.4.0"
pixi add --pypi "torch==2.5.1" "torchvision==0.20.1" "deepspeed==0.16.2" "numpy==2.2.2"
Like before, make a scripts directory (mkdir scripts) and add the following environment variables to your .env file, so that libraries and builds resolve against the pixi environment (exposed via $CONDA_PREFIX) rather than the system:
export PYTHONNOUSERSITE="1"
export LIBRARY_PATH="$CONDA_PREFIX/lib"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib"
export CUDA_HOME="$CONDA_PREFIX"
Finally, activate your environment (in every new terminal) with:
pixi shell
source .env
That is, for pixi projects, you enter the environment with pixi shell instead of sourcing .venv/bin/activate.
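Once inside, you can sanity-check the toolchain; the expected outputs below are assumptions that follow from the versions pinned above:
nvcc -V         # should report "release 12.4"
gcc --version   # should report 11.4.0
python -c "import deepspeed; print(deepspeed.__version__)"   # 0.16.2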