This document refers to at least three projects:
This document was originally intended as a response to this pex issue, which describes one way to directly integrate dependency graphs into a package manager or build tool.
I want to create a file format which specifies dependencies from any package manager. This file would be resolved into a directory by locating, downloading, and building dependencies. The intent is to allow deploying an application as a single runner script, which would resolve all dependencies (including the application's source code) and then immediately execute the main entry point. This is intended to work like ipex, but for applications written in any language and depending on any code from any package manager. After resolving once, the application would reuse the output for subsequent invocations.
This file format's specification language would be evaluated by a resolver program, with the following requirements:
- A resolver of this format must be able to bootstrap all of the application's dependencies from scratch on all desired deployment platforms.
- A resolver of this format must be able to correctly isolate the application and its dependencies from anything already installed on the deployment node, including prior versions of the same application.
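Both requirements can be sketched by keying the resolved output directory on a content hash of the spec itself. The following is a minimal illustration only: the spec format, its field names, and the directory layout are all invented here, not a proposed standard.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical single-file dependency spec; every field name below is
# invented for illustration.
SPEC = {
    "entry_point": "app.main",
    "dependencies": [
        {"manager": "pip", "requirement": "tensorflow==2.5.0"},
        {"manager": "cargo", "requirement": "serde@1"},
    ],
}

def resolve_dir(spec: dict, root: Path = Path.home() / ".resolver") -> Path:
    """Return an isolated directory keyed by the spec's content hash.

    Hashing the spec addresses the isolation requirement: a changed spec
    (i.e. a new application version) resolves into a fresh directory,
    never colliding with prior versions or anything else on the node.
    """
    digest = hashlib.sha256(
        json.dumps(spec, sort_keys=True).encode()
    ).hexdigest()
    target = root / digest[:16]
    if not (target / ".complete").exists():
        target.mkdir(parents=True, exist_ok=True)
        # Bootstrapping would happen here: locate, download, and build
        # each dependency into `target` from scratch.
        (target / ".complete").touch()
    return target
```

Subsequent invocations hit the `.complete` marker and reuse the directory, matching the resolve-once-then-reuse behavior described above.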
If such a resolver program is feasible, then this text file format is hypothesized to demonstrate multiple improvements over many current deployment models. We have distinguished several parallel motivations, elaborated in the sections below.
The responsibility of resolving and bootstrapping dependencies is often unclearly and unevenly spread across:
- package managers
- build tools
- application developers
Typically, a package manager expects to interpret a dependency specification (e.g. `tensorflow==2.5.0`) from the application developer, while a build tool expects to invoke the package manager and collate the results into an executable form (e.g. `./pants build`). However, if a package manager like pip needs to invoke a C compiler, the job of providing that C compiler falls to the build tool, or eventually to the application developer themselves if the build tool does not exist or does not provide one. The package manager may even use its own build tool to prepare resolved packages (as in `pip`'s usage of `setuptools`), which is separate from any build tool invoking the package manager. The lack of clear ownership over who provides what occasionally leads these entities to implement overlapping functionality, which may be buggy, incomplete, or frustratingly slow.
Hypothesis: By producing an abstract specification for dependency resolution and a resolver to evaluate it, the repeated work currently performed across separate package managers, build tools, and individual applications can be unified, amortizing the painstaking and error-prone bootstrapping of install prerequisites for every possible platform and use case. This benefits every contributor to the build process:
- build tool developers can differentiate their work by fantastic UX, instead of spending their time supporting a finite number of languages and platforms.
- package manager developers can implement more powerful dependency models and more efficient packaging methods for their supported language, instead of imperfectly recreating dependency mechanisms from other tools.
  - manylinux specifically requires building on CentOS 7, and isn't able to represent builds from other OSes.
- application developers can choose the build infrastructure that best fits their budget and workflow, instead of jumping through hoops to get their code to run.
- the developers of such a resolver tool can implement it effectively by using whatever package managers or build tools they're most familiar with.
Additionally, by avoiding any dependence on container images, applications can be seamlessly composed together in any deployment scenario; composing two orthogonal container images, on the other hand, is often nontrivial.
Deploying many nontrivial applications requires downloading a package for each dependency, then exposing them to the application at runtime via a language-specific process. When an application is updated, some or all of these packages may also need to be updated. Deploying such an application requires either:
- performing separate deployments for each dependency (such as when `pip install` populates `site-packages/`), or
- producing a single executable file containing all of the dependencies along with the application code (such as a PEX file).
It is often preferred to deploy as an immediately-executable single file. However, this often means it is unclear how or even whether the application has changed across versions, unless the deployment format is introspectable. This can be confusing for end users and potentially introduces a security risk.
Hypothesis: A text file format to completely specify all dependencies and how they should be composed into an application improves the deployment experience in multiple ways:
- application developers can upload a single small text file for all their end users, making the entire deployment process significantly faster and less error-prone.
- end users can readily audit exactly when and how the application has been updated when it is redeployed, reducing the chance of reusing an old version and making it significantly easier to audit for malicious changes.
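The auditability claim can be made concrete: when the deployed artifact is a small text file, comparing two deployed versions reduces to an ordinary textual diff. A minimal sketch, where the spec contents are invented for illustration:

```python
import difflib

# Two hypothetical versions of a deployed spec file (invented contents).
OLD = "tensorflow==2.5.0\nrequests==2.25.1\n"
NEW = "tensorflow==2.5.0\nrequests==2.26.0\n"

def audit(old: str, new: str) -> list[str]:
    """Return only the changed lines between two deployed specs,
    dropping the unified-diff file headers and context lines."""
    return [
        line
        for line in difflib.unified_diff(
            old.splitlines(), new.splitlines(), lineterm=""
        )
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
    ]
```

Here `audit(OLD, NEW)` reports exactly one removed and one added requirement, which is the kind of at-a-glance update review that an opaque binary artifact cannot offer.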
Most applications are deployed by having the developer pull down all of the dependencies and collate them along with the application's source code into a format that can be executed by the end user without any build step. This approach allows the end user to avoid having to install any development tools in order to use the application. However, it also places the responsibility on the application developer to produce outputs which are compatible with every environment where an end user might want to execute the application, which becomes costly if that environment is proprietary software (like macOS).
Hypothesis: By representing all of the necessary transitive dependencies in a unified DAG, and representing locally-installed tools the same way, subgraphs can be swapped out for alternatives preferred on the local system. This allows any compilation process to be recreated on the deployment node, resulting in the following improvements:
- application developers avoid publishing a matrix of large platform-specific binaries upon each update, and no longer need to cross-compile or reproduce every possible runtime environment themselves.
- end users can use a compiler that maximizes the application's runtime performance on their local system, and can ensure that any downloads when the DAG is resolved will point at their own private repositories if need be.
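The subgraph-swapping idea can be sketched with a toy adjacency-list DAG. Every package name below (including `system-clang`) is invented for illustration, and a real resolver would additionally have to verify that the substitute satisfies the same interface constraints as the subtree it replaces:

```python
# Each node names its direct dependencies; a locally-installed tool can
# replace an entire subtree. The graph encoding is invented for
# illustration only.
DAG = {
    "app": ["numpy"],
    "numpy": ["gcc"],
    "gcc": ["binutils"],   # the subtree we'd rather not rebuild locally
    "binutils": [],
}

def swap_subgraph(dag: dict, old_root: str, new_root: str) -> dict:
    """Rewrite every edge pointing at `old_root` to point at `new_root`,
    then drop nodes no longer reachable from "app"."""
    rewritten = {
        node: [new_root if dep == old_root else dep for dep in deps]
        for node, deps in dag.items()
    }
    rewritten[new_root] = []  # locally-installed tools act as leaf nodes
    reachable, stack = set(), ["app"]
    while stack:
        node = stack.pop()
        if node not in reachable:
            reachable.add(node)
            stack.extend(rewritten.get(node, []))
    return {n: d for n, d in rewritten.items() if n in reachable}
```

Swapping `gcc` for a hypothetical `system-clang` prunes the `binutils` subtree entirely, which is the mechanism by which a deployment node can substitute its own preferred compiler.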
The rest of this document describes a strategy for implementing the above proposal.
There is a tool called spack (https://github.com/spack/spack) which has developed its own json/yaml-fungible format for representing dependency DAGs, e.g.:
```
# cosmicexplorer@breeze: ~/tools/spack 16:21:07
; spack spec -l py-setuptools
Input spec
--------------------------------
py-setuptools
Concretized
--------------------------------
y6p54tm [email protected]%[email protected] arch=linux-arch-skylake
xbxxina ^[email protected]%[email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87 arch=linux-arch-skylake
5tughpd ^[email protected]%[email protected]+shared arch=linux-arch-skylake
bcrfpdf ^[email protected]%[email protected] arch=linux-arch-skylake
sjyxmk2 ^[email protected]%[email protected] arch=linux-arch-skylake
ooe2ajj ^[email protected]%[email protected]+libbsd arch=linux-arch-skylake
ovy6w4s ^[email protected]%[email protected] arch=linux-arch-skylake
owk3gd4 ^[email protected]%[email protected] arch=linux-arch-skylake
46w3vl6 ^[email protected]%[email protected] arch=linux-arch-skylake
vvbfvkb ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
e4wxnwh ^[email protected]%[email protected] arch=linux-arch-skylake
aanv2bv ^[email protected]%[email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-arch-skylake
wtd53gl ^[email protected]%[email protected]~python arch=linux-arch-skylake
meslued ^[email protected]%[email protected]~pic arch=linux-arch-skylake
n7kyjrl ^[email protected]%[email protected]+optimize+pic+shared arch=linux-arch-skylake
66gbrsc ^[email protected]%[email protected] arch=linux-arch-skylake
4fiyzvd ^[email protected]%[email protected] patches=26f26c6f29a7ce9bf370ad3ab2610f99365b4bdd7b82e7c31df41a3370d685c0 arch=linux-arch-skylake
u2bmvyr ^[email protected]%[email protected]~docs+systemcerts arch=linux-arch-skylake
untmxcf ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-arch-skylake
dpn5pad ^[email protected]%[email protected]~docs patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-arch-skylake
fj77hne ^[email protected]%[email protected]+column_metadata+fts~functions~rtree arch=linux-arch-skylake
niezlo7 ^[email protected]%[email protected] arch=linux-arch-skylake
```
Using `spack spec --json` or `spack spec --yaml` produces the desired dependency graph for pex-tool/pex#1137.
Now, the above is really cool for lots of reasons, not least because it demonstrates how spack can expect to bootstrap an arbitrary python version from an arbitrary other python version (including e.g. building python 3 on a centos6 py2.6 container). Spack also supports a huge number of processor architectures, along with several other peculiar constraints of HPC (see the paper at https://tgamblin.github.io/pubs/spack-sc15.pdf).
However, that's not relevant at all to this issue, and in fact nothing about this issue involves the use of spack at all. I'd mostly like to bring to people's attention the extremely general API/language for specifying dependencies that spack has developed, which has the following features:
See my in-progress more-complete description at spack/build-si-modeling#2.
- Can represent the "concrete" result of a dependency resolution, as above. A "concretized" spec is a complete DAG where each node has an immutable `.dag_hash()` (in the left column of the above output). See the spack docs on this.
  - If the spec DAG in the above example had all of the `.dag_hash()` attributes magically removed, spack can re-resolve the package names/versions/variants from this spec DAG and deterministically recompute the exact same `.dag_hash()` for each node, assuming the state of the package repository is not modified in the meantime. See #1086, #1176, #1249, #1282, and pypa/pip#53.
  - spack has a buildcache concept for serving pre-built binaries. Each such binary package is defined by a concrete `spec.yaml`, which also happens to have a tarball alongside it.
- Can represent an "abstract" query against a package, where the query is itself a spec DAG, but which hasn't already computed a `.dag_hash()` for each of its nodes. This "query" formulation is used in multiple ways:
  - It can be a binary filter to match against a given concrete spec -- this is how spack checks whether a pre-built binary package is applicable to the current resolve. See #1093, #1202.
  - It can be applied incrementally to a much larger concrete DAG, in order to resolve a concrete sub-DAG which satisfies the query. This is the process of "concretization".
  - In the case of a failing resolve, this spec format can be used to represent the minimum "unsat core" which is causing a conflict. See #1200.
    - Spack currently solves for this unsat core for its dependency model using a logic solver (see spack/spack#19501), but pip is perfectly able to do the same with its own new resolver. pypa/pip#7819 is an example of manipulating the new resolver to do some really drastic things.
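The deterministic `.dag_hash()` recomputation described above is essentially a Merkle-tree property: a node's hash is derived from its own metadata plus its dependencies' hashes. Here is a minimal sketch of that property; this is not spack's actual algorithm, and the package metadata is invented for illustration:

```python
import hashlib

def dag_hash(node: dict) -> str:
    """Hash a node's own metadata together with the (sorted) hashes of
    its dependencies, so two structurally identical resolved sub-DAGs
    always recompute the same hash. NOT spack's actual algorithm."""
    child_hashes = sorted(dag_hash(dep) for dep in node.get("deps", []))
    payload = "|".join([node["name"], node["version"], *child_hashes])
    return hashlib.sha256(payload.encode()).hexdigest()[:7]

# Invented example nodes, mirroring the shape of the spec DAG above.
zlib = {"name": "zlib", "version": "1.2.11"}
python = {"name": "python", "version": "3.9.1", "deps": [zlib]}
```

Stripping every hash and recomputing yields identical values so long as the node metadata (the analogue of the package repository state) is unchanged, which is exactly the re-resolution guarantee described above.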
And to more directly respond to this goal from the pex issue:
> Its not clear if this should just be opaque text or more structured data.
- Can represent both concrete and abstract DAGs as either serializable json/yaml or as a fantastic command-line syntax (which looks like the output of `spack spec` above). It's this part in particular which I think makes the spack spec syntax really applicable and worth looking to directly as inspiration.
  - Among other things, the CLI syntax enables users to interactively re-formulate a query against their environment and re-run spack. An example session is shown at the bottom of this post. See pantsbuild/pants#7350.
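To illustrate why the command-line syntax composes so well with the abstract/concrete distinction, here is a heavily simplified parser for a tiny subset of that style of syntax (`name@version` plus `+`/`~` variant flags) together with a matching predicate. The grammar subset and field names are my own invention for illustration, not spack's implementation:

```python
import re

def parse_spec(text: str) -> dict:
    """Parse a simplified subset of the spec syntax: `name@version`
    followed by `+variant` / `~variant` flags. Handles only a tiny
    fraction of the real grammar; for illustration only."""
    m = re.match(r"([\w-]+)(?:@([\w.]+))?", text)
    variants = {v[1:]: v[0] == "+" for v in re.findall(r"[+~]\w+", text)}
    return {"name": m.group(1), "version": m.group(2), "variants": variants}

def satisfies(concrete: dict, query: dict) -> bool:
    """An abstract query matches a concrete node when every constraint
    it states agrees with the node; unstated fields are unconstrained."""
    if query["name"] != concrete["name"]:
        return False
    if query["version"] and query["version"] != concrete["version"]:
        return False
    return all(
        concrete["variants"].get(k) == v
        for k, v in query["variants"].items()
    )
```

An abstract spec simply leaves fields unset, and `satisfies` only checks the constraints the query actually states -- which is how a pre-built concrete package can be matched against an interactively reformulated query.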
The way we want pex to extract graph information from pip for this issue would be a prototype of the following:
Another reason I specifically mentioned spack here as an example is because, as described in its dependency model docs, spack doesn't magically infer the dependencies it consumes; they're explicitly specified in `package.py` files in the spack repo. So referring back to the initial command `spack spec -l py-setuptools` above -- while you might expect spack to incorporate each python package's dependencies, currently spack mostly scorches the earth and requires mildly-painstaking care to reproduce pip dependencies within its own package syntax (see the `package.py` for `setuptools`). Note that the package names are completely uncorrelated with the name in the package's `setup.py` (except by convention). Also, spack essentially resolves each python package in its own virtualenv (which the spack "concretizer" has no knowledge of).
However, these decisions actually make perfect sense if you recognize spack's primary use case is to be an extremely principled alternative to e.g. vendoring every C/C++ dependency in every project, or to using containers, neither of which allow composing a shared DAG from disparate sets of dependencies. Spack provides a ton of structure on top of that, but since C/C++ does not have any standardized package dependency specification, there's no immediate benefit to the spack project of trying to conform to another package manager's model, especially since I believe spack's model is currently a strict superset of every other tool (see spack/build-si-modeling#2).
The "ipex" lazy deployment mechanism of #789/pantsbuild/pants#8793 was a huge hit with Pex users at Twitter for several reasons, but I think the main one is this: like Pex itself, it's both a tool and a library, which gives it reliable behavior while also being composable into larger abstractions. In the case of ipex, this also means that engineers can really easily make changes to their deployment by modifying a single file in the repo: `ipex_launcher.py`. ipex's "hackable" or malleable quality was a direct result of building on top of Pex.
To that last point: spack/build-si-modeling#2 was an attempt to point out:
- spack's dependency model is extremely general.
- spack covers the foundations of the build graph that no other tool tries to do in such a general and successful way (nix comes close -- see spack/spack#21282).
In particular, spack/build-si-modeling#2 demonstrates that a truly general model of package resolution isn't just about dependencies -- we have to drill down into the way individual packages are installed in order to reason about their compatibilities!
Here's a quick summary of some tentative milestones, which get progressively more exciting:
- Serialize a fully reproducible resolution.
  - Solve #1086 using some json-fungible DAG specification syntax. Determine whether the abstract/concrete distinction can effectively represent a resolve request as well as a resolve lockfile.
    - In particular, pypa/pip#53 has some fantastic discussion about serializing a pip resolve -- one reason why that issue has languished is because so far, imho, ipex has been the only really compelling use case mentioned (pypa/pip#53 (comment)). I think the framing in this pex issue seems to demonstrate this concept is actually high-level enough to reasonably own within Pex (?).
- Bootstrap a build tool from scratch.
  - If pants wants to be able to "take over" after some bootstrapping, it has to be able to consume some representation of that bootstrapping.
    - Interoperability with other tools is precisely the goal of this serializable dependency graph format.
  - One way to produce a generic bootstrapping process for pants is to externalize the process of a pants build so another tool can perform it.
    - This is specifically where we would want to develop the ability to execute other resolvers in harmony!
- Represent the complete provenance of packages which consume dependencies from multiple package managers.
  - The provenance of many published packages (especially python native code) is extremely relevant to their actual compatibility, regardless of their stated compatibility. We would like to be able to express these dependencies in a single unified DAG. See coursier's example for JVM-only dependencies.
    - jnr-ffi is a way to define FFI bindings for JVM applications, which by definition will depend on multiple separate languages and/or package managers to build. This is currently impossible to express in any single package manager's model. See one example use case with Rust and Scala.
  - In particular, we could then represent the complete dependency graph of artifacts built from monorepos, so that they can be fungible with individual 3rdparty packages. Monorepos often have unpredictably-changing internal dependencies, which currently can't be expressed in any package manager. This means that consuming 3rdparty jars/wheels built from monorepo code (especially generated code such as thrift) can often break unpredictably without further provenance info.
    - python's manylinux is one example -- tensorflow has a `tensorflow-gpu` version, but support for AVX512 instructions (or anything more complex) requires compiling tensorflow by hand using bazel, which like pants is a monorepo build tool.
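The first milestone's abstract/concrete distinction can be sketched as two JSON-fungible documents -- a resolve request and the lockfile it concretizes into -- plus the check relating them. All field names, package versions, and hashes below are invented for illustration, not a proposed schema:

```python
# Hypothetical serializations of an abstract resolve request and the
# concrete lockfile produced from it (invented field names and values).
REQUEST = {"requirements": [{"name": "requests", "version": None}]}
LOCK = {
    "resolved": [
        {"name": "requests", "version": "2.26.0", "hash": "sha256:aaaa"},
        {"name": "urllib3", "version": "1.26.6", "hash": "sha256:bbbb"},
    ]
}

def lock_satisfies(request: dict, lock: dict) -> bool:
    """Check the abstract/concrete relationship: every requested package
    must appear in the lock, with a matching version when one is pinned.
    A `None` version means the request leaves it unconstrained."""
    by_name = {p["name"]: p for p in lock["resolved"]}
    return all(
        r["name"] in by_name
        and (r["version"] is None
             or by_name[r["name"]]["version"] == r["version"])
        for r in request["requirements"]
    )
```

The same document shape could serve as both the resolve request (versions mostly `None`) and the lockfile (every version and hash pinned), which is the property milestone 1 asks us to verify.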
Example command-line session with spack demonstrating how a user can:
- find a local version of `ncurses` and register it in spack's database.
- splice that `ncurses` (referring to it by its `.dag_hash()`) into the `py-setuptools` spec above.
- use "variants" (corresponding to "extras" in pip lingo) to prune entire subtrees.
```
# cosmicexplorer@breeze: ~/tools/spack 16:47:38
; spack external find ncurses
==> The following specs have been detected on this system and added to /home/cosmicexplorer/.spack/packages.yaml
[email protected]
# cosmicexplorer@breeze: ~/tools/spack 16:47:58
; spack find -l ncurses
==> No package matches the query: ncurses
# cosmicexplorer@breeze: ~/tools/spack 16:48:03 $? 1
; spack spec [email protected]
Input spec
--------------------------------
[email protected]
Concretized
--------------------------------
[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
# cosmicexplorer@breeze: ~/tools/spack 16:48:15
; spack spec -l [email protected]
Input spec
--------------------------------
[email protected]
Concretized
--------------------------------
rynos3x [email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
# cosmicexplorer@breeze: ~/tools/spack 16:48:23
; spack spec -l 'py-setuptools ^ [email protected]+optimizations~sqlite3~dbm ^ ncurses/rynos3x'
==> Error: No installed spec matches the hash: 'rynos3x'
# cosmicexplorer@breeze: ~/tools/spack 16:48:29 $? 1
; spack install [email protected]
==> Warning: Missing a source id for [email protected]
[+] /usr (external ncurses-6.2.20200212-rynos3xdk4umluuzsenowx5ptzvvquvm)
# cosmicexplorer@breeze: ~/tools/spack 16:48:45
; spack spec -l 'py-setuptools ^ [email protected]+optimizations~sqlite3~dbm ^ ncurses/rynos3x'
Input spec
--------------------------------
py-setuptools
^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
^[email protected]~dbm+optimizations~sqlite3
Concretized
--------------------------------
swycnin [email protected]%[email protected] arch=linux-arch-skylake
d3utxvg ^[email protected]%[email protected]+bz2+ctypes~dbm~debug+libxml2+lzma~nis+optimizations+pic+pyexpat+pythoncmd+readline+shared~sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87 arch=linux-arch-skylake
5tughpd ^[email protected]%[email protected]+shared arch=linux-arch-skylake
bcrfpdf ^[email protected]%[email protected] arch=linux-arch-skylake
sjyxmk2 ^[email protected]%[email protected] arch=linux-arch-skylake
ooe2ajj ^[email protected]%[email protected]+libbsd arch=linux-arch-skylake
ovy6w4s ^[email protected]%[email protected] arch=linux-arch-skylake
sawguem ^[email protected]%[email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-arch-skylake
dg4ebtg ^[email protected]%[email protected]~python arch=linux-arch-skylake
e4wxnwh ^[email protected]%[email protected] arch=linux-arch-skylake
ll7vwj7 ^[email protected]%[email protected]~pic arch=linux-arch-skylake
n7kyjrl ^[email protected]%[email protected]+optimize+pic+shared arch=linux-arch-skylake
rynos3x ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
66gbrsc ^[email protected]%[email protected] arch=linux-arch-skylake
4fiyzvd ^[email protected]%[email protected] patches=26f26c6f29a7ce9bf370ad3ab2610f99365b4bdd7b82e7c31df41a3370d685c0 arch=linux-arch-skylake
u2bmvyr ^[email protected]%[email protected]~docs+systemcerts arch=linux-arch-skylake
ph4z4gd ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-arch-skylake
dpn5pad ^[email protected]%[email protected]~docs patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-arch-skylake
unaosui ^[email protected]%[email protected] arch=linux-arch-skylake
b3jflct ^[email protected]%[email protected] arch=linux-arch-skylake
niezlo7 ^[email protected]%[email protected] arch=linux-arch-skylake
```