@cosmicexplorer
Last active May 30, 2021 18:22
cross-cultural resolution

Preface

This document refers to at least three projects:

  1. the pex packaging tool
  2. the spack package manager
  3. the pants build tool

This document was originally intended as a response to this pex issue, which describes one way to directly integrate dependency graphs into a package manager or build tool.

Vision: Deploying a Text File

I want to create a file format which specifies dependencies from any package manager. This file would be resolved into a directory by locating, downloading, and building dependencies. The intent is to allow deploying an application as a single runner script, which would resolve all dependencies (including the application's source code) and then immediately execute the main entry point. This is intended to work like ipex, but for applications written in any language and depending on any code from any package manager. After resolving once, the application would reuse the output for subsequent invocations.
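The "resolve once, then reuse the output" behavior can be sketched as a content-addressed cache. This is only a minimal illustration under stated assumptions: `resolve_once`, the `resolve` callback, and the cache layout are hypothetical names invented here, not part of pex, ipex, or spack.

```python
import hashlib
import tempfile
from pathlib import Path


def resolve_once(spec_text: str, cache_root: Path, resolve) -> Path:
    """Resolve `spec_text` into a directory, reusing any earlier result.

    `resolve` is a hypothetical callback that populates the given directory
    (locating, downloading, and building dependencies).
    """
    # Key the output directory on the exact contents of the spec file, so an
    # edited spec never reuses output produced for a previous version.
    key = hashlib.sha256(spec_text.encode("utf-8")).hexdigest()
    out_dir = cache_root / key
    if out_dir.is_dir():
        return out_dir  # already resolved: go straight to execution

    # Build in a scratch directory and rename at the end, so an interrupted
    # resolve never leaves a half-populated cache entry behind.
    cache_root.mkdir(parents=True, exist_ok=True)
    scratch = Path(tempfile.mkdtemp(dir=cache_root, prefix="tmp-"))
    resolve(spec_text, scratch)
    scratch.rename(out_dir)
    return out_dir
```

A runner script would call something like this and then execute the entry point found inside the returned directory. Keying on the spec contents also keeps different versions of the same application in separate directories.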

Resolver Responsibilities

This file format's specification language would be evaluated by a resolver program, with the following requirements:

  1. A resolver of this format must be able to bootstrap all of the application's dependencies from scratch on all desired deployment platforms.
  2. A resolver of this format must be able to correctly isolate the application and its dependencies from anything already installed on the deployment node, including prior versions of the same application.

Motivation

If such a resolver program is feasible, then this text file format is hypothesized to demonstrate multiple improvements over many current deployment models. We have distinguished several parallel motivations:

  1. Reduce Duplicated Work
  2. Make Deployment Easier
  3. Make Deployment Better

We elaborate on each of these below.

Reduce Duplicated Work

The responsibility of resolving and bootstrapping dependencies is often unclearly and unevenly spread across:

  1. package managers
  2. build tools
  3. application developers

Typically, a package manager expects to interpret a dependency specification (e.g. tensorflow==2.5.0) from the application developer, while a build tool expects to invoke the package manager and collate the results into an executable form (e.g. ./pants build). However, if a package manager like pip needs to invoke a C compiler, the job of providing that C compiler falls to the build tool, or eventually to the application developers themselves if the build tool does not exist or does not provide one. The package manager may even use its own build tool to prepare resolved packages (as in pip's usage of setuptools), which is separate from any build tool invoking the package manager. This lack of clear ownership over who provides what occasionally leads these entities to implement overlapping functionality, which may be buggy, incomplete, or frustratingly slow.

Amortize Efforts to Bootstrap Install Prerequisites Across Platforms

Hypothesis: By producing an abstract specification for dependency resolution and a resolver to evaluate it, the repeated work currently performed across separate package managers, build tools, and individual applications can be unified, amortizing the painstaking and error-prone bootstrapping of install prerequisites for every possible platform and use case. This benefits every contributor to the build process:

  • build tool developers can differentiate their work through fantastic UX, instead of spending their time supporting a finite number of languages and platforms.
  • package manager developers can implement more powerful dependency models and more efficient packaging methods for their supported language, instead of imperfectly recreating dependency mechanisms from other tools.
  • application developers can choose the build infrastructure that best fits their budget and workflow, instead of jumping through hoops to get their code to run.
  • the developers of such a resolver tool can implement it effectively by using whatever package managers or build tools they're most familiar with.

Container Composition Considered Nontrivial

Additionally, by avoiding any dependence on container images, applications can be seamlessly composed together in any deployment scenario. Composing two orthogonal container images, on the other hand, is often nontrivial.

Make Deployment Easier

Deploying many nontrivial applications requires downloading a package for each dependency, then exposing them to the application at runtime via a language-specific process. When an application is updated, some or all of these packages may also need to be updated. Deploying such an application requires either:

  1. performing separate deployments for each dependency (such as when pip install populates site-packages/),
  2. producing a single executable file containing all of the dependencies along with the application code (such as a PEX file).

It is often preferred to deploy as an immediately-executable single file. However, this often means it is unclear how or even whether the application has changed across versions, unless the deployment format is introspectable. This can be confusing for end users and potentially introduces a security risk.

Reduce a Deployment into a Small, Auditable Text File

Hypothesis: A text file format to completely specify all dependencies and how they should be composed into an application improves the deployment experience in multiple ways:

  • application developers can upload a single small text file for all their end users, making the entire deployment process significantly faster and less error-prone.
  • end users can readily audit exactly when and how the application has been updated when it is redeployed, reducing the chance of reusing an old version and making it significantly easier to audit for malicious changes.

Make Deployment Better

Most applications are deployed on the assumption that the developer pulls down all of the dependencies and collates them along with the application's source code into a format that can be executed by the end user without any build step. This approach allows the end user to avoid having to install any development tools in order to use the application. However, it also places the responsibility on the application developer to produce outputs which are compatible with all of the possible environments where an end user would want to execute the application, which becomes costly if that environment is proprietary software (like macOS).

Specialize the Build Process to the Final Deployment Environment

Hypothesis: By representing all of the necessary transitive dependencies in a unified DAG, and representing locally-installed tools the same way, subgraphs can be swapped out for alternatives preferred on the local system. This allows any compilation process to be recreated on the deployment node, resulting in the following improvements:

  • application developers avoid publishing a matrix of large platform-specific binaries upon each update, and no longer need to cross-compile or reproduce every possible runtime environment themselves.
  • end users can use a compiler that maximizes the application's runtime performance on their local system, and can ensure that any downloads when the DAG is resolved will point at their own private repositories if need be.
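
Swapping a subgraph for a locally-preferred alternative can be sketched as follows. This is a toy model under stated assumptions: the `Node` type, the `swap_subgraph` helper, and the version strings are hypothetical and only illustrate the idea of replacing one node and pruning whatever it no longer needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    version: str
    deps: tuple = ()         # names of direct dependencies
    external: bool = False   # True if provided by the local system


def reachable(graph: dict, root: str) -> set:
    """Return the names of all nodes reachable from `root`."""
    seen, stack = set(), [root]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(graph[name].deps)
    return seen


def swap_subgraph(graph: dict, root: str, local: Node) -> dict:
    """Replace the node named `local.name`, then drop nodes that became unused."""
    swapped = dict(graph)          # leave the caller's graph untouched
    swapped[local.name] = local
    keep = reachable(swapped, root)
    return {name: node for name, node in swapped.items() if name in keep}


# Illustrative graph: app -> ncurses -> pkgconf (versions are placeholders).
graph = {
    "app": Node("app", "1.0", deps=("ncurses",)),
    "ncurses": Node("ncurses", "X.Y", deps=("pkgconf",)),
    "pkgconf": Node("pkgconf", "X.Y"),
}
system_ncurses = Node("ncurses", "system", external=True)
result = swap_subgraph(graph, "app", system_ncurses)
print(sorted(result))  # pkgconf is gone: the system ncurses needs no build deps
```

Because the locally-installed tool is represented with the same node type as everything else, the rest of the graph does not need to know whether a dependency was built or discovered.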

The rest of this document describes a strategy for implementing the above proposal.


Prior Art: Spack Spec Syntax

There is a tool called spack (https://github.com/spack/spack) which has developed its own json/yaml-fungible format for representing dependency DAGs, e.g.:

# cosmicexplorer@breeze: ~/tools/spack 16:21:07
; spack spec -l py-setuptools
Input spec
--------------------------------
py-setuptools

Concretized
--------------------------------
y6p54tm  [email protected]%[email protected] arch=linux-arch-skylake
xbxxina      ^[email protected]%[email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87 arch=linux-arch-skylake
5tughpd          ^[email protected]%[email protected]+shared arch=linux-arch-skylake
bcrfpdf              ^[email protected]%[email protected] arch=linux-arch-skylake
sjyxmk2                  ^[email protected]%[email protected] arch=linux-arch-skylake
ooe2ajj          ^[email protected]%[email protected]+libbsd arch=linux-arch-skylake
ovy6w4s              ^[email protected]%[email protected] arch=linux-arch-skylake
owk3gd4          ^[email protected]%[email protected] arch=linux-arch-skylake
46w3vl6              ^[email protected]%[email protected] arch=linux-arch-skylake
vvbfvkb                  ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
e4wxnwh                      ^[email protected]%[email protected] arch=linux-arch-skylake
aanv2bv          ^[email protected]%[email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-arch-skylake
wtd53gl              ^[email protected]%[email protected]~python arch=linux-arch-skylake
meslued                  ^[email protected]%[email protected]~pic arch=linux-arch-skylake
n7kyjrl                  ^[email protected]%[email protected]+optimize+pic+shared arch=linux-arch-skylake
66gbrsc              ^[email protected]%[email protected] arch=linux-arch-skylake
4fiyzvd          ^[email protected]%[email protected] patches=26f26c6f29a7ce9bf370ad3ab2610f99365b4bdd7b82e7c31df41a3370d685c0 arch=linux-arch-skylake
u2bmvyr          ^[email protected]%[email protected]~docs+systemcerts arch=linux-arch-skylake
untmxcf              ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-arch-skylake
dpn5pad                  ^[email protected]%[email protected]~docs patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-arch-skylake
fj77hne          ^[email protected]%[email protected]+column_metadata+fts~functions~rtree arch=linux-arch-skylake
niezlo7          ^[email protected]%[email protected] arch=linux-arch-skylake

Using spack spec --json or spack spec --yaml produces the desired dependency graph for pex-tool/pex#1137.

Generalization: Abstract vs Concrete DAGs

Now, the above is really cool for lots of reasons, not least because it demonstrates how spack can expect to bootstrap an arbitrary python version from an arbitrary other python version (including e.g. building python 3 on a centos6 py2.6 container). Spack also supports a huge number of processor architectures, along with several other peculiar constraints of HPC (see the paper at https://tgamblin.github.io/pubs/spack-sc15.pdf).

However, none of that is relevant to this issue, and in fact nothing about this issue requires using spack at all. I'd mostly like to bring to people's attention the extremely general API/language for specifying dependencies that spack has developed (see my in-progress, more complete description at spack/build-si-modeling#2). It has the following features:

  1. Can represent the "concrete" result of a dependency resolution, as above. A "concretized" spec is a complete DAG where each node has an immutable .dag_hash() (on the left column of the above output). See the spack docs on this.

    • If the spec DAG in the above example had all of its .dag_hash() attributes magically removed, spack could re-resolve the package names/versions/variants from this spec DAG and deterministically recompute the exact same .dag_hash() for each node, assuming the state of the package repository is not modified in the meantime. See #1086, #1176, #1249, #1282, and pypa/pip#53.
    • spack has a buildcache concept for serving pre-built binaries. Each such binary package is defined by a concrete spec.yaml, which also happens to have a tarball alongside it.
  2. Can represent an "abstract" query against a package, where the query is itself a spec DAG, but which hasn't already computed a .dag_hash() for each of its nodes. This "query" formulation is used in multiple ways:

    • It can be a binary filter to match against a given concrete spec -- this is how spack checks for whether a pre-built binary package is applicable to the current resolve. See #1093, #1202.
    • It can be applied incrementally to a much larger concrete DAG, in order to resolve a concrete sub-DAG which satisfies the query. This is the process of "concretization".
    • In the case of a failing resolve, this spec format can be used to represent the minimum "unsat core" which is causing a conflict. See #1200.
      • Spack currently solves for this unsat core for its dependency model using a logic solver (see spack/spack#19501), but pip is perfectly able to do the same with its own new resolver. pypa/pip#7819 is an example of manipulating the new resolver to do some really drastic things.
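
The concrete/abstract distinction above can be sketched in a few lines. This is a simplified model, not spack's implementation: the `Spec` type, the hashing scheme, and the `satisfies` function are assumptions made for illustration, and the package versions are placeholders. A concrete node's hash is derived from its own fields plus its children's hashes, and an abstract query is a partial spec checked against a concrete one.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    name: str
    version: str = ""      # empty means "any version" in an abstract spec
    variants: tuple = ()   # (name, enabled) pairs, e.g. (("shared", True),)
    deps: tuple = ()       # child Spec objects


def dag_hash(spec: Spec) -> str:
    """Hash a concrete node from its own fields plus its children's hashes."""
    payload = json.dumps([
        spec.name,
        spec.version,
        sorted(spec.variants),
        sorted(dag_hash(dep) for dep in spec.deps),
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:7]


def find_dep(spec: Spec, name: str):
    """Search the dependencies below `spec` for a node called `name`."""
    for dep in spec.deps:
        if dep.name == name:
            return dep
        hit = find_dep(dep, name)
        if hit is not None:
            return hit
    return None


def satisfies(concrete: Spec, query: Spec) -> bool:
    """True if every constraint in the abstract `query` holds in `concrete`."""
    if query.name != concrete.name:
        return False
    if query.version and query.version != concrete.version:
        return False
    if not set(query.variants) <= set(concrete.variants):
        return False
    # Assumption: a dependency constraint may match anywhere below this node.
    for wanted in query.deps:
        found = find_dep(concrete, wanted.name)
        if found is None or not satisfies(found, wanted):
            return False
    return True
```

Since the hash depends only on the node's contents, stripping the hashes and recomputing them yields the same values, which is the determinism property described in the first item. The same `Spec` shape serves as a filter when most of its fields are left empty.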

And to more directly respond to this goal from the pex issue:

Its not clear if this should just be opaque text or more structured data.

  3. Can represent both concrete and abstract DAGs as either serializable json/yaml or as a fantastic command-line syntax (which looks like the output of spack spec above). It's this part in particular which I think makes the spack spec syntax really applicable and worth looking to directly as inspiration.

The way we want pex to extract graph information from pip for this issue would be a prototype of the following:

Spack: A Visionary Lost at C/C++

Another reason I specifically mentioned spack here as an example is that, as described in its dependency model docs, spack doesn't magically infer the dependencies it consumes: they're explicitly specified in package.py files in the spack repo. So referring back to the initial command spack spec -l py-setuptools above: while you might expect spack to incorporate each python package's declared dependencies, currently spack mostly scorches the earth and requires mildly painstaking care to reproduce pip dependencies within its own package syntax (see the package.py for setuptools). Note that spack's package names are completely uncorrelated with the name in the package's setup.py (except by convention). Also, spack essentially resolves each python package in its own virtualenv (which the spack "concretizer" has no knowledge of).

However, these decisions actually make perfect sense if you recognize spack's primary use case is to be an extremely principled alternative to e.g. vendoring every C/C++ dependency in every project, or to using containers, neither of which allows composing a shared DAG from disparate sets of dependencies. Spack provides a ton of structure on top of that, but since C/C++ does not have any standardized package dependency specification, there's no immediate benefit to the spack project in trying to conform to another package manager's model, especially since I believe spack's model is currently a strict superset of every other tool's (see spack/build-si-modeling#2).

ipex Lazy Deployment

The "ipex" lazy deployment mechanism of #789/pantsbuild/pants#8793 was a huge hit with Pex users at Twitter for several reasons, but I think most of them come down to this: like Pex itself, it's both a tool and a library, which gives it reliable behavior while also being composable into larger abstractions. In the case of ipex, this also means that engineers can really easily make changes to their deployment by modifying a single file in the repo: ipex_launcher.py. ipex's "hackable" or malleable quality was a direct result of building on top of Pex.

To that last point: spack/build-si-modeling#2 was an attempt to point out:

  1. spack's dependency model is extremely general.
  2. spack covers the foundations of the build graph in a more general and successful way than any other tool attempts (nix comes close -- see spack/spack#21282).

In particular, spack/build-si-modeling#2 demonstrates that a truly general model of package resolution isn't just about dependencies--we have to drill down into the way individual packages are installed in order to reason about their compatibilities!

Tentative Milestones

Here's a quick summary of some tentative milestones, which get progressively more exciting:

  1. Serialize a fully reproducible resolution.

    • Solve #1086 using some json-fungible DAG specification syntax. Determine whether the abstract/concrete distinction can effectively represent a resolve request as well as a resolve lockfile.
    • In particular, pypa/pip#53 has some fantastic discussion about serializing a pip resolve. One reason that issue has languished is that so far, imho, ipex has been the only really compelling use case mentioned (pypa/pip#53 (comment)). I think the framing in this pex issue demonstrates that this concept is actually high-level enough to reasonably own within Pex (?).
  2. Bootstrap a build tool from scratch.

    • If pants wants to be able to "take over" after some bootstrapping, it has to be able to consume some representation of that bootstrapping.
    • Interoperability with other tools is precisely the goal of this serializable dependency graph format.
    • One way to produce a generic bootstrapping process for pants is to externalize the process of a pants build so another tool can perform it.
    • This is specifically where we would want to develop the ability to execute other resolvers in harmony!
  3. Represent the complete provenance of packages which consume dependencies from multiple package managers.

    • The provenance of many published packages (especially python native code) is extremely relevant to their actual compatibility, regardless of their stated compatibility. We would like to be able to express these dependencies in a single unified DAG. See coursier's example for JVM-only dependencies.
    • In particular, we could then represent the complete dependency graph of artifacts built from monorepos, so that they can be fungible with individual 3rdparty packages. Monorepos often have unpredictably-changing internal dependencies, which currently can't be expressed in any package manager. This means that consuming 3rdparty jars/wheels built from monorepo code (especially generated code such as thrift) can often break unpredictably without further provenance info.
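
As a rough sketch of what a single unified DAG spanning several package managers might look like, the snippet below serializes a resolved graph to a json lockfile and derives an order in which the nodes could be installed. Everything here is an assumption for illustration: the "manager:name" node ids, the field names, and the example packages are hypothetical and do not correspond to any existing lockfile format.

```python
import json


def to_lockfile(nodes: dict) -> str:
    """Serialize the graph deterministically so lockfiles diff cleanly."""
    return json.dumps(nodes, indent=2, sort_keys=True)


def install_order(nodes: dict) -> list:
    """Return node ids with every dependency listed before its dependents."""
    order, done = [], set()

    def visit(node_id, path=()):
        if node_id in done:
            return
        if node_id in path:
            raise ValueError("dependency cycle through " + node_id)
        for dep in sorted(nodes[node_id]["deps"]):
            visit(dep, path + (node_id,))
        done.add(node_id)
        order.append(node_id)

    for node_id in sorted(nodes):
        visit(node_id)
    return order


# Hypothetical graph: a pip package that needs a native library from spack
# and a generated jar resolved by coursier.
example_nodes = {
    "pip:example-app": {
        "manager": "pip", "name": "example-app", "version": "1.0",
        "deps": ["spack:example-lib", "coursier:example-thrift"],
    },
    "spack:example-lib": {
        "manager": "spack", "name": "example-lib", "version": "2.0", "deps": [],
    },
    "coursier:example-thrift": {
        "manager": "coursier", "name": "example-thrift", "version": "0.3",
        "deps": [],
    },
}

print(install_order(example_nodes))
print(to_lockfile(example_nodes))
```

Recording which manager produced each node is one way to keep provenance in the graph itself, so a consumer can tell a monorepo-built artifact from an ordinary 3rdparty package.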

Example CLI Session

Example command-line session with spack demonstrating how a user can:

  1. find a local version of ncurses and register it in spack's database.
  2. splice the ncurses (referring to it by its .dag_hash()) into the py-setuptools spec above.
  3. use "variants" (corresponding to "features" in pip lingo) to prune entire subtrees.
# cosmicexplorer@breeze: ~/tools/spack 16:47:38
; spack external find ncurses
==> The following specs have been detected on this system and added to /home/cosmicexplorer/.spack/packages.yaml
[email protected]
# cosmicexplorer@breeze: ~/tools/spack 16:47:58
; spack find -l ncurses
==> No package matches the query: ncurses
# cosmicexplorer@breeze: ~/tools/spack 16:48:03 $? 1
; spack spec [email protected]
Input spec
--------------------------------
[email protected]

Concretized
--------------------------------
[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake

# cosmicexplorer@breeze: ~/tools/spack 16:48:15
; spack spec -l [email protected]
Input spec
--------------------------------
[email protected]

Concretized
--------------------------------
rynos3x  [email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake

# cosmicexplorer@breeze: ~/tools/spack 16:48:23
; spack spec -l 'py-setuptools ^ [email protected]+optimizations~sqlite3~dbm ^ ncurses/rynos3x'
==> Error: No installed spec matches the hash: 'rynos3x'
# cosmicexplorer@breeze: ~/tools/spack 16:48:29 $? 1
; spack install [email protected]
==> Warning: Missing a source id for [email protected]
[+] /usr (external ncurses-6.2.20200212-rynos3xdk4umluuzsenowx5ptzvvquvm)
# cosmicexplorer@breeze: ~/tools/spack 16:48:45
; spack spec -l 'py-setuptools ^ [email protected]+optimizations~sqlite3~dbm ^ ncurses/rynos3x'
Input spec
--------------------------------
py-setuptools
    ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
    ^[email protected]~dbm+optimizations~sqlite3

Concretized
--------------------------------
swycnin  [email protected]%[email protected] arch=linux-arch-skylake
d3utxvg      ^[email protected]%[email protected]+bz2+ctypes~dbm~debug+libxml2+lzma~nis+optimizations+pic+pyexpat+pythoncmd+readline+shared~sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87 arch=linux-arch-skylake
5tughpd          ^[email protected]%[email protected]+shared arch=linux-arch-skylake
bcrfpdf              ^[email protected]%[email protected] arch=linux-arch-skylake
sjyxmk2                  ^[email protected]%[email protected] arch=linux-arch-skylake
ooe2ajj          ^[email protected]%[email protected]+libbsd arch=linux-arch-skylake
ovy6w4s              ^[email protected]%[email protected] arch=linux-arch-skylake
sawguem          ^[email protected]%[email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-arch-skylake
dg4ebtg              ^[email protected]%[email protected]~python arch=linux-arch-skylake
e4wxnwh                  ^[email protected]%[email protected] arch=linux-arch-skylake
ll7vwj7                  ^[email protected]%[email protected]~pic arch=linux-arch-skylake
n7kyjrl                  ^[email protected]%[email protected]+optimize+pic+shared arch=linux-arch-skylake
rynos3x              ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-arch-skylake
66gbrsc              ^[email protected]%[email protected] arch=linux-arch-skylake
4fiyzvd          ^[email protected]%[email protected] patches=26f26c6f29a7ce9bf370ad3ab2610f99365b4bdd7b82e7c31df41a3370d685c0 arch=linux-arch-skylake
u2bmvyr          ^[email protected]%[email protected]~docs+systemcerts arch=linux-arch-skylake
ph4z4gd              ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-arch-skylake
dpn5pad                  ^[email protected]%[email protected]~docs patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522 arch=linux-arch-skylake
unaosui                  ^[email protected]%[email protected] arch=linux-arch-skylake
b3jflct                      ^[email protected]%[email protected] arch=linux-arch-skylake
niezlo7          ^[email protected]%[email protected] arch=linux-arch-skylake