- Author: Richard Wei
- Date: October 2018
This document is written for both the machine learning community and the Swift programming language design community, with a strong focus on language design.
- Introduction
- What is AD
- Why does Swift need AD?
- Why make AD first-class?
- Vision
- Part 1: Differentiable Types
- Part 2: Primitive Registration
- Part 3: Basic Differentiation
- Part 4: Generalized Differentiability
- Part 5: True Differential Operators
- Part 6: Generalized Types for Differentiation
- Part 7: Customizable Differentiation
- Acknowledgements
Automatic Differentiation (AD), also known as algorithmic differentiation, is a family of techniques used to obtain the derivative of a function. Functions can be represented as a composition of elementary operators whose derivatives are well-known. While partial derivatives can be computed through different techniques, the most common is a recursive application of the chain rule in the reverse direction, called reverse-mode AD. Reverse-mode AD computes vector-Jacobian products, i.e. partial derivatives with respect to each input parameter, and it has become a prerequisite for implementing gradient-based learning methods.
We aim to provide best-in-class AD, including the best optimizations, best error messages in failure cases, and the most flexibility and expressivity. To achieve this, we built support for AD right into the Swift compiler. This manifesto explains the design and vision of AD, and introduces the language extensions that will make Swift the world's first general-purpose differentiable programming language.
In basic calculus, differentiating a function of type ℝ → ℝ produces a function of type ℝ → ℝ that maps points onto their corresponding slopes. In the context of Swift, differentiating a function (Float) -> Float produces (Float) -> Float. Functions with multiple arguments, such as (Float, Float) -> Float, can be thought of as a function whose input domain is a product of those arguments' types, i.e. ℝ × ℝ, so the derivative of such a function has type (Float, Float) -> (Float, Float). According to this typing rule, the differential operator 𝒟 can be declared as a higher-order function, overloaded for each number of arguments, because a Swift function's argument list is not formally modeled as a tuple.
func 𝒟<T: FloatingPoint>(_ f: (T) -> T) -> (T) -> T
func 𝒟<T: FloatingPoint>(_ f: (T, T) -> T) -> (T, T) -> (T, T)
func 𝒟<T: FloatingPoint>(_ f: (T, T, T) -> T) -> (T, T, T) -> (T, T, T)
...
func f(_ x: Double, _ y: Double) -> Double {
return tanh(x + y)
}
𝒟(f) // (Double, Double) -> (Double, Double)
In numerical computing, users often write code that operates on high-dimensional mathematical objects. The basic typing rules that we defined on real scalars (ℝ) can be generalized for module-like types such as vectors, with extra consideration for shape. In vector calculus, the differentiation of a function f: ℝⁿ → ℝᵐ is defined per scalar, because there are multiple inputs and multiple outputs. Full differentiation of a vector-valued function results in a matrix, each of whose entries is a function that computes the partial derivative of an output scalar with respect to an input scalar. This matrix is called a Jacobian. In this definition, the Jacobian matrix has type ℝᵐˣⁿ. For simplicity, we will model it as a function that maps vectors in ℝⁿ to real-valued matrices in ℝᵐˣⁿ.
While it is challenging to define this function with full type safety in Swift because shapes cannot be generic parameters yet, we can define a differential operator as the following, specialized on shapes.
func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
where T: FloatingPoint
Calculating the Jacobian of a function is often unnecessary in gradient-based optimization methods. In practice, we care more about two byproducts of Jacobian calculation that are significantly easier to compute than the Jacobian itself: vector-Jacobian products and Jacobian-vector products. In these terms, "vector" refers to a vector of partial derivatives that is to be chained with the Jacobian by left-multiplication or right-multiplication. As we explain this chaining next, we discuss where Automatic Differentiation comes into the picture.
When we let a one-hot row vector eᵢ ∈ ℝ¹ˣᵐ (all zeros except a 1 at position i) left-multiply a Jacobian matrix of type ℝᵐˣⁿ, we are selecting one row in the matrix, which is exactly the gradient of fᵢ evaluated at x, i.e. ∇fᵢ(x). When the vector v represents the gradient of another function g: ℝᵐ → ℝ at f(x), namely v = ∇g(f(x)), then the vector-Jacobian product represents ∇(g ∘ f)(x). The linear function that takes a vector and left-multiplies it with the Jacobian is also called a pullback. We can define this function in Swift as the higher-order function shown below. The body of this function can be defined in terms of 𝒟, the differential operator that returns a Jacobian.
func pullback<T: FloatingPoint>(
    of f: @escaping (Vector2<T>) -> Vector3<T>,
    at x: Vector2<T>
) -> (Vector3<T>) -> Vector2<T> {
return { adjoint in matmul(adjoint, 𝒟(f)(x)) }
}
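For example (a usage sketch; f is some (Vector2<Float>) -> Vector3<Float> function, x is a point in its domain, and Vector3 is assumed to be expressible by an array literal):

let pullbackOfF = pullback(of: f, at: x)  // (Vector3<Float>) -> Vector2<Float>
// Left-multiplying with a one-hot adjoint selects one row of the Jacobian,
// i.e. the gradient of the first output of `f` at `x`.
pullbackOfF([1, 0, 0])                    // ∇f₀(x)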
However, when computing gradients or general vector-Jacobian products, we do not need to compute the Jacobian at all: Automatic Differentiation is here to help.
The chain rule of differentiation can be interpreted in left-associative order, i.e. accumulating each function's partial derivatives from the final output, eventually reaching each input.
Similarly, when we let a column vector v ∈ ℝⁿˣ¹ right-multiply a Jacobian value matrix of type ℝᵐˣⁿ, the result is a vector whose elements are exactly the directional derivatives of each fᵢ evaluated at x in direction v. The linear function that takes a vector and right-multiplies the Jacobian value matrix with it is called a differential, and it can also be defined in Swift as a higher-order function in terms of 𝒟.
func differential<T: FloatingPoint>(
    of f: @escaping (Vector2<T>) -> Vector3<T>,
    at x: Vector2<T>
) -> (Vector2<T>) -> Vector3<T> {
return { tangent in matmul(𝒟(f)(x), tangent) }
}
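A usage sketch (same hypothetical f and x as above; Vector2 is assumed to be expressible by an array literal):

let differentialOfF = differential(of: f, at: x)  // (Vector2<Float>) -> Vector3<Float>
// Right-multiplying with a basis tangent vector recovers one column of the Jacobian:
// the directional derivatives of every output along the first input axis.
differentialOfF([1, 0])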
Just like vector-Jacobian products, Jacobian-vector products are easy to compute using Automatic Differentiation. By simply applying the chain rule of differentiation from an input, we will accumulate each function's partial derivatives and reach each output.
AD has a rich background. For an in-depth introduction, here's some great documentation:
- Introduction to Automatic Differentiation
- Automatic differentiation in machine learning: a survey
- The simple essence of automatic differentiation
Swift is a new programming language in the machine learning space. Recently, the Swift for TensorFlow project brought the full power of a machine learning framework into the Swift programming language. Numerical computing has a very different set of requirements than application development and systems development, and we believe that Swift needs to better address those requirements and improve the usability of numerical software. One of the most important building blocks in machine learning and numerical computing is the ability to differentiate math code. Automatic Differentiation has been implemented in many languages, but because of language constraints and design trade-offs, many existing AD systems have limitations. We would like to take this opportunity to improve Swift, and demonstrate what Swift can offer in all areas of numerical computing in the presence of a compiler and a static type system.
Automatic Differentiation has been a research topic in scientific computing and high-performance computing for nearly half a century. Traditional tools such as OpenAD, TAPENADE and ADIFOR transform existing source code. There are many advanced techniques that improve the performance of derivatives written in FORTRAN, but these tools have not gained wide adoption in the machine learning community. More recent AD systems like Stalin∇ (pronounced Stalingrad, available as a dialect of Scheme) achieved good usability by integrating the differential operator into the language, and are equipped with a complete set of AD features (such as forward/reverse, nested AD, Hessians, Jacobians, directional derivatives and checkpointing). Along with libraries such as DiffSharp (available in F#) and ad (available in Haskell), they combine AD closely with functional programming languages.
Researchers in the machine learning community have built many library implementations of AD in Python and C++, including Autograd, TensorFlow, PyTorch, etc.
While Automatic Differentiation is an integral part of any machine learning framework, traditional designs and implementations of AD have limitations. Some of these libraries are implemented as a transformation on a standalone DSL (a graph) with a closed set of operators. Others are implemented using operator overloading directly on a subset of the source language. Although these libraries have gained wide adoption, the ones that leverage ahead-of-time AD do not expose an easy-to-use programming model, and the ones that have a friendlier programming model lack static analysis to perform more optimized AD.
Recent projects such as Tangent, Myia, and Zygote.jl based their AD upon source code transformation (SCT), a technique that was common in advanced AD systems before the deep learning era such as Stalin∇. The first two libraries parse a Python subset into ASTs and transform a function to its derivatives either in AST or in a functional IR, and Zygote hooks into the Julia compiler and transforms Julia's IR directly. These tools are pushing the boundaries of dynamic languages.
We would like our AD system to feel native and expressive. AD in Swift aims to solve real-world usability problems by providing the best generalizations, best error messages in failure cases, composable differential operators, and fully customizable types and derivatives. To achieve this, we built support for AD right into the Swift language. Even though AD has been incubated as part of the Swift for TensorFlow project, we believe its importance and impact is beyond machine learning, so we decided to propose it eventually through Swift Evolution into the core language.
Swift will be the world's first general-purpose differentiable programming language.
We expect Swift's language-integrated AD to be super easy to use in the context of machine learning, control in robotics, and scientific computing. AD is a general language feature that works seamlessly with third-party libraries such as TensorFlow.
struct Parameters: Differentiable, ParameterGroup {
var w1 = Tensor<Float>(randomNormal: [784, 30])
var b1 = Tensor<Float>(zeros: [30])
var w2 = Tensor<Float>(randomNormal: [30, 10])
var b2 = Tensor<Float>(zeros: [10])
}
var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent()
for (x, y) in minibatches {
let grads = gradient(at: params) { params in
let h1 = tanh(matmul(x, params.w1) + params.b1)
let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
let loss = (y - ŷ).squared().mean()
print("Loss is \(loss)")
return loss
}
optimizer.fit(¶ms, gradients: grads)
}
We want our AD system to be fully extensible to the point where users can request derivatives of a function taking their own user-defined numeric types, and even use this feature to implement structure-dependent algorithms such as tree-recursive neural networks. Therefore, when performing AD, Swift makes no special assumptions about individual math functions or the types it should support. We enable library designers and developers to easily define any differentiable type or function, all in pure Swift code.
Swift supports protocol-oriented programming and first-class value
semantics. AD is deeply
integrated with value types and has full extensibility via protocol
conformances. The user can make their custom data structures differentiable
simply by declaring a conformance to the Differentiable protocol:
extension MyType: Differentiable {
...
}
Or make an obviously non-differentiable function differentiable by using the
@differentiable
attribute, specifying a "tangent" function for computing its
Jacobian-vector products, or an "adjoint" function for computing its
vector-Jacobian products.
@differentiable(tangent: tangentFoo, adjoint: adjointFoo)
func foo(_ x: Float) -> Float {
return Float(Int(x)) // obviously non-differentiable
}
func tangentFoo(_ x: (Float, Float), originalResult: Float) -> Float {
// Insert custom code to compute the directional derivative
}
func adjointFoo(_ x: Float, originalResult: Float, adjoint: Float) -> Float {
// Insert custom code to compute the gradient
}
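For instance, the placeholder bodies above could be filled in as follows (a sketch; returning zero is just one reasonable convention the author of foo might choose for a function that is flat almost everywhere):

// `Float(Int(x))` is flat almost everywhere, so both the directional derivative
// and the gradient can reasonably be defined as zero.
func tangentFoo(_ x: (Float, Float), originalResult: Float) -> Float {
    return 0
}
func adjointFoo(_ x: Float, originalResult: Float, adjoint: Float) -> Float {
    return 0
}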
With fully customizable data structures and derivatives, everything should feel
native in the language. In addition, differential operators are functional and
composable, and differentiability is naturally integrated in the type system.
All differential operators are defined in Swift, and developers can create their
own differential operators by composing existing ones. For example, the user can
use the "forward-on-reverse" approach to compute Hessian-vector
products, where the hvp(at:in:)
operator is defined as a native Swift function. The @autodiff(order: 2)
attribute in the closure type
signature marks the closure argument as being differentiable up to at least the
2nd order, so that the caller of hvp(at:in:)
will implicitly differentiate the actual closure argument as needed.
func hvp<T: Differentiable, R: FloatingPoint>(
at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
return differential(at: x, in: gradient(of: f))
}
By building first-class AD into the programming language, we can provide better diagnostics about differentiability and numeric stability than any dynamic language, all at compile-time.
test.swift:58:10: error: function is not differentiable
return #gradient(funcToDiff)(x)
^ ~~~~~~~~~~
test.swift:54:10: note: expression is not differentiable
return middle2(x)
^
test.swift:50:10: note: when differentiating this function call
return middle(x)
^
test.swift:46:10: note: when differentiating this function call
return nested(y)
^
In common AD libraries, there are two differentiation styles: functional and imperative.
Style | Syntax | Meaning |
---|---|---|
Functional | let 𝝯f = gradient(of: f); 𝝯f(x) | Differentiating a function |
Imperative | let y = f(x); gradient(of: y, wrt: x) | Differentiating code traced through data flow |
Functional-style AD transforms one function into another, producing a function that takes the original arguments and returns the partial derivatives evaluated at each argument. Imperative-style AD, on the other hand, is a value-to-value dependency analysis. Although we use both notations in mathematics, imperative AD comes at the cost of semantic inconsistency with the host language, for example:
let y = f(x)
x = 3
gradient(of: y, wrt: x) // undefined
Semantically, y
is a value, but x
is both a value and a reference to a
memory location -- it is unclear what exactly we are differentiating with
respect to. Though making y
and x
have reference types could make this
particular example work out semantically, it would be fundamentally inconsistent
with Swift's core design where mathematical objects have value types, and would
also make scalar types like Float
incompatible with automatic differentiation.
We believe Swift's AD can achieve the same level of expressivity as imperative AD while preserving functional properties, and use language integration to push developers' productivity to the next level.
Swift is a general-purpose programming language. Therefore, not every function is mathematically differentiable, and not every type represents a real vector space to begin with. To make our system mathematically sound, we refine the Swift standard library to form a basis for automatic differentiation.
The starting point of this refinement is the fundamental numeric protocols.
In this section, we talk about how we improve the Numeric
protocol to support
the addition of vector types and protocols. Then, we introduce a protocol to
represent vector spaces, as that is a requirement for doing calculus.
Finally, we design a protocol specific to differentiation.
Revising the Numeric
protocol
The Numeric protocol today refines
ExpressibleByIntegerLiteral
.
This makes sense for scalars, but is not compatible with vector data structures
because type-checking would fail on the scalar multiplication operator.
On the Swift forum, we have discussed the fundamental blocker for vector types
to conform to the existing Numeric
protocol.
The consensus was to introduce a weakened version of the Numeric protocol to represent the abstractions shared between scalars and vectors: a rng (we assume vector spaces are rngs by endowing them with * as element-wise multiplication). The protocol will be called Arithmetic.
public protocol Arithmetic: Equatable {
    static var zero: Self { get }
    prefix static func + (x: Self) -> Self
    static func + (lhs: Self, rhs: Self) -> Self
    static func += (lhs: inout Self, rhs: Self)
    static func - (lhs: Self, rhs: Self) -> Self
    static func -= (lhs: inout Self, rhs: Self)
    static func * (lhs: Self, rhs: Self) -> Self
    static func *= (lhs: inout Self, rhs: Self)
}
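For illustration, here is a minimal sketch of a conformance to this protocol (Vector2f is a hypothetical type, not part of the proposal; Equatable is synthesized):

// A hypothetical 2-D vector conforming to `Arithmetic`.
struct Vector2f: Arithmetic {
    var x: Float
    var y: Float
    static var zero: Vector2f { return Vector2f(x: 0, y: 0) }
    prefix static func + (v: Vector2f) -> Vector2f { return v }
    static func + (lhs: Vector2f, rhs: Vector2f) -> Vector2f {
        return Vector2f(x: lhs.x + rhs.x, y: lhs.y + rhs.y)
    }
    static func += (lhs: inout Vector2f, rhs: Vector2f) { lhs = lhs + rhs }
    static func - (lhs: Vector2f, rhs: Vector2f) -> Vector2f {
        return Vector2f(x: lhs.x - rhs.x, y: lhs.y - rhs.y)
    }
    static func -= (lhs: inout Vector2f, rhs: Vector2f) { lhs = lhs - rhs }
    // `*` is element-wise multiplication, per the rng assumption above.
    static func * (lhs: Vector2f, rhs: Vector2f) -> Vector2f {
        return Vector2f(x: lhs.x * rhs.x, y: lhs.y * rhs.y)
    }
    static func *= (lhs: inout Vector2f, rhs: Vector2f) { lhs = lhs * rhs }
}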
The existing Numeric
will be changed to refine (inherit from) Arithmetic
,
keeping all of its existing behavior.
public protocol Numeric: Arithmetic, ExpressibleByIntegerLiteral {
associatedtype Magnitude: Comparable, Numeric
init?<T>(exactly source: T) where T: BinaryInteger
var magnitude: Magnitude { get }
}
After we introduce the Arithmetic
protocol, which makes the standard library
suitable for vector APIs and beyond, we can define a protocol that generalizes
vectors. Mathematically, a vector space is a rng if we endow it with *
as
element-wise multiplication. We represent vector spaces through the
VectorNumeric
protocol as follows. Scalar
is the type of the elements
of this vector space -- the field which the vector space is over.
Shape
is the shape of this vector space, which is
customizable. The initializer takes a value of the Scalar
type and a
Shape
and returns a vector of the specified shape.
/// A type that represents an unranked vector space. Values of this type are
/// elements of this vector space, each with a specific shape.
public protocol VectorNumeric: Arithmetic {
/// The type of scalars in the vector space.
associatedtype Scalar: Numeric
/// The type whose values specify the shape of an object in the vector
/// space.
associatedtype Shape
/// Create an object in the vector space with the specified shape by
/// repeatedly filling the object with the specified value.
///
/// - Parameters:
///   - repeatedValue: the value to repeat for the specified shape
/// - shape: the shape
init(repeating repeatedValue: Scalar, shape: Shape)
/// The shape of this vector.
var shape: Shape { get }
/// Returns the scalar product of the vector.
static func * (scale: Scalar, value: Self) -> Self
}
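Continuing the hypothetical Vector2f sketch from above, a fixed-size vector can use a trivial shape:

// A sketch of a `VectorNumeric` conformance, using an empty tuple as the
// (trivial) shape of a fixed-size vector.
extension Vector2f: VectorNumeric {
    typealias Scalar = Float
    typealias Shape = ()
    init(repeating repeatedValue: Float, shape: Shape) {
        self.init(x: repeatedValue, y: repeatedValue)
    }
    var shape: Shape { return () }
    static func * (scale: Float, value: Vector2f) -> Vector2f {
        return Vector2f(x: scale * value.x, y: scale * value.y)
    }
}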
Now we define a protocol that "activates" a type's differentiability. At first
glance, the conforming type must also be a VectorNumeric
type. So we make this
protocol refine VectorNumeric
. Since differentiation only makes sense on real
vectors, we add a constraint on the associated type Scalar
such that it
conforms to FloatingPoint
.
public protocol Differentiable: VectorNumeric where Scalar: FloatingPoint {
}
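For the hypothetical Vector2f sketched earlier, whose Scalar is Float (a FloatingPoint type), the conformance is then a one-liner:

// No additional requirements beyond `VectorNumeric` under the current definition.
extension Vector2f: Differentiable {}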
You may notice that Differentiable
looks like a dummy protocol because it
doesn't have any requirements other than the ones inherited from
VectorNumeric
. Although under the current assumptions we can completely omit
the Differentiable
protocol and just have the AD system recognize
VectorNumeric
-conforming types whose scalar elements conform to
FloatingPoint
, we actually have theoretical and practical reasons to revise
the Differentiable
protocol later on. So we keep Differentiable
as a
separate protocol for now and build towards the final design at the end of this
document.
We are aiming for an open and extensible system, so we made the compiler agnostic of the actual operations - it does not have special knowledge of numeric standard library functions or distinguish between primitive operators and other functions. We recursively determine a function's differentiability based on:
- whether a function has a primitive differentiability as specified in the standard or a user-defined library, and
- whether a function's definition (type signature and body) is differentiable by applying the chain rule of differentiation (see the sketch below).
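A minimal sketch of the second case (assuming * and sin are registered as differentiable primitives in the standard library, as shown for * later in this section):

// `scaledSin` is not registered as a primitive, but its definition is visible and
// composed of `*` and `sin`, so the compiler can differentiate it via the chain rule.
func scaledSin(_ x: Float, _ scale: Float) -> Float {
    return scale * sin(x)
}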
As such, we provide a syntactic way of specifying the differentiability of a function, using either the function's linearity properties, or a separate function that provides the "tangent code" (how to differentiate the function in forward mode) or the "adjoint code" (how to differentiate it in reverse mode).
We introduce a declaration attribute @differentiable
to Swift's syntax. The
full grammar of @differentiable
is defined as follows:
differentiation-mode = 'forward' | 'reverse' | 'bidirectional'
differentiability = differentiation-mode | 'linear' | 'constant'
differentiability-wrt-self = 'wrt' ':' 'self'
differentiation-order = 'once'
differentiation-tangent-specifier = 'tangent' ':' declaration-name
differentiation-adjoint-specifier = 'adjoint' ':' declaration-name
differentiable-attribute = '@differentiable'
'(' differentiability
[ ',' differentiability-wrt-self ]
    [ ',' differentiation-order ]
[ ',' differentiation-tangent-specifier ]
[ ',' differentiation-adjoint-specifier ]
')'
declaration-attribute = differentiable-attribute
The multiplication operator *
is differentiable with respect to its two
arguments. Here's how we make it differentiable in the standard library.
extension FloatingPoint {
@differentiable(bidirectional, tangent: tangentMul, adjoint: adjointMul)
static func * (x: Self, y: Self) -> Self { ... }
internal static func tangentMul(
x: (Self, Self), y: (Self, Self), originalResult: Self
) -> Self {
return x.1 * y.0 + y.1 * x.0
}
internal static func adjointMul(
x: Self, y: Self, originalResult: Self, seed: Self
) -> (Self, Self) {
return (seed * y, seed * x)
}
}
In TensorFlow, the convolution operator is only differentiable with respect to a subset of arguments. Here's how we make it differentiable so that it can be used for back-propagation.
@differentiable(reverse, adjoint: adjointConv2D)
public func conv2d(_ input: Tensor<Float>, filter: Tensor<Float>,
strides: @nondiff (Int32, Int32, Int32, Int32),
padding: @nondiff Padding) -> Tensor<Float> {
...
}
func adjointConv2D(_ input: Tensor<Float>, filter: Tensor<Float>,
                   strides: (Int32, Int32, Int32, Int32),
                   padding: Padding,
                   originalResult: Tensor<Float>,
                   seed: Tensor<Float>) -> (Tensor<Float>, Tensor<Float>) {
...
}
Differentiation parameters are marked inline at each argument position in the
function declaration. By default, every argument of the function is to be
differentiated with-respect-to, unless marked as @nondiff
.
When a differentiable attribute is applied on a method, or the getter of a
computed property in a type, the implicit self
argument often needs to be
differentiated with respect to. In order to make a function a differentiation
primitive with respect to self
, one can add wrt: self
to
the @differentiable
attribute.
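For example (a sketch; the exact parameter list of an adjoint registered with respect to self is assumed here to be the original result followed by the seed, returning the partial derivative with respect to self):

public extension Tensor {
    @differentiable(reverse, wrt: self, adjoint: adjointSquared)
    func squared() -> Tensor {
        return self * self
    }
    // Vector-Jacobian product: d(self * self)/d(self) chained with `seed`, i.e. 2·self·seed.
    func adjointSquared(originalResult: Tensor, seed: Tensor) -> Tensor {
        return seed * self + seed * self
    }
}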
There are five options for differentiability:
- Forward: @differentiable(forward, tangent: ...)

  This option says that the function is forward-mode differentiable. Forward-mode differentiation requires the "tangent code" (or tangent function) of this function, so that Swift knows how to compute the function's directional derivatives in the direction specified by the tangent vector that has been forward-propagated to the tangent function. The compiler will expect the name of the tangent function, with an expected type signature, to be specified in the tangent: parameter of the attribute.

- Reverse: @differentiable(reverse, adjoint: ...)

  This option says that the function is reverse-mode differentiable. Reverse-mode differentiation requires the "adjoint code" (or adjoint function) of this function, so that Swift knows how to compute the function's vector-Jacobian products, where the vector, also called the "adjoint vector", has been back-propagated to the adjoint function. The compiler will expect the name of the adjoint function, with an expected type signature, to be specified in the adjoint: parameter of the attribute.

- Bidirectional: @differentiable(bidirectional, tangent: ..., adjoint: ...)

  This option says that the function is both forward-mode differentiable and reverse-mode differentiable. The compiler will expect both the tangent function and the adjoint function to be specified in this attribute.

- Constant: @differentiable(constant)

  By definition, constant functions always have zero derivatives and are differentiable at any arbitrary order. So differentiating such a function will result in a zero vector (or vectors, when the function has multiple differentiation arguments) with the same shape as each differentiation argument.

- Linear: @differentiable(linear)

  By definition, a linear map is always a unary function and its Jacobian is the matrix associated with the linear transformation itself. In other words, both its differential and its pullback are the function itself.
As explained, the differentiabilities have different functional requirements; a concrete illustration follows the list below.
- forward differentiability

  When the differentiability is forward, the compiler expects a tangent: label in the attribute followed by the name (qualified or unqualified) of a tangent function that is to be associated with the original function. If the original function declaration has type (T0, ..., Tn) -> U, then the expected type of the tangent function is ((T0, T0), ..., (Tn, Tn), U) -> U. As we can see, every argument of the original function has become a "dual number" in the tangent function, represented as a tuple. The first element of such a tuple is the original argument; the second element is the forward-propagated directional derivative, namely the "vector" in "Jacobian-vector product". The last argument to the tangent function is the original function's result. The result of the tangent function is the directional derivatives. If any of the original arguments is marked as @nondiff, it will not become a dual number in the tangent function's argument list but will remain as the original argument itself.

- reverse differentiability

  When the differentiability is reverse, the compiler expects an adjoint: label in the attribute followed by the name (qualified or unqualified) of an adjoint function that is to be associated with the original function. If the original function declaration has type (T0, ..., Tn) -> U, then the expected type of the adjoint function is (T0, ..., Tn, U, U) -> (T0, ..., Tn). As we can see, the first n arguments to the adjoint function, T0, ..., Tn, are the original arguments. The next argument is the original function's result. The last argument is the back-propagated partial derivative at the original function's result, namely the "vector" in "vector-Jacobian product". The result of the adjoint function contains the partial derivatives with respect to each argument that has not been marked as @nondiff.

- bidirectional differentiability

  When the differentiability is bidirectional, the compiler expects both tangent: and adjoint: arguments to be specified.

- Other differentiabilities

  Other differentiabilities such as constant and linear do not require any associated functions. However, users can choose to specify tangent/adjoint function(s) for their own purposes such as custom optimizations.
When a function is marked as @differentiable
, Swift assumes it to be
higher-order differentiable, i.e. differentiable at all orders, unless once
is
specified in the attribute, in which case Swift will not guarantee any
higher-order differentiability. If their associated functions (tangent or
adjoint) are serialized, then their derivatives may be differentiable via a
separate code transformation.
Differentiabilities linear
and constant
guarantee smoothness, and they do
not have to be serialized whatsoever because their derivatives do not depend on
any code transformation.
forward
and reverse
transitively require the tangent function and the
adjoint function, respectively, to be differentiable with respect to the
original arguments. When compiling such declarations, Swift will verify the
tangent/adjoint function is also differentiable by static analysis. If they are
not differentiable, the compiler will error out, prompting the user to insert
once
in the @differentiable
attribute.
Example 1. Linear functions are differentiable at any order.
public extension Tensor {
@differentiable(linear, wrt: self)
func transposed() -> Self {
...
}
}
Example 2. A forward-mode primitive-differentiable function is differentiable at higher orders when its tangent is itself differentiable.
// Okay, the tangent function is differentiable.
@differentiable(forward, tangent: tangentFoo)
func foo(_ x: Float) -> Vector<Float> {
    return Vector(repeating: sin(x), shape: [2, 3])
}
func tangentFoo(_ dualX: (Float, Float),
originalResult: Vector<Float>) -> Vector<Float> {
let (x, dx) = dualX
// Differentiable because `Vector.init(repeating:shape:)`, `*`, `sin` and
// `cos` are all declared `@differentiable` and are differentiable.
return Vector(repeating: cos(x) * dx, shape: [2, 3])
}
Example 3. A reverse-mode primitive-differentiable function is not differentiable at a higher order because its adjoint is not differentiable.
@differentiable(reverse, adjoint: adjointBar)
func bar(_ x: Vector<Float>) -> Float {
return sin(x)[0]
}
var someGlobalVariable: Vector<Float> = [1, 1, 1]
func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
var ∂y∂x = Vector<Float>(repeating: 0, shape: x.shape)
someGlobalVariable[0] = cos(x[0]) * adjoint
∂y∂x[0] = someGlobalVariable[0]
return ∂y∂x
}
test.swift:3:35: error: function `bar` does not support higher-order differentiation
because its adjoint is not differentiable; would you like to add `once`?
@differentiable(reverse, adjoint: adjointBar)
^~~~~~~~~~
test.swift:8:6: note: `adjointBar` is defined here
func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
^~~~~~~~~~
test.swift:10:9: note: operation is not differentiable
someGlobalVariable[0] = cos(x[0]) * adjoint
^~~~~~~~~~~~~~~~~~~~~~~~~
Applying the chain rule of differentiation gives us vector-Jacobian products or Jacobian-vector products, each expressed as a function. Now that we have defined primitive differentiable functions, Swift can recursively differentiate any function whose body is available to the compiler.
We start by introducing the syntax of two raw differential operators:
#gradient(f)
: Produces the gradient off
, wheref: ℝⁿ → ℝ
.#derivatives(f)
: Produces derivatives off
, wheref: ℝ → ℝᵐ
.
The syntax of these operators looks like macros, but we will generalize them and make them look much nicer in the second half of this document.
Example:
func f(_ x: Vector<Float>, _ w: Vector<Float>) -> Float {
return x • w
}
#gradient(f) // (Vector<Float>, Vector<Float>) -> (Vector<Float>, Vector<Float>)
func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
    // Scalar-vector products; `Vector` is assumed to be expressible by an array literal.
    return (x * [1, 1, 1], x * [1, 1])
}
#derivatives(g) // (Float) -> (Vector<Float>, Vector<Float>)
The grammar of these raw differential operators is defined as follows:
derivatives-operator = '#derivatives'
gradient-operator = '#gradient'
raw-differential-operator = derivatives-operator | gradient-operator
autodiff-argument-index-specifier = '.' integer-literal
autodiff-expression =
    raw-differential-operator '(' expression [ ',' 'wrt' ':' autodiff-argument-index-specifier ] ')'
expression = autodiff-expression
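The optional wrt: specifier restricts differentiation to a single parameter position. Applied to the f defined above (a sketch; the result of a single-parameter gradient is assumed to take all original arguments and return only the selected partial derivative):

#gradient(f, wrt: .0) // (Vector<Float>, Vector<Float>) -> Vector<Float>
#gradient(f, wrt: .1) // (Vector<Float>, Vector<Float>) -> Vector<Float>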
Gradients and derivatives are two special cases of differentiation where the output or the input is a scalar, respectively. When neither is a scalar, vector-Jacobian products and Jacobian-vector products are computed with a vector. These cases are not obvious, but they are required for modular machine learning APIs, where each neural network layer defines a back-propagation method that takes a partial-derivative vector back-propagated from the previous layer. As such, we add two extra differential operators, which will be useful for computing these products.
#differential(f)
: Produces a function that takes the original arguments and returns the differential off
.#pullback(f)
: Produces a function that takes the original arguments and returns the pullback off
.
jvp-operator = '#differential'
vjp-operator = '#pullback'
raw-differential-operator = jvp-operator | vjp-operator
Example:
// An arbitrary generic function that is differentiable.
func f<T0, T1, U>(_ x: T0, _ y: T1) -> U
    where T0: Differentiable, T1: Differentiable, U: Differentiable {
    return someDifferentiableFunction(x, y)
}
#differential(f) // (T0, T1) -> (T0, T1) -> (U, U)
// Description:
//     (T0, T1)      ->  (T0, T1)  ->  (U,     U)
//     ^~~~~~            ^~~~~~        ^       ^
//     original args     vector        result  Jacobian-vector products
#pullback(f) // (T0, T1) -> (U, (U) -> (T0, T1))
// Description:
//     (T0, T1)      ->  (U,     (U)     ->  (T0, T1))
//     ^~~~~~            ^       ^           ^~~~~~~
//     original args     result  vector      vector-Jacobian products
The compiler type-checks a #gradient(f)
, as well as other differential
operators, by searching for the closest match given the contextual type. f
is
expected to have a definition to be differentiable, and thus cannot be a
closure whose body is opaque to the compiler. If it is, Swift reports an error.
Later in the compilation pipeline, the compiler recursively transforms the code
of f
to its gradient function ∇f
(or other functions in other modes of
differentiation), and replaces #gradient(f)
with ∇f
. Everything composes
together naturally. Now, differentiation works.
Automatic Differentiation based on raw differential operators is already available and being incubated temporarily on the "tensorflow" branch of Swift. Swift for TensorFlow development toolchains and tutorials are available for trying out this feature.
Automatic differentiation relies on the definition (body) of a function to be
able to differentiate it. Differential operators like #gradient
trigger the
differentiation of a function, and the differentiability of the function is
determined as differentiation goes. This works perfectly so far, but has a
number of problems.
Raw differential operators adopt the pound-keyword syntax, which has been
previously used for accessing compiler builtins, e.g. #file
and #dsohandle
,
referring to IDE-specific objects, e.g. #colorLiteral
and #imageLiteral
, and
interoperating with "stringly-typed" Objective-C key paths, e.g.
#keyPath(...)
. The pound-keyword syntax does not have native parsing support
for syntactic features like trailing closures, so it is hard to make the closure
code short under differential operators like #gradient
.
Example:
// Ideal
let dydx = gradient { x in
sin(x) + cos(x)
}
// Reality
let dydx = #gradient({ x in
sin(x) + cos(x)
})
When we introduced AD in Swift earlier in this document, we defined the differential operator as a higher-order function. Type checking and type inference were just expected to work like any other functions.
However, since the compiler needs to reject functions that are not
differentiable and differentiability is not part of the type system, even if we
were to redefine #gradient
as a higher-order function named gradient(of:)
,
the compiler would still have to maintain dedicated knowledge about this
function in order to reject invalid arguments.
As of now, the differentiability of a function is determined solely through two tests:
- Is the function a primitive-differentiable function (@differentiable)?
- Can the function's body be differentiated in the differentiation mode associated with the differential operator applied?
This simple system works perfectly when differentiating concrete functions defined in a local module, but does not allow differentiation of opaque function values or methods required by protocols. While being free of serialization is not a strict requirement for numerical computing libraries, not supporting differentiation on protocol requirements fundamentally obstructs composable high-level APIs that rely on AD, such as machine learning model APIs.
There is no way to define a higher-order function that differentiates its
argument using #gradient
. Here's an example:
func foo(_ f: (Float) -> Float) -> Float {
return #gradient(f)(0)
}
test.swift:2:22: error: cannot differentiate an opaque closure
return #gradient(f)(0)
~~~~~~~~~~^~
test.swift:1:12: note: value defined here
func foo(_ f: (Float) -> Float) -> Float {
^~~~~~~~~~~~~~~~~~~
Closure arguments and dynamic dispatch are non-differentiable through direct
source code transformation. The compiler does not statically know where f
is
coming from, nor can it delegate the task of differentiation of argument f
to
each callsite of foo
because it cannot be expressed in the type system.
As we can see, the core of the problem with definition-based differentiability is the opacity of functions. Requiring the differential operator to see the full definition of a function makes it impossible to define protocol-oriented differentiable code, and is the primary hindrance to modular, composable differentiation APIs.
As it turns out, this is not a new problem - we should learn from how we deal with calling conventions in Swift. Functions with different calling conventions have different type signatures, e.g. @convention(thick) and @convention(thin), and functions convert back and forth through conversion thunks implicitly.
// A "thin" function that captures no variables.
// Its representation is `@convention(thin)` by default.
func f(x: Int) -> Int {
    return x
}
var globalVar = 30
// A "thick" function that captures the value of `globalVar`.
// Its representation is `@convention(thick)` by default.
let g = { (x: Int) in globalVar + x }
// A higher-order function.
// The closure argument `h`'s representation is `@convention(thick)`, because it should
// be able to take closures that capture variables.
func takeFunc(_ h: (Int) -> Int) { ... }
takeFunc(f) // Implicitly converted function `f` to a `convention(thick)` closure by
// creating a conversion thunk.
takeFunc(g) // `g` is thick already. No conversion needed.
Sometimes, different conventions have different binary representations for
storing captured variables and such, just like the example with f
and g
above. In AD, the only difference between a non-differentiable function and a
differentiated function (say, in reverse mode) is whether the function carries a
few other function pointers that represent the function's adjoint code, so we
can model differentiable functions using a "thicker" function type, which
bundles the original function representation along with pointers to the original
function's Jacobian-vector product functions and/or vector-Jacobian product
functions. When a normal function with a visible body gets passed as an
@autodiff
function, the function will be differentiated.
// `f` is a normal function that has type `(Float) -> Float`.
func f(x: Float) -> Float {
return sin(x)
}
// `f` gets implicitly converted (or more accurately, differentiated).
let g = f as @autodiff (Float) -> Float
func takesFunc(_ someFunc: @autodiff (Float) -> Float) {
#derivatives(someFunc)
...
}
// At the callsite of `takesFunc(_:)`, `f` gets implicitly differentiated to become
// `@autodiff (Float) -> Float`.
takesFunc(f)
If a normal function does not have a visible body, then it cannot be passed as
an @autodiff
function. Swift will show an error at compile-time.
var normalFuncWithOpaqueBody: (Float) -> Float = ...
takesFunc(normalFuncWithOpaqueBody)
test.swift:19:11: error: function is not differentiable, but the contextual type is
'@autodiff (Float) -> Float'
takesFunc(normalFuncWithOpaqueBody)
^~~~~~~~~~~~~~~~~~~~~~~~
test.swift:17:4: note: value defined here
var normalFuncWithOpaqueBody: (Float) -> Float = ...
^~~~~~~~~~~~~~~~~~~~~~~~
At first glance, this could even be an addition to the existing @convention
attribute as something like @convention(autodiff)
; however, differentiability
does not align semantically with @convention
. First, when a function becomes
its differentiable (or differentiated) form, its original calling convention is
not changed. Second, functions with any convention are technically
differentiable, including thin
, thick
, method
, etc. Third,
differentiability is not the only information that needs to be encoded --
there's also the order of differentiation. Therefore, we need a separate
dimension of "thickness" in the function type: differentiability.
We define a new formalization of differentiability in Swift's type system,
including an @autodiff
function type attribute, an extension to functions'
layout, and new syntax for selecting differentiable arguments.
The @autodiff
attribute on a function type specifies the function's
differentiability and differentiation order, just like @differentiable
on
function declarations. The biggest differences are:

- @differentiable contains associated functions (tangent/adjoint) statically, but @autodiff functions carry those extra function pointers in their binary representation as a runtime property. Any user of such a function will be able to differentiate it, with differentiability guaranteed formally by the type system. With this addition to the type system, serialization/inlinability is no longer necessary, because functions can be passed around without losing differentiability.
- Differentiation order is no longer once vs. infinite. Instead, @autodiff functions can specify a maximum order at which the function can be differentiated, unless the function is linear or constant. This is because function-representation-based differentiability requires functions to be differentiated ahead of becoming a value and being passed around.
The grammar for @autodiff
is defined as follows:
differentiation-order = 'order' ':' integer-literal
differentiability = 'forward' | 'reverse' | 'linear' | 'constant' | 'bidirectional'
autodiff-attribute = '@autodiff' [ '(' differentiability [ ',' differentiation-order ] ')' | '(' differentiation-order ')' ]
When a differentiability is specified on a function type, it's obvious that its
functions' differentiation behavior is akin to what's defined for the
@differentiable
declaration attribute. If no differentiability is specified,
this function is both forward-mode and reverse-mode differentiable (same as
bidirectional
).
It becomes increasingly clear that first-order differentiation will not, and
should not, require serialization, and only higher-order differentiation should
due to code size. In order to make the system consistent, we make each
@differentiable
function declaration result in an @autodiff
function.
Since we want to support differentiating opaque functions, we must support
creating one. The fact is, the user does not even need to know about @autodiff
or intentionally create differentiable functions if they are working with
functions in the current module. Whenever a local function declaration gets used
where the contextual type has an @autodiff
attribute on it, Swift
differentiates it. If differentiation fails, Swift reports an error at
compile-time.
For public APIs, we relax the constraint on @differentiable so that it can be applied to any function declaration without specifying a tangent or adjoint, even when the differentiability is forward/reverse. In that case, Swift tries to differentiate the function and export its derivatives as part of the public API: if the function gets differentiated, its default type signature has the @autodiff attribute on it; otherwise, Swift reports an error to the user showing what is non-differentiable.
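For example (a hypothetical library function; no tangent or adjoint is written, so the compiler differentiates the body and exports the derivative):

@differentiable(reverse)
public func squareLoss(_ prediction: Float, _ label: Float) -> Float {
    return (prediction - label) * (prediction - label)
}
// Its default type signature would then be `@autodiff(reverse) (Float, Float) -> Float`.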
In order for modular libraries to support opaque higher-order differentiation, the differentiation order must be specified in the closure type signature, so that the closure ABI is guaranteed to contain the higher-order derivative.
@autodiff(reverse, order: 2) (T) -> U
For example, function g
takes a differentiable function that is differentiable
up to at least the 3rd order, then differentiates it 3 times in the body.
// In a separate module:
func g(_ h: @autodiff(reverse, order: 3) (Float) -> Float) -> Float {
    return #gradient(h)(1) +
        #gradient(#gradient(h))(1) +
        #gradient(#gradient(#gradient(h)))(1)
}
We also extend the @differentiable
attribute so that it can specify that a
primitive-differentiable function is to be differentiated up to a
specific order ahead of time. For example, when Swift compiles function f
below, this function will have been differentiated 6 times, and gradient
functions will be preserved in f
's ABI so that its derivatives can be called
from anywhere (any other Swift module, or even C). f
's default type signature
is @autodiff(reverse, order: 6) (Float) -> Float
.
@differentiable(reverse, order: 6)
public func f(_ x: Float) -> Float {
return pow(x, 6)
}
Differentiable functions with a maximum differentiation order can be implicitly
"down-ordered", that is, differentiable functions with a higher maximum
differentiation order can be implicitly converted to a function with a lower
maximum differentiation order. For example, we can directly pass f
as an
argument to g
.
g(f) // 156
Because of their mathematical properties, differentiabilities can be converted to one another statically without runtime overhead. For example, a constant function is also a linear function when it's unary; a linear function is a bidirectional-differentiable function whose tangent and adjoint are both themselves; any differentiability can be completely dropped from a function type, forming a "normal" function. This allows us to define generic algorithms using differentiation, without specializing them on function types of each differentiability.
The following table shows whether each differentiability (as a row label) can be converted to another (as a column label).
Convertible to: | None | Linear | Constant | Forward | Reverse | Bidirectional |
---|---|---|---|---|---|---|
None | ✔ | | | | | |
Linear | ✔ | ✔ | | ✔ | ✔ | ✔ |
Constant | ✔ | ✔ (unary) | ✔ | ✔ | ✔ | ✔ |
Forward | ✔ | | | ✔ | | |
Reverse | ✔ | | | | ✔ | |
Bidirectional | ✔ | | | ✔ | ✔ | ✔ |
What does differentiability conversion look like in real code? Just like
@convention
conversion, differentiability conversion is implicit and has
little mental overhead to the user.
let linear: @autodiff(linear) (Float) -> Float = ...
let bidir: @autodiff (Float) -> Float = ...
let const: @autodiff(constant) (Float) -> Float = ...
func foo(_: @autodiff(reverse) (Float) -> Float) { ... }
foo(linear) // Okay! Implicitly converted to `@autodiff(reverse)`.
foo(bidir) // Okay! Implicitly converted to `@autodiff(reverse)`.
foo(const) // Okay! Implicitly converted to `@autodiff(reverse)`.
...
Generalized Differentiability enabled us to define custom differential operators in a functional way. Now it's time to define the true differential operators.
We start with functions that take a function and produce a function that computes derivatives or gradients. Recall that we already have the built-in syntax #gradient and #derivatives for computing gradients and derivatives, but we are exploring more expressive APIs enabled by Generalized Differentiability, which lets us differentiate functions passed as arguments.
We define two forward-mode differential operators for computing basic derivatives:
derivatives(of:)
computes a derivatives function that takes a value and returns derivatives evaluated at the given value.derivatives(at:in:)
computes derivatives of a closure at a given value.
/// Computes derivatives of `body`.
func derivatives<T: FloatingPoint, R: Differentiable>(
of body: @autodiff(forward) (T) throws -> R
) rethrows -> (T) -> R {
return { x in #differential(body)(x)(1).1 } // seed = dx/dx = 1
}
/// Computes derivatives of `body` at scalar `x`.
func derivatives<T: FloatingPoint, R: Differentiable>(
at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
return derivatives(of: body)(x)
}
We also define two reverse-mode differential operators for computing basic gradients:
gradient(of:)
computes a gradient function that takes a value and returns the gradient evaluated at the given value.gradient(at:in:)
computes the gradient of a closure evaluated at a given value.
/// Computes the gradient of `body`.
func gradient<T: Differentiable, R: FloatingPoint>(
of body: @autodiff(reverse) (T) throws -> R
) rethrows -> (T) -> T {
return { x in #pullback(body)(x).1(1) } // seed = dx/dx = 1
}
/// Computes the gradient of `body` at `x`.
func gradient<T: Differentiable, R: FloatingPoint>(
at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
return gradient(of: body)(x)
}
As we can see, since we are to differentiate a higher-order function's argument
(thanks to Generalized Differentiability), we can define derivatives(of:)
and
gradient(of:)
as Swift functions in terms of more general raw differential
operators, #differential
and #pullback
, to replace #derivatives
and
#gradient
!
These differential operators work seamlessly with closure captures,
error-throwing functions, or arbitrary side-effecting code that does not
contribute to the closure result. This looks quite like value-based automatic
differentiation while the math is actually fully functional. This achieves a
similar level of expressivity as imperative-style automatic differentiation
libraries: Instead of writing gradient(...)
at the bottom of a forward pass,
one would just write it on top and have a trailing closure close over the
forward pass.
Example: Train a simple 2-layer perceptron. The snippet computes the gradient w.r.t. each parameter at each training step, prints a loss, and optimizes parameters.
struct Parameters: Differentiable, ParameterGroup {
var w1 = Tensor<Float>(randomNormal: [784, 30])
var b1 = Tensor<Float>(zeros: [30])
var w2 = Tensor<Float>(randomNormal: [30, 10])
var b2 = Tensor<Float>(zeros: [10])
}
var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent(learningRate: 0.1)
for (x, y) in minibatches {
let grads = gradient(at: params) { params in
let h1 = tanh(matmul(x, params.w1) + params.b1)
let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
let loss = (y - ŷ).squared().mean()
print("Loss is \(loss)")
return loss
}
optimizer.fit(¶ms, gradients: grads)
}
Since the forward pass is a trailing closure passed as an argument to gradient(at:in:), the forward computation is just as customizable as in operator-overloading AD systems. Users can do whatever they want with intermediate values or the result in the primal computation.
That said, we would like to provide a way to have the differentiation API return the original result directly. Because of Generalized Differentiability, these APIs can be defined entirely as library functions using primitive differential operators.
/// Computes `body(x)` and derivatives of each scalar output of `body` at `x`.
func valueWithDerivatives<T: FloatingPoint, R: Differentiable>(
at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> (value: R, derivatives: R) {
return #differential(body)(x)(1)
}
/// Computes `body(x)` and the gradient of `body` at `x`.
func valueWithGradient<T: Differentiable, R: FloatingPoint>(
at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (value: R, gradient: T) {
let (y, pullback) = #pullback(body)(x)
return (y, pullback(1))
}
Jacobian-vector products (forward-mode) and vector-Jacobian products (reverse-mode) are extremely useful differential operators for lots of tasks in numerical computing.
/// Computes Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
at x: T, vector: T,
in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
    return #differential(body)(x)(vector).1
}
/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
at x: T, vector: R,
in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
    return #pullback(body)(x).1(vector)
}
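For instance, a layer-style backward pass can be written directly with vectorJacobianProducts(at:vector:in:), without ever materializing a Jacobian (a sketch; dense, w, x, and upstream are hypothetical values from a surrounding training loop):

// `upstream` is ∂loss/∂output back-propagated from the next layer;
// the result is ∂loss/∂x for this layer's input.
let dx = vectorJacobianProducts(at: x, vector: upstream) { x in
    dense(x, weights: w)
}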
In some cases, computational tasks rely on fully extensible differential
operators as well as maximum efficiency, e.g. computing vector-Jacobian products
as well as the original function's result. Luckily, the two operators we
mentioned in the very beginning when we introduced Jacobians are the ones we
need: differential and pullback. We already have their raw operators supported in the syntax, #differential and #pullback, but we can make them nicer by redefining them as Swift functions.
Function differential(at:in:)
computes the differential of a closure at a
certain point, and returns a linear map that takes a vector and returns
Jacobian-vector products.
/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> R {
    return { v in #differential(body)(x)(v).1 }
}
Function differentialWithResult(at:in:)
computes the differential of a closure
at a certain point, and returns a linear map that takes a vector and returns
both the original function's result and Jacobian-vector products.
/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> (originalResult: R, derivatives: R) {
    return #differential(body)(x)
}
Function pullback(at:in:)
computes the pullback of a closure at a certain
point, and returns a linear map that takes a vector and returns vector-Jacobian
products.
/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R) -> T {
return #pullback(body)(x).1
}
Function resultWithPullback(at:in:)
computes the pullback of a closure at a
certain point, and returns the original function's result and a linear map that
takes a vector and returns vector-Jacobian products.
/// Computes the original value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R) -> T) {
return #pullback(body)(x)
}
It is amazing that we are able to define every differential operator in terms of
other differential operators. #differential
and #pullback
have become
unnecessary because the functional form is so much nicer, so we can teach the
compiler to recognize Swift functions differential(at:in:)
and
pullback(at:in:)
as the builtin "canonical" differential operators, and remove
all raw differential operators that start with a #
from the language.
Examples:
- Chain directional derivatives freely using differentials.

  let x = 0.5
  let df = differential(at: x) { x in sin(cos(x)) }
  df(1)                            // df/dx
  df(derivatives(of: log)(t))      // df/dt, supposing x = log(t) for some scalar t
  df(derivatives(at: t, in: log))  // df/dt

- Chain gradients freely using pullbacks.

  let x = 0.5
  let (y, df) = resultWithPullback(at: x) { x in cos(sin(x)) }
  df(1)                        // dy/dx
  df(gradient(of: log)(t))     // dy/dt
  df(gradient(at: t, in: log)) // dy/dt
Second-order optimization methods in machine learning make use of Hessians and Hessian-vector products, which can be hard to compute. Many AD libraries such as Autograd already support Hessians by supporting arbitrarily nested forward-mode/reverse-mode differentiation. Hessian-vector products can be efficiently computed by applying "forward-on-reverse", namely applying the composition of the forward-mode differential operator and the reverse-mode differential operator on a function.
Just like other differential operators, we can define the Hessian-vector products operator in a simple, functional way.
func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
return differential(at: x, in: gradient(of: f))
}
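A usage sketch, assuming Double conforms to Differentiable: the second derivative of x ↦ x⁴ is 12x², so at x = 3 applying the returned linear map to 1 yields 108.

let hv = hvp(at: 3.0) { x in x * x * x * x }
hv(1) // 108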
Nested differentiation without a careful implementation is prone to a bug known as perturbation confusion [1] [2]. Language-integrated AD in Swift will enforce tagging in compiler-generated code to guarantee the correctness of higher-order derivatives.
Earlier in this document, we discussed enhancements to standard library
protocols and extensions to the standard library to model differentiable types.
These protocols are general enough for standard library types such as floating
point scalars (Float
, Double
, and Float80
) and potentially SIMD
vectors.
However, in any general-purpose programming language, there is always a question
of how much math the standard library should have.
We think basic differential operators like gradient(of:) and derivatives(of:)
should be included in the standard library, because they are common operators
that one would find in college calculus, and they will make AD feel more
language-integrated alongside the standard library protocols VectorNumeric and
Differentiable.
We do believe, however, that operators whose names contain terms like "Jacobian" and "differential" should live in a separate module, possibly called "AutomaticDifferentiation", that ships with the Swift language.
We introduced the Differentiable protocol, which makes a type differentiable by
requiring it to represent a vector space. However, there are a few scenarios
where such a protocol won't work well.
- Customizable weight type
Orthogonal weight matrices have shown advantages in neural network training [1] [2]. When differentiating through these networks, gradients with respect to the weights no longer stay orthogonal - instead, they are skew-symmetric matrices. While we can represent both orthogonal matrices and skew-symmetric matrices as values of a Matrix or Tensor type and programmatically ensure their orthogonality, some researchers have been seeking a way to represent this natively in the type system of a programming language and still have AD produce the correct derivative.
- Quantized training
Quantization techniques store and compute numbers in more compact formats, e.g. a fixed-point data type. Conceptually, a quantized tensor for a real-valued Tensor can be defined as the following struct:
public struct Quantized<Dequantized: Quantizable, QuantizedScalar: FixedWidthInteger> {
    var data: Dequantized
    var range: Range<Dequantized.Scalar>
    var scale: QuantizedScalar
    var zeroPoint: Int
}
We can imagine a scenario where the developer defines a neural network as a function whose parameters are of type Quantized<Tensor<Float>, Int8>. When training the parameters of this neural network, gradients need to flow at a significantly higher precision, but today's system cannot achieve that because it assumes gradients to have the same type as the original arguments.
- Generic optimizers
Optimization problems in machine learning can be generalized as optimization on manifolds. Optimizers in most libraries assume both the original parameter space and the loss space to be vector spaces, and perform an implicit conversion from cotangent vectors to tangent vectors and another from tangent vectors to the original weight type when performing θ -= η * ∂L/∂θ. While this works in most cases, it won't generalize to a dedicated orthogonal-matrix type, because orthogonal matrices are not a vector space, and a conversion from an orthogonal matrix to a skew-symmetric matrix cannot be implicit.
To address the concerns raised above, we found a more general way to model
differentiable types. Instead of requiring them to be vector spaces
(VectorNumeric), we model them as differentiable manifolds. Reverse-mode
differentiation of a function over such manifolds produces gradient vectors in
the cotangent bundle; forward-mode differentiation produces derivatives in the
tangent bundle. Note that we cannot represent tangent/cotangent bundles
separately from the individual tangent/cotangent spaces within them, because
Swift does not have dependent types. By removing the restriction to
VectorNumeric, Differentiable is now fully extensible.
/// A type that mathematically represents a differentiable manifold whose
/// tangent spaces are finite-dimensional.
///
/// In automatic differentiation, differentiation will produce a Jacobian whose
/// elements are of `TangentVector` type.
public protocol Differentiable {
/// The tangent vector space of this differentiable manifold.
associatedtype TangentVector: VectorNumeric
where TangentVector.Scalar: FloatingPoint
/// The cotangent vector space of this differentiable manifold.
associatedtype CotangentVector: VectorNumeric
where CotangentVector.Scalar: FloatingPoint
/// Returns `self` moved along the value space towards the given tangent
/// vector. In Riemannian geometry, this is usually equivalent to a
/// retraction or the exponential map.
func moved(toward direction: TangentVector) -> Self
/// Convert a cotangent vector to its corresponding tangent vector.
func tangentVector(from cotangent: CotangentVector) -> TangentVector
}
When the tangent vector type of a differentiable manifold is the same as its
cotangent vector type, we can provide a default implementation of
tangentVector(from:), which is simply the identity function.
public extension Differentiable where TangentVector == CotangentVector {
func tangentVector(from cotangent: CotangentVector) -> TangentVector {
return cotangent
}
}
When a differentiable manifold is a vector space, its tangent space is usually
itself. In these cases, we simply define moved(toward:) as vector addition.
public extension Differentiable
where Self: VectorNumeric, TangentVector == Self {
func moved(toward direction: TangentVector) -> Self {
return self + direction
}
}
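To make these requirements concrete, here is a minimal sketch of a manual conformance: a point on the unit circle, parameterized by an angle. UnitCirclePoint is a hypothetical type used purely for illustration, and the sketch assumes Float conforms to VectorNumeric as discussed earlier. Because its tangent and cotangent spaces coincide (both are ℝ), tangentVector(from:) comes for free from the TangentVector == CotangentVector extension above.
struct UnitCirclePoint: Differentiable {
    var angle: Float

    // Both the tangent space and the cotangent space are ℝ.
    typealias TangentVector = Float
    typealias CotangentVector = Float

    // Moving along the manifold is the exponential map: advance the angle by
    // the tangent vector.
    func moved(toward direction: Float) -> UnitCirclePoint {
        return UnitCirclePoint(angle: angle + direction)
    }
}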
It is very common in numerical computing to deal with lots of parameters, each
of which is a vector or a matrix. In these cases, instead of manually specifying
each input in a differential operator's argument list, users would often like
to differentiate through structures and obtain a structure of partial
derivatives. It is important for Swift to provide derived conformances for the
core protocols for numerical computing: Differentiable and VectorNumeric.
Mathematically, it is straightforward to represent product types: a struct or tuple in Swift corresponds to a product of sets, and an enum in Swift corresponds to a sum of sets.
struct Parameters: VectorNumeric, Differentiable {
var a: Vector<Float>
var b: Float
}
Struct Parameters is equivalent to a product of the sets Vector<Float> and
Float, or a product of a real vector space ℝⁿ and the scalar field ℝ, namely
ℝⁿ ⨯ ℝ, which is also a vector space. To make Parameters obtain the traits of a
vector space, we extend the compiler to derive a conformance to VectorNumeric,
similar to how Codable and Hashable conformances are derived. When a conformance
clause is given in the current file and all stored properties conform to
VectorNumeric with the same Scalar, the compiler synthesizes the AST to make the
type conform, with all protocol requirements applied property-wise.
After deriving the conformance to VectorNumeric:
struct Parameters: VectorNumeric {
var a: Vector<Float>
var b: Float
// derived:
typealias Scalar = Float
// derived:
struct Shape {
var a: Vector<Float>.Shape
var b: Float.Shape
}
// derived:
static func + (lhs: Parameters, rhs: Parameters) -> Parameters {
return Parameters(a: lhs.a + rhs.a, b: lhs.b + rhs.b)
}
// ...
}
In order for Parameters to be differentiable, it must also conform to
Differentiable. Deriving a conformance to Differentiable follows the same rules.
struct MyShapes: Differentiable {
var a: Circle // conforms to Differentiable
var b: Square // conforms to Differentiable
}
After deriving conformances to Differentiable
:
struct MyShapes: Differentiable {
var a: Circle
var b: Square
// derived:
struct TangentVector: VectorNumeric {
var a: Circle.TangentVector
var b: Square.TangentVector
}
// derived:
struct CotangentVector: VectorNumeric {
var a: Circle.CotangentVector
var b: Square.CotangentVector
}
// derived:
func moved(toward direction: TangentVector) -> MyShapes {
return MyShapes(a: a.moved(toward: direction.a),
b: b.moved(toward: direction.b))
}
// derived:
func tangentVector(from cotangent: CotangentVector) -> TangentVector {
return TangentVector(a: a.tangentVector(from: cotangent.a),
                     b: b.tangentVector(from: cotangent.b))
}
}
With derived conformances to these protocols, the user can now write arbitrarily nested structs of differentiable manifolds and make them differentiable with trivial effort, greatly simplifying development.
In the new Differentiable protocol, we added the TangentVector and
CotangentVector associated types to represent the types of Jacobian-vector
products and vector-Jacobian products, respectively. We make the following
changes to the existing differential operators we introduced.
- Differential operators that returned T as a forward-differentiated derivative now return T.TangentVector instead.
- Differential operators that returned T as a reverse-differentiated derivative now return T.CotangentVector instead.
- Vectors of type T used for computing Jacobian-vector products become T.TangentVector.
- Vectors of type T used for computing vector-Jacobian products become T.CotangentVector.
Here we list a few updated differential operators.
Jacobian-vector products (forward-mode) and vector-Jacobian products (reverse-mode) are extremely useful differential operators for lots of tasks in numerical computing.
/// Computes Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
at x: T, vector: T.TangentVector,
in body: @autodiff(forward) (T) throws -> R
) rethrows -> R.TangentVector {
return #differential(body)(x)(vector)
}
/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
at x: T, vector: R.CotangentVector,
in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T.CotangentVector {
return #pullback(body)(x).1(vector)
}
/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> R.TangentVector {
return #differential(body)(x).1
}
/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> (originalResult: R, derivatives: R.TangentVector) {
return #differential(body)(x)
}
/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R.CotangentVector) -> T.CotangentVector {
return #pullback(body)(x).1
}
/// Computes the value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R.CotangentVector) -> T.CotangentVector) {
return #pullback(body)(x)
}
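As a usage sketch combining the derived conformances with these updated operators: Circle, Square, and their area() methods (and the initializers used here) are hypothetical, carried over from the MyShapes illustration above, and Float is assumed to be its own cotangent vector.
let shapes = MyShapes(a: Circle(radius: 1), b: Square(side: 2))
// Reverse-mode differentiation of a scalar-valued function over a struct of
// differentiable manifolds yields a struct of cotangent vectors.
let pb = pullback(at: shapes) { s in s.a.area() + s.b.area() }
let grads: MyShapes.CotangentVector = pb(1)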
Recall that the motivation for introducing a general, future-proof
Differentiable protocol is to be able to model the following use cases.
- Neural networks with orthogonal weights can now be differentiated. We can define a type OrthogonalMatrix that conforms to Differentiable, and another type SkewSymmetricMatrix that conforms to both Differentiable and VectorNumeric.
struct SkewSymmetricMatrix: Differentiable, VectorNumeric {
    typealias Scalar = Float
    ...
}
struct OrthogonalMatrix: Differentiable {
    ...
    typealias TangentVector = SkewSymmetricMatrix
    typealias CotangentVector = SkewSymmetricMatrix
}
When we differentiate a function (OrthogonalMatrix) -> Float using the reverse-mode differential operator, we get a function (OrthogonalMatrix) -> SkewSymmetricMatrix. Everything falls out naturally, with no compromise in type safety.
- Differentiating a quantized network is now possible with AD.
// `Quantized` is a vector space when the dequantized type is one.
extension Quantized: VectorNumeric where Dequantized: VectorNumeric {
    typealias Scalar = Dequantized.Scalar
    static func + (lhs: Quantized, rhs: Quantized) -> Quantized {
        // Custom code: Dequantize, add, and requantize!
    }
    static func * (lhs: Scalar, rhs: Quantized) -> Quantized {
        // Custom code: Dequantize, multiply, and requantize!
    }
}
// `Quantized` is a differentiable manifold when the dequantized type is one.
extension Quantized: Differentiable where Dequantized: Differentiable {
    typealias TangentVector = Dequantized.TangentVector
    typealias CotangentVector = Dequantized.CotangentVector
    func moved(toward tangent: Dequantized.TangentVector) -> Quantized {
        // Custom code: Dequantize, move, and requantize!
    }
}
With Quantized conforming to the new Differentiable protocol, when we differentiate a function of type (Quantized<Tensor<Float>, Int8>) -> U, AD produces a function of type (Quantized<Tensor<Float>, Int8>) -> Tensor<Float>, which is close to exactly what we need for quantized training of neural networks.
- Generic optimizers can be defined in terms of manifold optimization functions, without implicit casting.
extension SGD {
    func fit(_ parameters: inout Parameters, gradients: Parameters) {
        parameters.update(withGradients: gradients) { θ, g in
            θ = θ.moved(toward: -θ.tangentVector(from: g) * learningRate)
        }
    }
}
Some machine learning models require manipulating the gradient with respect to certain values, e.g. gradient clipping. Tangent provides such a feature as a syntax extension in Python. Recurrent neural networks often suffer from the "exploding gradient" problem, and a typical solution is to force the gradient of an RNN to not exceed a certain value by performing gradient clipping.
func prediction(for input: Tensor<Float>) -> Tensor<Float> {
    var prediction = input
    for _ in 0...5 {
        // Clip the gradient flowing through this step to [-1, 1].
        prediction = prediction.withCustomizedGradient { grad in
            max(min(grad, 1), -1)
        }
        prediction = lstm.prediction(for: prediction)
    }
    return prediction
}
The APIs withCustomizedGradient(_:) and withCustomizedDerivatives(_:) look like
compiler-known functions that make Swift run customized code within
differentiated code. However, because of the generality of the @differentiable
registration mechanism, these functions can be defined entirely as ordinary
Swift functions, with no special support from the compiler.
Here's the implementation of these APIs.
public extension Differentiable {
@differentiable(forward, wrt: self, tangent: tangentCustomizingDerivatives)
func withCustomizedDerivatives(
_ body: @nondiff (TangentVector) -> TangentVector
) -> Self {
return self
}
internal func tangentCustomizingDerivatives(
body: (TangentVector) -> TangentVector,
originalResult: Self,
tangent: TangentVector
) -> TangentVector {
return body(tangent)
}
@differentiable(reverse, wrt: self, adjoint: adjointCustomizingGradient)
func withCustomizedGradient(
_ body: @nondiff (CotangentVector) -> CotangentVector
) -> Self {
return self
}
internal func adjointCustomizingGradient(
body: (CotangentVector) -> CotangentVector,
originalResult: Self,
adjoint: CotangentVector
) -> CotangentVector {
return body(adjoint)
}
}
This API supports many gradient manipulation tasks in machine learning optimization. For example, the user can make gradient computation trigger a break from the loop.
var prediction = input
for _ in 0...5 {
// Stop loop when necessary.
var shouldStop = false
prediction = prediction.withCustomizedGradient { grad in
if grad < lowerBound {
shouldStop = true
}
return grad
}
if shouldStop {
break
}
prediction = lstm.prediction(for: prediction)
}
Setting a mutable flag is not the most user-friendly approach. We could create
APIs that wrap withCustomizedDerivatives(_:) and withCustomizedGradient(_:) and
also return a Bool, so that later code can decide whether to break from the loop
based on that return value. Better yet, if Swift supported non-local control
flow, i.e. breaking out of a nested closure, the code could be written with just
a break.
var prediction = input
for _ in 0...5 {
// Stop loop when necessary.
prediction = prediction.withCustomizedGradient { grad in
if grad < lowerBound {
break
}
return grad
}
prediction = lstm.prediction(for: prediction)
}
The author would like to thank Dan Zheng, Chris Lattner, Alex Wiltschko, Bart van Merriënboer, Gordon Plotkin, Dougal Maclaurin, Matthew Johnson, Casey Chu, Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the initial design of this powerful language feature.