Using Inverse & Implicit Function Theorems
Until now we have considered methods for computing derivatives that work directly on the function being differentiated. However, this is not always possible, for example when the function can only be computed via an iterative algorithm, or when no explicit definition of the function is available. In this section we will see how we can use two basic results from calculus to get around these difficulties.
Inverse Function Theorem
Suppose we wish to evaluate the derivative of a function \(f(x)\), but evaluating \(f(x)\) is not easy. Say it involves running an iterative algorithm. You could try automatically differentiating the iterative algorithm, but even if that is possible, it can become quite expensive.
In some cases we get lucky, and computing the inverse of \(f(x)\) is an easy operation. In these cases, we can use the Inverse Function Theorem to compute the derivative exactly. Here is the key idea:
Assuming that \(y=f(x)\) is continuously differentiable in a neighborhood of a point \(x\) and \(Df(x)\) is the invertible Jacobian of \(f\) at \(x\), then by applying the chain rule to the identity \(f^{-1}(f(x)) = x\), we have \(Df^{-1}(f(x))Df(x) = I\), or \(Df^{-1}(y) = (Df(x))^{-1}\), i.e., the Jacobian of \(f^{-1}\) is the inverse of the Jacobian of \(f\), or \(Df(x) = (Df^{-1}(y))^{-1}\).
For example, let \(f(x) = e^x\). Now of course we know that \(Df(x) = e^x\), but let’s try to compute it via the Inverse Function Theorem. For \(y > 0\), we have \(f^{-1}(y) = \log y\), so \(Df^{-1}(y) = \frac{1}{y}\), and therefore \(Df(x) = (Df^{-1}(y))^{-1} = y = e^x\).
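As a quick numerical sanity check of this example, the following minimal sketch computes \(Df(x)\) for \(f(x) = e^x\) using only the derivative of its inverse, \(\log y\). The function name `DerivativeViaInverse` is hypothetical and introduced purely for illustration.

```cpp
#include <cmath>

// Hypothetical helper: derivative of f(x) = exp(x) at x, obtained from the
// derivative of the inverse f^{-1}(y) = log(y) via the Inverse Function
// Theorem: Df(x) = 1 / Df^{-1}(y), with y = f(x).
double DerivativeViaInverse(double x) {
  const double y = std::exp(x);               // y = f(x), always positive.
  const double inverse_derivative = 1.0 / y;  // D log(y) = 1 / y.
  return 1.0 / inverse_derivative;            // Invert the (1x1) Jacobian.
}
```

Calling `DerivativeViaInverse(x)` should agree with `std::exp(x)` up to rounding, which is what the theorem predicts for this function.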
You may be wondering why the above is true. A smoothly differentiable function is well approximated by a linear function in a small neighborhood. Indeed, this is a good way to think about the Jacobian: it is the matrix that best approximates the function linearly. Once you do that, it is straightforward to see that locally \(f^{-1}(y)\) is best approximated linearly by the inverse of the Jacobian of \(f(x)\).
Let us now consider a more practical example.
Geodetic Coordinate System Conversion
When working with data related to the Earth, one can use two different coordinate systems: the Latitude-Longitude-Altitude (LLA) coordinate system, or the Earth-Centered, Earth-Fixed (ECEF) coordinate system. The former is familiar but not terribly convenient analytically. The latter is a Cartesian system but not particularly intuitive. So systems that process Earth-related data have to go back and forth between these coordinate systems.
The conversion between the LLA and the ECEF coordinate system requires a model of the Earth, the most commonly used one being WGS84.
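Going from the geodetic LLA coordinates \((\phi,\lambda,h)\) to the ECEF coordinates \((x,y,z)\) is easy:
\[
\begin{aligned}
\chi &= \sqrt{1 - e^2 \sin^2 \phi} \\
x &= \left(\frac{a}{\chi} + h\right) \cos\phi \cos\lambda \\
y &= \left(\frac{a}{\chi} + h\right) \cos\phi \sin\lambda \\
z &= \left(\frac{a(1 - e^2)}{\chi} + h\right) \sin\phi
\end{aligned}
\]
Here \(a\) and \(e^2\) are constants defined by WGS84.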
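The easy direction can be sketched in a few lines. The block below is a minimal illustration using plain `std::array` rather than any particular linear algebra library; the function name `LLAToECEF` and the constant names are assumptions made for this sketch, with angles in radians and lengths in meters. The numerical values are the WGS84 semi-major axis and flattening.

```cpp
#include <array>
#include <cmath>

// WGS84 ellipsoid constants.
constexpr double kSemiMajorAxis = 6378137.0;         // a, in meters.
constexpr double kFlattening = 1.0 / 298.257223563;  // f.
// First eccentricity squared: e^2 = f (2 - f).
constexpr double kEccentricitySquared = kFlattening * (2.0 - kFlattening);

// Illustrative sketch: converts geodetic (latitude, longitude, height) to
// ECEF (x, y, z). Angles are in radians; height and output are in meters.
std::array<double, 3> LLAToECEF(const std::array<double, 3>& lla) {
  const double phi = lla[0];
  const double lambda = lla[1];
  const double h = lla[2];
  const double sin_phi = std::sin(phi);
  const double chi = std::sqrt(1.0 - kEccentricitySquared * sin_phi * sin_phi);
  const double n = kSemiMajorAxis / chi;  // Prime vertical radius a / chi.
  const double x = (n + h) * std::cos(phi) * std::cos(lambda);
  const double y = (n + h) * std::cos(phi) * std::sin(lambda);
  const double z = (n * (1.0 - kEccentricitySquared) + h) * sin_phi;
  return {x, y, z};
}
```

For instance, a point on the equator at zero longitude and zero height maps to \((a, 0, 0)\).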
Going from ECEF to LLA coordinates requires an iterative algorithm. So to compute the derivative of this transformation we invoke the Inverse Function Theorem as follows:
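#include <Eigen/Dense>

// ECEFToLLA and LLAToECEFJacobian are assumed to be user-supplied helpers.
Eigen::Vector3d ecef;  // Fill in some values.
// Iterative computation of the hard direction.
Eigen::Vector3d lla = ECEFToLLA(ecef);
// Analytic derivative of the easy direction, evaluated at lla.
Eigen::Matrix3d lla_to_ecef_jacobian = LLAToECEFJacobian(lla);
// By the Inverse Function Theorem, the Jacobian of ECEF -> LLA is its inverse.
bool invertible;
Eigen::Matrix3d ecef_to_lla_jacobian;
lla_to_ecef_jacobian.computeInverseWithCheck(ecef_to_lla_jacobian, invertible);
// ecef_to_lla_jacobian is only meaningful if invertible is true.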
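The analytic Jacobian used above is not spelled out in the text. One possible way to write such a helper, obtained by differentiating the standard WGS84 LLA-to-ECEF formulas by hand, is sketched below. It is an illustration rather than the helper referenced in the snippet: the name `LLAToECEFJacobianSketch` is hypothetical, and it returns nested `std::array`s instead of an Eigen matrix to stay dependency-free. Rows are ordered \((x,y,z)\) and columns \((\phi,\lambda,h)\), with angles in radians.

```cpp
#include <array>
#include <cmath>

using Matrix3 = std::array<std::array<double, 3>, 3>;

// WGS84 ellipsoid constants.
constexpr double kSemiMajorAxis = 6378137.0;         // a, in meters.
constexpr double kFlattening = 1.0 / 298.257223563;  // f.
constexpr double kEccentricitySquared = kFlattening * (2.0 - kFlattening);

// Hypothetical sketch of the Jacobian of (phi, lambda, h) -> (x, y, z).
Matrix3 LLAToECEFJacobianSketch(const std::array<double, 3>& lla) {
  const double sp = std::sin(lla[0]), cp = std::cos(lla[0]);
  const double sl = std::sin(lla[1]), cl = std::cos(lla[1]);
  const double h = lla[2];
  const double chi = std::sqrt(1.0 - kEccentricitySquared * sp * sp);
  // Prime vertical radius N = a / chi and meridional radius
  // M = a (1 - e^2) / chi^3.
  const double n = kSemiMajorAxis / chi;
  const double m =
      kSemiMajorAxis * (1.0 - kEccentricitySquared) / (chi * chi * chi);
  Matrix3 jacobian;
  jacobian[0] = {-(m + h) * sp * cl, -(n + h) * cp * sl, cp * cl};
  jacobian[1] = {-(m + h) * sp * sl, (n + h) * cp * cl, cp * sl};
  jacobian[2] = {(m + h) * cp, 0.0, sp};
  return jacobian;
}
```

The three columns are mutually orthogonal, with lengths \(M + h\), \((N + h)|\cos\phi|\) and \(1\), so this matrix becomes singular at the poles where \(\cos\phi = 0\). That is one reason the invertibility check in the snippet above is worth keeping.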
Implicit Function Theorem
Consider now the problem where we have two variables \(x \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\) and a function \(F:\mathbb{R}^m \times \mathbb{R}^n \rightarrow \mathbb{R}^n\) such that \(F(x,y) = 0\), and we wish to calculate the Jacobian of \(y\) with respect to \(x\). How do we do this?
If for a given value of \((x,y)\), the partial Jacobian \(D_2F(x,y)\) is full rank, then the Implicit Function Theorem tells us that there exists a neighborhood of \(x\) and a function \(G\) such that \(y = G(x)\) in this neighborhood. Differentiating \(F(x,G(x)) = 0\) gives us
\[
\begin{aligned}
D_1F(x,y) + D_2F(x,y)\,DG(x) &= 0 \\
DG(x) &= -(D_2F(x,y))^{-1} D_1F(x,y) \\
Dy(x) &= -(D_2F(x,y))^{-1} D_1F(x,y)
\end{aligned}
\]
This means that we can compute the derivative of \(y\) with respect to \(x\) by multiplying the Jacobian of \(F\) w.r.t. \(x\) on the left by the inverse of the Jacobian of \(F\) w.r.t. \(y\), and negating the result.
Let’s consider two examples.
Roots of a Polynomial
The first example we consider is a classic. Let \(p(x) = a_0 + a_1 x + \dots + a_n x^n\) be a degree \(n\) polynomial, and suppose we wish to compute the derivative of its roots with respect to its coefficients. There is no closed-form formula in radicals for the roots of a general polynomial of degree five or higher; this is the Abel-Ruffini theorem, and Galois theory explains why. There are numerical algorithms, like computing the eigenvalues of the companion matrix, but differentiating an eigenvalue solver does not seem like fun. The Implicit Function Theorem offers us a simple path.
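If \(x\) is a root of \(p(x)\), then \(F(\mathbf{a}, x) = a_0 + a_1 x + \dots + a_n x^n = 0\). So, provided \(x\) is a simple root, i.e., \(Dp(x) \neq 0\),
\[
\begin{aligned}
D_1F(\mathbf{a}, x) &= [1, x, x^2, \dots, x^n] \\
D_2F(\mathbf{a}, x) &= \sum_{k=1}^n k a_k x^{k-1} = Dp(x) \\
Dx(\mathbf{a}) &= \frac{-1}{Dp(x)} [1, x, x^2, \dots, x^n]
\end{aligned}
\]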
Differentiating the Solution to an Optimization Problem
Sometimes we are required to solve optimization problems inside optimization problems, and this requires computing the derivative of the optimal solution (or a fixed point) of an optimization problem w.r.t its parameters.
Let \(\theta \in \mathbb{R}^m\) be a vector, let \(A(\theta) \in \mathbb{R}^{k\times n}\) be a matrix whose entries are functions of \(\theta\) with \(k \ge n\), and let \(b \in \mathbb{R}^k\) be a constant vector. Consider the linear least squares problem:
\[
x^*(\theta) = \arg\min_x \|A(\theta)x - b\|_2^2
\]
How do we compute \(D_\theta x^*(\theta)\)?
One approach would be to observe that \(x^*(\theta) = (A^\top(\theta)A(\theta))^{-1}A^\top(\theta)b\) and then differentiate this w.r.t. \(\theta\). But this would require differentiating through the matrix inverse \((A^\top(\theta)A(\theta))^{-1}\). Not exactly easy. Let’s use the Implicit Function Theorem instead.
The first step is to observe that \(x^*\) satisfies the so-called normal equations:
\[
A^\top(\theta)A(\theta)x^* - A^\top(\theta)b = 0
\]
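We will compute \(D_\theta x^*\) column-wise, treating \(A(\theta)\) as a function of one coordinate (\(\theta_i\)) of \(\theta\) at a time. So using the normal equations, let’s define \(F(\theta_i, x^*) = A^\top(\theta_i)A(\theta_i)x^* - A^\top(\theta_i)b = 0\). Using this, and writing \(DA\) for the derivative of \(A\) w.r.t. \(\theta_i\), we can now compute:
\[
\begin{aligned}
D_1F(\theta_i, x^*) &= D[A^\top A]x^* - D[A^\top]b = A^\top DA\,x^* + DA^\top A x^* - DA^\top b \\
D_2F(\theta_i, x^*) &= A^\top A \\
Dx^*(\theta_i) &= -(A^\top A)^{-1} D_1F(\theta_i, x^*)
\end{aligned}
\]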
Observe that to compute \(Dx^*(\theta)\) we only need the inverse of \(A^\top A\), which we needed anyway to compute \(x^*\).