
This blog is the first in a series in which we consider machine learning from four different viewpoints. We either use gradient descent or a fully Bayesian approach, and for each, we can choose to focus on either the network weights or the output function (figure 1). To make these four viewpoints easy to understand, we first consider the linear regression model for which all four viewpoints yield tractable solutions.
This blog concerns gradient descent, for which we can write closed-form expressions for the evolution of the parameters and the function itself. By analyzing these expressions, we can gain insights into the trainability and convergence speed of the model. In parts II and III, we replace the linear model with a neural network. This leads to the development of the neural tangent kernel (NTK), which provides insights into the trainability and convergence speed of neural networks.

Figure 1. Four approaches to model fitting. We can either consider gradient descent (top row) or a Bayesian approach (bottom row). For either, we can consider the parameter space (left column) or the function space (right column). This blog concerns gradient descent (top row) for the linear regression model. Parts II and III of this series concern gradient descent in neural network models, which leads to the Neural Tangent Kernel (NTK). Subsequent parts concern the Bayesian approach (bottom row) for linear regression and neural networks, which leads to Bayesian neural networks (parameter space) and neural network Gaussian processes (function space). Figure inspired by
In part IV and part V of this series, we consider the Bayesian approach to model fitting. We show that for the linear regression model, it is possible to derive a closed-form expression for the posterior probability over the model parameters and marginalize over this distribution to make predictions. Alternatively, we can consider the output function as a Gaussian process and make predictions using the conditional relation for normal distributions. In parts VI and VII, we again replace the linear regression model with a neural network, and this leads to Bayesian neural networks (Bayesian approach with parameters) and neural network Gaussian processes (Bayesian approach with output function).
*The code related to this blog post can be found on GitHub here.
Linear regression model
A linear regression model
where
Such a model can be trained using
where we have substituted in the definition of the linear model in the second line. The factor 1/2 does not change the position of the minimum but simplifies subsequent expressions.
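As a concrete stand-in for the model and loss described above, here is a minimal NumPy sketch of the setup: a line with a y-intercept and a slope, fit by minimizing a least-squares loss that carries the factor of 1/2. The data values and variable names are placeholders of my own and may not match the blog's notation.

```python
import numpy as np

# Placeholder data: three points, in the spirit of the line-fitting example of figure 2.
x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])

def f(x, phi):
    """Linear model: y-intercept phi[0] plus slope phi[1] times the input."""
    return phi[0] + phi[1] * x

def loss(phi, x, y):
    """Least-squares loss; the factor of 1/2 just simplifies later derivatives."""
    return 0.5 * np.sum((y - f(x, phi)) ** 2)

phi_init = np.array([0.0, 0.0])
print(loss(phi_init, x, y))
```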
For convenience, we will store all of the training data vectors
Here, the model

Figure 2. Loss function. a) Consider fitting a line to three points. The line is defined by two parameters: the y-intercept
Gradient descent and gradient flow
In gradient descent, we randomly initialize the parameters of the model to values
More formally, when we perform gradient descent, we apply the update rule:
where
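The standard update rule subtracts the gradient of the loss scaled by a learning rate. Below is a minimal sketch of that loop for the least-squares loss; the row-wise design-matrix convention, learning rate, and step count are my choices rather than the blog's.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)   # one row [1, x_i] per data point (my convention)

def grad_loss(phi):
    """Gradient of 0.5 * ||y - X @ phi||^2 with respect to phi."""
    return -X.T @ (y - X @ phi)

phi = np.array([0.0, 0.0])      # initial parameter values
alpha = 0.05                    # learning rate (my choice)
for _ in range(1000):
    phi = phi - alpha * grad_loss(phi)

print(phi)                      # approaches the least-squares fit
```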
Gradient flow
The above description is how we typically think about gradient descent in machine learning. However, in this blog we consider what happens when we use an infinitesimally small learning rate
and then letting the learning rate
This ordinary differential equation (ODE) is known as gradient flow and tells us how the parameters change over time. We will see that for the linear regression model and for known initial parameters, we can solve this ODE to derive a closed-form expression for the parameter values at any given time during training.
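To make the gradient-flow ODE concrete, the sketch below integrates it numerically for the least-squares problem above. The time horizon and solver are my choices, and the notebook that accompanies the post may do this differently.

```python
import numpy as np
from scipy.integrate import solve_ivp

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)

def flow(t, phi):
    """Gradient flow: d(phi)/dt = -dL/d(phi) for the least-squares loss."""
    return X.T @ (y - X @ phi)

phi0 = np.array([0.0, 0.0])
sol = solve_ivp(flow, t_span=(0.0, 10.0), y0=phi0, rtol=1e-8)
print(sol.y[:, -1])   # parameters after 10 units of "training time"
```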
Gradient flow for least squares loss
For the least squares loss function we can expand the right hand side of equation 6 to get the expression:
where we have used the chain rule on the quadratic loss function between lines two and three.
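If the expanded gradient is, as I read it, minus the transposed design matrix times the residual, it is easy to sanity-check against a finite-difference approximation; the parameter values below are arbitrary.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)

loss = lambda phi: 0.5 * np.sum((y - X @ phi) ** 2)
phi = np.array([0.3, -0.7])

analytic = -X.T @ (y - X @ phi)    # chain rule applied to the quadratic loss
eps = 1e-6
numeric = np.array([(loss(phi + eps * e) - loss(phi - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric))   # True
```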
Evolution of residual
We now consider the residual vector (i.e., the differences between the model predictions
where the first equality sign follows because only the first term
Closed-form solution for function evolution
The ODE in equation 8 describes an exponential decay of the residual
where
Rearranging this equation, we can get a closed-form solution for the evolution of the function
This tells us how the function at the data points evolves as gradient descent proceeds. By analyzing these expressions we can get information about the trainability of the model (i.e., the ability of the model to fit the training data exactly) as well as the convergence speed of the training process (i.e., how fast this happens).
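In the row-wise design-matrix convention used in these sketches, the residual decays as r(t) = exp(-X Xᵀ t) r(0), so the function values at the training points follow f(t) = y + exp(-X Xᵀ t)(f(0) - y); the blog's own notation may differ. The snippet below checks this claim against the numerically integrated gradient flow.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)
K = X @ X.T                                   # 3x3 Gram matrix of the training inputs

phi0 = np.array([0.0, 0.0])
f0 = X @ phi0                                 # initial function values at the data points

sol = solve_ivp(lambda t, p: X.T @ (y - X @ p), (0.0, 2.0), phi0,
                rtol=1e-9, atol=1e-12)
f_numeric = X @ sol.y[:, -1]
f_closed = y + expm(-K * 2.0) @ (f0 - y)      # closed-form exponential decay of the residual
print(np.allclose(f_numeric, f_closed, atol=1e-6))   # True
```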
Trainability and convergence in the linear model
Let’s examine the implications of this expression for the linear model for which:
In this case the derivative does not depend on the current parameters
The evolution of this function is depicted in figure 3.

Figure 3. Function evolution for linear model. a) Consider fitting a line, starting at initial parameters
Trainability: We can draw conclusions about the trainability of the model by examining the exponential term in equation 12.
- If the matrix in the exponential term is full rank, then the exponential term will decay to zero as training time increases, and the function will fit the training data exactly.
- Conversely, if this matrix is not full rank, the function will not fit the training data exactly; the residual errors depend on its null space.
This aligns with our expectations. The data matrix
Convergence speed: Equation 12 can also tell us about the speed of convergence. The time taken to converge depends on (i) the eigenvalues of
These results are perhaps not surprising for the linear regression model, but in part II of this series, we'll apply the same ideas to gain insight into the trainability and convergence of neural networks.
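As a hedged illustration of both points with the toy data used in these sketches: the eigenvalues of the Gram matrix of the training inputs set the decay rates of the residual along its eigendirections, and any zero eigenvalue leaves the corresponding component of the residual untouched. Fitting a line (two parameters) to three points necessarily gives one zero eigenvalue, which is why the line cannot pass through all three points.

```python
import numpy as np

def residual_decay_rates(x):
    """Eigenvalues of X X^T for a 1D line-fitting problem (row-wise design matrix)."""
    X = np.stack([np.ones_like(x), x], axis=1)
    return np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]

# Three points, two parameters: at most two non-zero decay rates, so one
# direction of the residual never decays (the line cannot fit all three points).
print(residual_decay_rates(np.array([-1.0, 0.5, 2.0])))

# More widely spread inputs give larger eigenvalues, hence faster convergence
# along the corresponding directions.
print(residual_decay_rates(np.array([-5.0, 0.0, 5.0])))
```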
Evolution of parameters
Let’s assume that the model is close to linear. In this case, it is well approximated by retaining only the linear term from a Taylor expansion around the initial parameters
We replace the left hand side of equation 10 with this expression to yield:
Moving the first term to the right-hand side, we get:
Finally, we can solve for
where
Parameter evolution for linear model
Let’s examine parameter evolution for the linear model, where:
Here, the linear approximation is certainly valid (since the second derivatives are zero). Substituting these expressions into equation 16 for the overdetermined case, we get:
Figure 4 plots the evolution of the parameters for fitting a line to the three points from figure 2.
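For this overdetermined case, the linear ODE has the textbook solution φ(t) = φ* + exp(-XᵀX t)(φ(0) - φ*), where φ* is the least-squares solution; this is written in my row-wise design-matrix convention and may differ cosmetically from equation 18. The sketch below checks this closed form against the numerically integrated flow.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)

phi0 = np.array([2.0, -1.0])                     # arbitrary initial parameters
phi_star = np.linalg.solve(X.T @ X, X.T @ y)     # least-squares solution

def phi_closed(t):
    """phi(t) = phi* + exp(-X^T X t) (phi(0) - phi*)."""
    return phi_star + expm(-X.T @ X * t) @ (phi0 - phi_star)

sol = solve_ivp(lambda t, p: X.T @ (y - X @ p), (0.0, 1.5), phi0,
                rtol=1e-9, atol=1e-12)
print(np.allclose(sol.y[:, -1], phi_closed(1.5), atol=1e-6))   # True
```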

Figure 4. Parameter evolution for fitting line to three points. a) Equation 18 shows how the parameters evolve on the loss function from their initial values
Comparison to closed-form solution
For the linear model, it is possible to compute the least squares solution for the parameters in closed form (see end of blog for proof) using the relation:
If the closed-form solution for the parameter evolution in equation 18 is correct, then we should retrieve this expression when
where we have separated out the two terms from the inner bracket in line two and simplified the first of these terms in the subsequent lines.
The first term is the same as the closed-form solution, so if the second term becomes zero as
as required.
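The same limit is easy to confirm numerically under the parameterisation used in these sketches: evaluating the closed-form parameter trajectory at a large training time reproduces NumPy's least-squares solution.

```python
import numpy as np
from scipy.linalg import expm

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)

phi0 = np.array([2.0, -1.0])
phi_star = np.linalg.solve(X.T @ X, X.T @ y)
phi_large_t = phi_star + expm(-X.T @ X * 50.0) @ (phi0 - phi_star)   # exponential term ~ 0

phi_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(phi_large_t, phi_lstsq))   # True: the flow ends at the least-squares fit
```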
Evolution of model predictions
Equation 18 provides a closed-form solution for the evolution of the parameters
For the linear model, the prediction for a new point
and when
The evolution of the function for the linear model is illustrated in figure 5.
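As a concrete (and again hedged) version of this: the prediction at a new input is just the line defined by φ(t) evaluated at that input, so it can be tracked over training time using the closed-form trajectory from the previous sketch. The test input and times below are my choices.

```python
import numpy as np
from scipy.linalg import expm

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)

phi0 = np.array([0.0, 0.0])
phi_star = np.linalg.solve(X.T @ X, X.T @ y)

def predict(x_new, t):
    """Evaluate the line defined by phi(t) at a new input x_new."""
    phi_t = phi_star + expm(-X.T @ X * t) @ (phi0 - phi_star)
    return phi_t[0] + phi_t[1] * x_new

for t in [0.0, 0.1, 0.5, 2.0, 10.0]:
    print(t, predict(1.0, t))   # the prediction at x=1 marches towards its final value
```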

Figure 5. Evolution of predictions for fitting line to three points. For any time
Evolution of prediction covariance
Finally, we note that the evolution of the predictions
where we have denoted the two components of the affine transform by the scalar
Now let’s define a normally distributed prior over the initial parameters
with mean zero and spherical covariance
and so can compute the mean and variance of the distribution
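Assuming the prior on the initial parameters is zero-mean Gaussian with spherical covariance, the prediction at a test input is an affine function of those initial parameters, so its mean and variance at every training time follow from the standard rules for affine transformations of Gaussians (see the appendix below). A sketch, with the prior variance and test input chosen by me:

```python
import numpy as np
from scipy.linalg import expm

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)

phi_star = np.linalg.solve(X.T @ X, X.T @ y)
sigma2 = 1.0                      # prior variance of the initial parameters (my choice)
x_test = np.array([1.0, 1.0])     # [1, x_new] for a test input x_new = 1

def prediction_moments(t):
    """Mean and variance of the prediction when phi(0) ~ N(0, sigma2 * I)."""
    E = expm(-X.T @ X * t)                    # phi(t) = E @ phi(0) + (I - E) @ phi*
    a = E @ x_test                            # prediction = a . phi(0) + b (affine in phi(0))
    b = x_test @ (np.eye(2) - E) @ phi_star
    return b, sigma2 * (a @ a)

for t in [0.0, 0.5, 2.0, 10.0]:
    mean, var = prediction_moments(t)
    print(f"t={t:5.1f}  mean={mean:6.3f}  variance={var:.4f}")
```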

Figure 6. Evolution of predictions over time with errors. In this experiment, we assume that the prior variance
*The code related to this blog post can be found on GitHub here.
Discussion
This blog has considered gradient descent training via gradient flow, using linear regression as a working example. The main results are that:
- We can write a closed-form solution for the evolution of the function output at the training data points. By analyzing this solution, we can learn about the ability of the model to fit the dataset perfectly, and the speed of convergence.
- We can write a closed-form solution for the evolution of the parameters over time, which agrees with the least squares solution in the limit of infinite training time.
- We can write a closed-form solution for the evolution of the predictions of the model as a function of time.
- By defining a prior over the parameters, we can also model the evolution of the uncertainty of our predictions.
In the next part of this series, we’ll develop the neural tangent kernel by applying these same ideas to a neural network model.
Least squares solution for linear model
For a linear model, it is possible to find the final parameters in closed form. Writing out the loss function
where we have expanded out terms in the second line. We note that the second and third terms are both scalars and so are equal to their own transposes; these are combined into a single term in the third line.
We can find a closed-form solution for the best parameters by computing the derivative
which can be rearranged to find a solution for the best parameters
This result can only be computed if there are at least as many columns
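Numerically, under my row-wise design-matrix convention: the solution of the normal equations has zero gradient and matches NumPy's built-in least-squares routine, provided XᵀX is invertible.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
y = np.array([ 1.0, 2.0, 4.0])
X = np.stack([np.ones_like(x), x], axis=1)

phi_best = np.linalg.solve(X.T @ X, X.T @ y)           # (X^T X)^{-1} X^T y

# The gradient of 0.5 * ||y - X @ phi||^2 vanishes at this point ...
print(np.allclose(-X.T @ (y - X @ phi_best), 0.0))     # True
# ... and it agrees with NumPy's built-in least-squares solver.
print(np.allclose(phi_best, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
```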
Affine transformations of normal distributions
The multivariate normal distribution is defined as:
To transform the distribution as
We’ll first work with the constant
which is exactly the required constant.
Now we’ll work with the quadratic in the exponential term to show that it corresponds to a normal distribution in
This is clearly the quadratic term from a normal distribution in
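The result being established here (an affine transform Az + b of a Gaussian z ~ N(μ, Σ) is again Gaussian, with mean Aμ + b and covariance AΣAᵀ) is also easy to check by simulation; the particular A, b, μ, and Σ below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

z = rng.multivariate_normal(mu, Sigma, size=200_000)
w = z @ A.T + b                                # apply the affine transform w = A z + b

print(w.mean(axis=0), "vs", A @ mu + b)        # sample mean        ~  A mu + b
print(np.cov(w.T), "\nvs\n", A @ Sigma @ A.T)  # sample covariance  ~  A Sigma A^T
```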
Thanks to Lechao Xiao, who explained the evolution of the parameters to me.