
This blog is the fourth in a series in which we consider machine learning from four different viewpoints: we either use gradient descent or a fully Bayesian approach, and for each, we can choose to focus on either the model parameters or the output function (figure 1). In part I, we considered gradient descent with infinitesimal steps (gradient flow) for linear models with a least squares loss. This provided insights into how both the parameters and the output function evolve over time during training. In parts II and III, we explored gradient flow in deep neural networks, which led to the development of the neural tangent kernel.
The following articles in this series follow a similar structure. In this blog (part IV), we investigate Bayesian methods for linear models with a least squares loss. We first consider the parameter space perspective. Here, we no longer aim to find the best set of parameters, but instead compute a probability distribution over possible parameter values and take into account all of these possible values when we make predictions. In part V, we then consider the output function perspective. Here, we treat the training and test points as jointly distributed as a Gaussian (i.e., a Gaussian process) and exploit this to make new predictions. In parts VI and VII, we will apply these ideas to deep neural networks, leading to Bayesian neural networks (parameter space) and neural network Gaussian processes (function space).
This article is accompanied by example Python code.

Figure 1. Four approaches to model fitting. We can either consider gradient descent (top row) or the Bayesian approach (bottom row). For either, we can consider either parameter space (left column) or the function space (right column). This blog concerns the Bayesian approach in parameter space (lower-left panel). Part V of this series will consider the Bayesian approach in function space (lower-right panel). Figure inspired by Sohl-Dickstein (2021).
Parameter space
In the maximum likelihood approach, we establish the best model parameters by finding the minimum of the loss function using gradient descent (or a closed-form solution if available). To motivate Bayesian methods, we draw attention to three potential disadvantages of this scheme. First, we do not take account of any prior information we have about the parameter values. Second, the final choice of parameters may be somewhat arbitrary; there may be many parameter values that fit the data reasonably well, but which extrapolate from that data quite differently (figure 2). Third, we do not get an estimate of uncertainty; intuitively, we expect our predictions to be less certain when they are far from the training points.

Figure 2. Drawback of maximum likelihood fitting. a) This line is a reasonable fit to the data. b) However, this line with a different slope and intercept is also a plausible description, as is c) this third line. The observed noisy data does not really allow us to distinguish which of these models is correct. It is hence worrying that the predictions of the three models differ considerably when they extrapolate away from the observed data.
Overview of the Bayesian approach
The Bayesian approach rejects the idea that we should find a single set of parameters. Instead, it treats all parameter values as possible, while acknowledging that some are more probable than others depending on how compatible they are with the noisy data. Rather than finding a single set of parameter values, the Bayesian training process finds a probability distribution over possible parameters.
This complicates inference (predicting the model output for a new input). We cannot just run the model straightforwardly as there is no single set of parameters. Instead, we make separate predictions for each possible set of parameter values and take a weighted average of these predictions, where the weight is given by the probability of those parameters as established from the training data.
From maximum likelihood to the Bayesian approach
In this section, we make these ideas concrete for an arbitrary model that maps an input to an output using a set of parameters, where we assume that the observed outputs are corrupted by normally distributed noise.
For different types of problems like classification or ranking, the noise model would differ.
Maximum likelihood approach
As the name suggests, the maximum likelihood approach finds the parameters that maximize the likelihood of the observed training data.
To perform inference for a new input, we simply run the model with the maximum likelihood parameters and take the output as the prediction.
Maximum a posteriori approach
We can easily extend this model to incorporate prior information about the parameters; rather than maximizing the likelihood alone, we find the parameters that maximize the product of the likelihood and a prior distribution over the parameters.
This is known as the maximum a posteriori or MAP approach.
Bayesian approach
In the Bayesian approach, we apply Bayes’ rule to find a distribution over the parameters:
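In a sketch of the standard form (assuming the notation $\boldsymbol\phi$ for the parameters and $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$ for the training pairs, which may differ from the notation used elsewhere), this reads:

$$\Pr(\boldsymbol\phi \mid \{\mathbf{x}_i, y_i\}) \;=\; \frac{\Pr(\{y_i\} \mid \{\mathbf{x}_i\}, \boldsymbol\phi)\,\Pr(\boldsymbol\phi)}{\Pr(\{y_i\} \mid \{\mathbf{x}_i\})}.$$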
The left-hand side is referred to as the posterior probability of the parameters, which takes into account both the likelihood of the data under those parameters and the prior probability of the parameters. The denominator is a normalizing constant that ensures the posterior is a valid probability distribution.
As an aside, note that it is now easy to understand the term maximum a posteriori inference. Equation 3 maximizes the numerator of the right-hand side of equation 4 with respect to the parameters; since the denominator does not depend on the parameters, this is equivalent to maximizing the posterior probability itself.
Returning to the Bayesian approach, we are now faced with the problem of how to perform inference (compute the estimate of the output for a new input) when there is no longer a single set of parameters. The solution is to average the predictions over all possible parameter values, weighting each prediction by the posterior probability of the parameters that produced it.
This can be interpreted as weighting together the predictions of an infinitely large ensemble of models (via the integral), where there is one model for each possible setting of the parameters, and the weight of each model is the posterior probability of that setting.
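As a sketch in the same assumed notation, writing $\mathbf{x}^{*}$ for the new input and $y^{*}$ for its prediction, this weighted average is the marginalization:

$$\Pr(y^{*} \mid \mathbf{x}^{*}, \{\mathbf{x}_i, y_i\}) \;=\; \int \Pr(y^{*} \mid \mathbf{x}^{*}, \boldsymbol\phi)\,\Pr(\boldsymbol\phi \mid \{\mathbf{x}_i, y_i\})\, d\boldsymbol\phi.$$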
Example: linear model
Let’s put these ideas into practice using a linear regression model, in which the prediction is a weighted sum of the elements of the input, and the weights (and any offset) form the parameters of the model.
In a regression model, we assume that the noise is normally distributed around a mean predicted by the model, so that for each training data pair the likelihood of the output is a normal distribution whose mean is the model's prediction and whose variance describes the level of the observation noise.
where in the second line we have combined all of the individual likelihood terms for the training pairs into a single multivariate normal distribution over the vector of training outputs.
Maximum likelihood
The maximum likelihood parameters are those that maximize this likelihood, which is equivalent to minimizing a least squares criterion:
where we have converted to a minimization problem by multiplying by minus one, taken the log (which is a monotonic transformation and hence does not affect the position of the minimum), and removed extraneous terms. In principle, we could minimize this least squares criterion by gradient descent, but in practice, it is possible to take the derivatives of the argument of the minimization, set them to zero, and solve to find a closed-form solution (figure 3):
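The following is a minimal sketch of this closed-form solution, assuming a data matrix `X` whose columns are the training inputs and a vector `y` of training outputs (the variable names here are illustrative, not taken from the accompanying code):

```python
import numpy as np

def maximum_likelihood_parameters(X, y):
    """Closed-form least squares solution for a linear model.

    X: (D, N) array whose columns are the training inputs.
    y: (N,) array of training outputs.
    Returns the (D,) maximum likelihood parameter vector.
    """
    # Solve (X X^T) phi = X y rather than forming an explicit inverse.
    return np.linalg.solve(X @ X.T, X @ y)
```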

Figure 3. Maximum likelihood solution for linear regression model.
MAP solution
To find the MAP parameters, we need to define a prior. A sensible choice is to assume that the parameters are distributed as a spherical normal distribution with mean zero and the same prior variance for every parameter:
By a similar logic to that for the ML solution, the MAP solution is:
and again, this has a closed form solution (figure 4c):
which is distinct from the maximum likelihood solution as long as the prior variance is finite; as the prior variance grows, the influence of the prior diminishes and the MAP solution approaches the maximum likelihood solution.
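A corresponding sketch of the MAP estimate under the zero-mean spherical prior, assuming illustrative names `sigma_sq` for the noise variance and `sigma_p_sq` for the prior variance:

```python
import numpy as np

def map_parameters(X, y, sigma_sq, sigma_p_sq):
    """MAP (ridge-regularized) solution for the linear model.

    X: (D, N) inputs as columns, y: (N,) outputs.
    sigma_sq: observation noise variance; sigma_p_sq: prior variance.
    """
    D = X.shape[0]
    # The prior contributes a scaled identity matrix to the X X^T term.
    A = X @ X.T + (sigma_sq / sigma_p_sq) * np.eye(D)
    return np.linalg.solve(A, X @ y)
```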

Figure 4. Maximum a posteriori (MAP) solution for linear regression model. a) Prior distribution over the parameters.
Bayesian approach
In the Bayesian approach we compute a probability distribution over possible values of the parameters using Bayes' rule:
where, as before, the likelihood and prior terms are given by:
The posterior distribution can be computed in closed form using the following relations (proved at the end of this document) in turn on the numerator:
where we have created new constants that do not depend on the parameters.
where the constants from the previous stage have necessarily canceled out with the denominator so that the result is a valid distribution. The posterior distribution over the parameters is hence itself a normal distribution (figure 5).
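A minimal NumPy sketch of this closed-form posterior, under the same assumptions as the earlier snippets (normal likelihood with variance `sigma_sq`, zero-mean spherical prior with variance `sigma_p_sq`; names are illustrative):

```python
import numpy as np

def posterior_over_parameters(X, y, sigma_sq, sigma_p_sq):
    """Normal posterior over the parameters of the linear model.

    X: (D, N) inputs as columns, y: (N,) outputs.
    Returns the posterior mean (D,) and covariance (D, D).
    """
    D = X.shape[0]
    # The posterior precision combines the likelihood and prior precisions.
    A = X @ X.T / sigma_sq + np.eye(D) / sigma_p_sq
    Sigma = np.linalg.inv(A)           # posterior covariance
    mu = Sigma @ X @ y / sigma_sq      # posterior mean
    return mu, Sigma
```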

Figure 5. Posterior distribution over parameters. a) The posterior distribution over the parameters.
We now compute the predictive distribution over the prediction for a new test input by taking a weighted average of the predictions over all possible parameter values:
where the first term on the right-hand side is given by:
and the second term by equation 17.
The integral can be performed by applying the first relations in equations 1.16, which yield a normal distribution in the parameters; this integrates to one, leaving a normal distribution over the prediction itself.
This Bayesian formulation of linear regression (figure 6) is less confident about its predictions, and the confidence decreases as the test data moves further from the training data.
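A sketch of the resulting predictive distribution at a new input `x_star`, reusing the posterior computed above; the `include_noise` flag adds the observation noise term to the predictive variance (corresponding to the added noise term mentioned in figure 6):

```python
import numpy as np

def predictive_distribution(x_star, X, y, sigma_sq, sigma_p_sq, include_noise=True):
    """Mean and variance of the Bayesian predictive distribution at x_star."""
    D = X.shape[0]
    A = X @ X.T / sigma_sq + np.eye(D) / sigma_p_sq
    Sigma = np.linalg.inv(A)
    mu = Sigma @ X @ y / sigma_sq
    mean = x_star @ mu                     # predictive mean
    var = x_star @ Sigma @ x_star          # uncertainty from the parameters
    if include_noise:
        var = var + sigma_sq               # plus the observation noise
    return mean, var
```

Note that the parameter-uncertainty term `x_star @ Sigma @ x_star` depends on the test input, which is what allows the predictive uncertainty to grow away from the training data.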

Figure 6. Predictive distribution for Bayesian linear regression. a) Predictive distribution without added noise term (i.e., without the last term in the variance in equation 1.20). Cyan line represents the mean of the distribution. Shaded area shows the uncertainty of the predictive distribution.
Practical concerns
To implement this model we must compute the inverse of a matrix whose size is determined by the dimensionality of the data; this becomes expensive when the data are high-dimensional.
Fortunately, this term can be inverted far more efficiently when the number of data points is smaller than the dimensionality of the data, using the following two matrix inversion relations:
which are valid whenever the inverses on both sides exist,
where we have explicitly noted the dimensionality of each of the identity matrices with a subscript.
Substituting these two inversion results back into equation 20, we get a new formula for the predictive distribution:
It is notable that only inner products of the data vectors (e.g., in the terms involving products of the training and test inputs) appear in this formula; we will return to this observation when we consider the function space view in part V.
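A sketch that contrasts the two formulations numerically, under the same assumptions as the earlier snippets: the first version inverts a matrix whose size matches the data dimensionality, while the second uses the inversion relations and needs only an inverse whose size matches the number of data points, plus inner products of data vectors (this is a numerical check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 50, 10                        # more dimensions than data points
X = rng.standard_normal((D, N))      # training inputs as columns
y = rng.standard_normal(N)           # training outputs
x_star = rng.standard_normal(D)      # test input
sigma_sq, sigma_p_sq = 0.1, 1.0      # noise and prior variances

# Original form: requires a D x D inverse.
A = X @ X.T / sigma_sq + np.eye(D) / sigma_p_sq
Sigma = np.linalg.inv(A)
mean_1 = x_star @ Sigma @ X @ y / sigma_sq
var_1 = x_star @ Sigma @ x_star + sigma_sq

# Rewritten form: only an N x N inverse and inner products of data vectors.
K = X.T @ X                          # inner products between training inputs
k_star = X.T @ x_star                # inner products with the test input
M_inv = np.linalg.inv(sigma_p_sq * K + sigma_sq * np.eye(N))
mean_2 = sigma_p_sq * (k_star @ M_inv @ y)
var_2 = sigma_p_sq * (x_star @ x_star) - sigma_p_sq**2 * (k_star @ M_inv @ k_star) + sigma_sq

print(np.allclose(mean_1, mean_2), np.allclose(var_1, var_2))  # True True
```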
Conclusion
This blog has discussed the Bayesian approach to machine learning from a parameter viewpoint. Instead of choosing a single set of 'best' parameters, we acknowledge that there are many choices that are compatible with the data. To make predictions, we take a weighted average of the predictions from all possible parameter choices, where the weight depends on how compatible each choice is with the data. We work through the example of linear regression, for which a closed-form solution is available.
These ideas can be easily extended to classification tasks (see Prince, 2012) and to certain types of non-linear model (see part V of this series). However, as we shall see in subsequent articles, they struggle with more complex models such as neural networks.
The next article in this series (part V) also discusses the Bayesian viewpoint, but this time considers it from the Gaussian process perspective, which focuses on the output function. We define a prior directly on the output function, which defines a joint distribution between data points. We perform inference by finding the conditional distribution of the function at unknown points given the observed values of the function at the known points. For the linear regression model, this yields exactly the same predictive distribution as the methods in this blog. However, as we shall see in future articles, it has quite different implications for more complex models like deep neural networks.
Change of variables relation
The change of variables relation (figure 7) for multivariate normals states that:
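In assumed notation (with the constant of proportionality independent of $\mathbf{a}$, and requiring $\mathbf{A}^{T}\boldsymbol\Sigma^{-1}\mathbf{A}$ to be invertible), a sketch of the standard statement is:

$$\mbox{Norm}_{\mathbf{x}}\!\left[\mathbf{A}\mathbf{a}, \boldsymbol\Sigma\right] \;\propto\; \mbox{Norm}_{\mathbf{a}}\!\left[(\mathbf{A}^{T}\boldsymbol\Sigma^{-1}\mathbf{A})^{-1}\mathbf{A}^{T}\boldsymbol\Sigma^{-1}\mathbf{x},\; (\mathbf{A}^{T}\boldsymbol\Sigma^{-1}\mathbf{A})^{-1}\right].$$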
Proof:
where we have expanded the quadratic term and absorbed the term that does not depend on the new variable into the constant of proportionality.
as required.

Figure 7. Change of variables relation. A normal distribution in one variable whose mean is a linear function of a second variable is proportional to a normal distribution in that second variable.
Product of two Gaussians
The product of two Gaussians is another Gaussian:
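In assumed notation, with $\kappa$ a constant that does not depend on $\mathbf{x}$, a sketch of the standard form is:

$$\mbox{Norm}_{\mathbf{x}}[\mathbf{a}, \mathbf{A}]\,\mbox{Norm}_{\mathbf{x}}[\mathbf{b}, \mathbf{B}] \;=\; \kappa\cdot\mbox{Norm}_{\mathbf{x}}\!\left[(\mathbf{A}^{-1}+\mathbf{B}^{-1})^{-1}(\mathbf{A}^{-1}\mathbf{a}+\mathbf{B}^{-1}\mathbf{b}),\;(\mathbf{A}^{-1}+\mathbf{B}^{-1})^{-1}\right].$$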
Proof: We first write out the expression for the product of the normals:
where we have multiplied out the terms in the exponential and subsumed the terms that are constant in the variable of interest into a new overall constant.
Now we complete the square by multiplying and dividing by a new term, creating a new constant
Finally, we re-factor the exponential term into the form of a standard normal distribution, and divide and multiply by the associated constant term for this new normal distribution, creating a new constant
Constant of proportionality
The constant of proportionality from the product of two Gaussians is itself a normal distribution: it can be expressed as a normal in one of the two means, centered on the other mean, with a covariance that is the sum of the two covariances.
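In the assumed notation of the previous sketch, this constant takes the standard form:

$$\kappa \;=\; \mbox{Norm}_{\mathbf{a}}\!\left[\mathbf{b},\; \mathbf{A}+\mathbf{B}\right].$$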
Proof: We will compute each of the constants in the previous section in turn.
Working first with the non-exponential terms, we have
which is the appropriate constant for the normal distribution. Returning to the exponential term, we have:
Now we use the identities
and
Substituting these into the main equation we get:
Putting this together with the constant term we have:
as required.
Inversion relation #1
Consider the following identity:
Proof:
Premultiplying both sides by
as required.
Inversion relation #2: Sherman-Morrison-Woodbury
Consider the following identity:
This is sometimes known as the matrix inversion lemma.
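In assumed notation, a sketch of the general form of the lemma (valid whenever the stated inverses exist) is:

$$\left(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V}\right)^{-1} \;=\; \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\left(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}\mathbf{A}^{-1}.$$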
Proof:
Now, applying inversion relation #1 to the term in brackets:
as required.