where \(\mathbf{\phi}(\mathbf{x})\) is a set of basis functions with a dummy zero component
\(\phi_0(\mathbf{x}) = 1\). These functions can belong to different families (for example, Gaussian
or sigmoid functions), but in all these cases the model is still called linear regression,
since it is linear in \(\mathbf{w}\). The simplest choice of basis functions is the identity \(\phi_j(\mathbf{x}) = x_j\), which gives \(y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}\).
Assuming Gaussian noise with precision \(\beta\), the likelihood of the observed data is
\begin{equation}
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} N(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}),
\label{likelihood}
\end{equation}
where \(\mathbf{X}\) is the vector of inputs \((\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)\), \(\mathbf{t} = (t_1, t_2, \ldots, t_N)\) is the vector of targets, and \(\beta\) is a precision parameter.
Maximizing the log of the likelihood (\ref{likelihood}) with respect to \(\mathbf{w}\), one can easily obtain an expression for the fitting parameters \(\mathbf{w}\) (the so-called normal equation):
\begin{equation}
\mathbf{w}_{ML} = (\mathbf{\Phi}^T \mathbf{\Phi})^{-1} \mathbf{\Phi}^T \mathbf{t},
\end{equation}
where \(\mathbf{\Phi}\) is the design matrix with elements \(\Phi_{nj} = \phi_j(\mathbf{x}_n)\).
Here we can generate some data and check the model. Let's generate
a set of 10 observations from the linear function \(t = -3x + 2 + N(0, 1/\beta)\) with \(\beta = 1.0\).
We find the fitting parameters using the normal equation.
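A minimal sketch of this step (assuming the original code did something similar; the random seed and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0                                    # noise precision, variance = 1 / beta
N = 10
x = rng.uniform(-1, 1, N)
t = -3 * x + 2 + rng.normal(0, 1 / np.sqrt(beta), N)   # t = -3x + 2 + N(0, 1/beta)

Phi = np.column_stack([np.ones(N), x])        # design matrix with dummy phi_0 = 1
# Normal equation: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)                                   # true values: intercept 2, slope -3
```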
The fitted values differ from the true ones by \(\sim 25\%\), but this is the best one can expect from such a small dataset, so these values can be regarded as a reference for the other models.
One can also apply an iterative approach,
e.g. gradient descent. In this case, we start from an initial guess for the parameters and gradually move closer to the true values. A demonstration of such a procedure is shown in the figure below.
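A minimal sketch of the descent loop (the learning rate and the number of iterations are my assumptions; `Phi` and `t` come from the snippet above):

```python
w = np.zeros(2)                               # zero initial approximation
lr = 0.05                                     # learning rate (assumed)
for _ in range(500):
    grad = -Phi.T @ (t - Phi @ w)             # gradient of 0.5 * ||t - Phi w||^2
    w -= lr * grad
print(w)                                      # approaches the normal-equation result
```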
Gradient descent
Bayesian approach (BA)
According to BA, we treat the regression parameters not as unknown constants,
but as random variables. Therefore we encode our subjective information about
the set of parameters \(\mathbf{w}\) as a prior distribution.
There are many possibilities, but the simplest is to
set this distribution to be a Gaussian,
as it is conjugate to the likelihood
function (\ref{likelihood}).
\begin{equation}
p(\mathbf{w}) = N(\mathbf{w} \mid \mathbf{m_0}, \mathbf{S_0}),
\label{prior_w}
\end{equation}
where \(\mathbf{m_0}\) and \(\mathbf{S_0}\) are chosen arbitrarily as a first approximation.
With such a simple prior one can obtain the corresponding posterior distribution analytically (more details in )
\begin{equation}
p(\mathbf{w} \mid \mathbf{t}) = N(\mathbf{w} \mid \mathbf{m}_1, \mathbf{S}_1).
\label{posterior_w}
\end{equation}
According to Bayes' theorem, the posterior (\ref{posterior_w}) is nothing but the normalized product of the likelihood (\ref{likelihood}) and the prior (\ref{prior_w}); \(\mathbf{m}_1\) is the vector of means and \(\mathbf{S}_1\) is the covariance matrix of \(\mathbf{w}\).
Choosing as a first approximation \(\mathbf{m_0} = (0, 0)\) and \(\mathbf{S_0} = \big(\begin{smallmatrix} 10 & 0\\ 0 & 10 \end{smallmatrix}\big)\), I obtained the following results:
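A sketch of the analytical update (the standard conjugate Gaussian result; `Phi`, `t`, and `beta` are reused from the snippets above):

```python
m0 = np.zeros(2)                              # prior mean
S0 = 10 * np.eye(2)                           # prior covariance

# Conjugate update: S1^{-1} = S0^{-1} + beta * Phi^T Phi,
#                   m1 = S1 (S0^{-1} m0 + beta * Phi^T t)
S1 = np.linalg.inv(np.linalg.inv(S0) + beta * Phi.T @ Phi)
m1 = S1 @ (np.linalg.inv(S0) @ m0 + beta * Phi.T @ t)
print(m1)                                     # posterior mean, shrunk towards m0
print(S1)                                     # posterior covariance
```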
Prior (left), posterior (middle), and the fitted line with the parameters set to the posterior mean.
The values of \(\mathbf{m_1}\) are close to the values of \(\mathbf{w}\) obtained within the frequentist approach (FA), but shifted towards \(\mathbf{m_0}\).
Although it was easy enough to get the posterior distribution in the previous example, this is not always the case. The Metropolis-Hastings method allows one to sample values from complex probability functions. As the posterior is proportional to the product of the likelihood and the prior, it is enough to sample from such a product to find all the necessary distribution parameters.
The first step is to test the Metropolis-Hastings sampler on the same normal prior. We know the analytical result, so we can evaluate the precision of such a model.
The starting point for the sampling is chosen to be \((0,0)\), the step size is \(\delta = 0.5\), and the number of iterations is \(N = 10^4\). The sampling result is shown below.
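A minimal random-walk Metropolis sketch (the target is the unnormalized log posterior; `Phi`, `t`, `beta`, `m0`, and `S0` are reused from above):

```python
from scipy.stats import multivariate_normal

def log_post(w):
    """Unnormalized log posterior = log likelihood + log prior."""
    log_lik = -0.5 * beta * np.sum((t - Phi @ w) ** 2)
    return log_lik + multivariate_normal.logpdf(w, mean=m0, cov=S0)

def metropolis_hastings(log_p, w0, delta=0.5, n_iter=10_000, seed=1):
    """Random-walk Metropolis with a symmetric Gaussian proposal of step delta."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    samples = []
    for _ in range(n_iter):
        w_new = w + delta * rng.normal(size=w.shape)           # propose
        if np.log(rng.uniform()) < log_p(w_new) - log_p(w):    # accept / reject
            w = w_new
        samples.append(w)
    return np.array(samples)

samples = metropolis_hastings(log_post, w0=(0, 0))
print(samples.mean(axis=0))                   # compare with the analytical m1
print(np.cov(samples.T))                      # compare with the analytical S1
```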
Metropolis-Hastings sampling
Now we can compute the mean of the samples and their covariance matrix.
These values are quite close to the analytical ones (\ref{m1_res}) and (\ref{S1_res}), so the method approximates the posterior probability distribution well.
We can see that the numerical approach works, so we can define the prior in a more complicated way and estimate the regression parameters numerically.
I define the prior distribution as a combination of two normal distributions and apply it to the same dataset.
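A sketch of such a prior (the mixture weights, centers, and covariances here are illustrative assumptions, not the exact values from the original post):

```python
def log_mix_prior(w):
    """Log density of an equal-weight mixture of two Gaussians."""
    p = 0.5 * multivariate_normal.pdf(w, mean=[-1.0, -1.0], cov=10 * np.eye(2)) \
      + 0.5 * multivariate_normal.pdf(w, mean=[1.0, 1.0], cov=10 * np.eye(2))
    return np.log(p)

def log_post_mix(w):
    """Unnormalized log posterior with the mixture prior."""
    return -0.5 * beta * np.sum((t - Phi @ w) ** 2) + log_mix_prior(w)
```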
Now I can generate samples using the same numerical algorithm.
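For instance, reusing the `metropolis_hastings` function from the sketch above:

```python
samples = metropolis_hastings(log_post_mix, w0=(0, 0))
print(samples.mean(axis=0))                   # compare with the analytical m1
print(np.cov(samples.T))                      # compare with the analytical S1
```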
These values are still quite close to the analytical results (\ref{m1_res}) and (\ref{S1_res}). The prior I have chosen is very unrealistic, so, in fact, it only complicates the procedure. Nevertheless, the sampling method works well and makes it possible to recover the distribution parameters.
In BA we regard the parameters as random variables, so we are not satisfied with point estimates of the target variable \(t\) alone; we would like to estimate a credible interval for our prediction as well. For the conjugate Gaussian prior it is also possible to obtain an analytical solution. Still, numerical methods can be used for the predictive distribution.
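For example, a minimal numerical sketch (the test input `x_star` and the 95% level are my choices): push each posterior sample of \(\mathbf{w}\) through the model and add observation noise:

```python
x_star = 0.5                                  # a new input point (assumed)
phi_star = np.array([1.0, x_star])            # phi(x*) with the dummy component
rng = np.random.default_rng(2)
# Predictive samples: t* = w^T phi(x*) + noise, one draw per posterior sample
t_star = samples @ phi_star + rng.normal(0, 1 / np.sqrt(beta), len(samples))
lo, hi = np.percentile(t_star, [2.5, 97.5])   # 95% credible interval
print(t_star.mean(), (lo, hi))
```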