Maximum Likelihood Estimation: A Brief Introduction with Example


In this post I will briefly explain what maximum likelihood estimation (MLE) is and give an example application of it to linear regression. We will also explore MLE’s relationship with Bayesian inference, i.e. how it can be explained from a Bayesian perspective.

What is MLE?

Situations often arise in uncertainty modelling where we need to estimate the parameters \(\theta\) of a probability distribution. This distribution expresses an assumed statistical model, e.g. how types of jobs are distributed across a country’s population. The idea is that we can estimate the entire population’s distribution from a smaller sample of that population.

MLE is one of these estimation methods. It works by maximising a likelihood function, so that the observed data is most probable under our model. In other words, MLE looks for the parameters under which the hypothesis (our model) is most likely to have produced the observed sample data.


Lili
Ok...but what exactly is a likelihood function? How can you define a model's closeness in generating "realistic" data?

Likelihood Function

Let’s consider a sample dataset of size n. These samples are independently generated by our real distribution \(P_{data}(x)\). The likelihood function can then be written intuitively as:

\[L\left( x_{1},x_{2},...x_{n}|\theta \right) =\prod^{n}_{i=1} P\left( x_{i}|\theta \right)\]

This is essentially the joint probability of the sample data from our model. It could be understood as “the probability of observing such particular samples from the model”. Therefore, the likelihood function describes how well our model could produce the sample dataset. The higher the likelihood, the better our model is able to reproduce the sample data from the real population.

However, the chain product here is cumbersome for computation. Instead, we usually take the log of our likelihood function:

\[l\left( x_{1},x_{2},...x_{n}|\theta \right) =log\left( \prod^{n}_{i=1} P\left( x_{i}|\theta \right) \right) =\sum^{n}_{i=1} log P\left( x_{i}|\theta \right)\]

This is called the log likelihood function in many contexts.
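To make the log likelihood concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the sample and its parameters are hypothetical choices for illustration) that evaluates the log likelihood of a sample under two candidate parameter settings. The setting closer to the data-generating distribution yields the higher log likelihood.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical sample, drawn from a Gaussian with mean 2 and std 1.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=100)

    def log_likelihood(sample, mu, sigma):
        # l(x|theta) = sum_i log P(x_i|theta)
        return norm.logpdf(sample, loc=mu, scale=sigma).sum()

    print(log_likelihood(x, mu=2.0, sigma=1.0))  # near the true parameters: higher
    print(log_likelihood(x, mu=0.0, sigma=1.0))  # far from them: lower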


The objective is quite clear here: we would like to maximise our likelihood function. Specifically, we would like to find the \(\theta\) that achieves this maximisation. We can then write the principle of MLE:

\[\theta =\text{arg max}_{\theta } l\left( x|\theta \right) =\text{arg max}_{\theta } \sum^{n}_{i=1} log P_{model}\left( x_{i}|\theta \right)\]

Finally, since the arg max is invariant to scaling by a positive constant, we can divide both sides by the sample size n and obtain the principle in terms of an expectation over the data distribution1:

\[\theta = \text{arg max}_{\theta } \mathbb{E}_{x\sim P_{data}} logP_{model}\left( x|\theta \right)\]

If \(l\left( x\vert\theta \right)\) is differentiable in \(\theta\), we can obtain \(\theta\) by solving \(\frac{\partial l}{\partial \theta } =0\) (and checking that the stationary point is indeed a maximum). At this point, we have successfully retrieved the estimated parameters \(\theta\) from MLE.

In optimisation problems, it is common practice to minimise a cost function rather than maximise an objective. Therefore, we usually write the objective function of MLE as the negative log likelihood:

\[\theta =\text{arg min}_{\theta } -\sum^{n}_{i=1} log P_{model}\left( x_{i}|\theta \right)\]
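As an illustration of this objective, here is a minimal sketch (again a hypothetical Gaussian model, assuming SciPy’s optimiser) that recovers the mean and standard deviation of a sample by numerically minimising the negative log likelihood instead of solving \(\frac{\partial l}{\partial \theta } =0\) by hand:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=1.5, size=500)  # hypothetical sample

    def neg_log_likelihood(theta, sample):
        mu, log_sigma = theta               # optimise log(sigma) so sigma stays positive
        sigma = np.exp(log_sigma)
        return -norm.logpdf(sample, loc=mu, scale=sigma).sum()

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,))
    mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
    print(mu_hat, sigma_hat)                # should be close to 2.0 and 1.5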

Let’s solidify this method with an example next, inspired by the lecture notes from Prof. Shalizi at CMU.2

Example: Gaussian Noise Linear Regression with MLE

Linear regression is a common estimator used in statistics. The idea is to model the relationship between an input and an output with a \(p\)-vector of parameters \(\beta\), called the coefficients. The model output \(y\) is the observed value and \(x\) is the independent input variable. The model can be formally expressed as:

\[y=x^{T}\beta +\epsilon =\beta_{0} +\beta_{1} x_{1}+...+\beta_{p} x_{p}+\epsilon\]

With this model definition, we could predict any \(y\) given a particular \(x\). Specifically, \(\beta_{0}\) is called the bias or intercept of the model. The \(\epsilon\) is the error term that accounts for any noise that influences the output \(y\), but is independent of x.

For our example (Gaussian Noise Linear Regression), we make the following assumptions:

  • The distribution of \(X\) is arbitrary.
  • For any \(y_{i}\), we have \(y_{i}=\beta_{0} +\beta_{1} x_{i}+\epsilon\).
  • The noise \(\epsilon\) follows a zero-mean Gaussian distribution, i.e. \(\epsilon \sim \mathcal{N} \left( 0,\sigma^{2} \right)\).
  • \(\epsilon\) is independent across observations.

Note that the \(\epsilon\) here plays a different role from the additive noise term in the generic linear regression model above, albeit serving the same purpose as noise. Here \(\epsilon\) is not directly added to the equation; what we actually mean is:

\[P\left( y|x\right) =\mathcal{N} \left( \beta_{0} +\beta_{1} x,\sigma^{2} \right)\]

Where \(\beta_{0} +\beta_{1} x\) serves as the mean of our Gaussian.

This ensures that the \(y\) values are independent across observations (given their corresponding \(x\)), and it also allows us to express the densities explicitly via the Gaussian noise. Let’s write the PDF for \(y\) now:

\[P\left( y_{i}|x_{i};\beta_{0} ,\beta_{1} ,\sigma^{2} \right) =\frac{1}{\sqrt{2\pi \sigma^{2} } } e^{-\frac{\left( y_{i}-\left( \beta_{0} +\beta_{1} x_{i}\right) \right)^{2} }{2\sigma^{2} } }\]

The negative log likelihood of our model naturally follows:

\[\begin{gathered}-l\left( y|x;\beta^{\ast }_{0} ,\beta^{\ast }_{1} ,\sigma^{\ast 2} \right) =\sum^{n}_{i=1} -log\left[ \frac{1}{\sqrt{2\pi \sigma^{\ast 2} } } e^{-\frac{\left( y_{i}-\left( \beta^{\ast }_{0} +\beta^{\ast }_{1} x_{i}\right) \right)^{2} }{2\sigma^{\ast 2} } }\right] \\ =nlog\sigma^{\ast } +\frac{n}{2} log\left( 2\pi \right) +\frac{1}{2\sigma^{\ast 2} } \sum^{n}_{i=1} \left\Vert y_{i}-\left( \beta^{\ast }_{0} +\beta^{\ast }_{1} x_{i}\right) \right\Vert^{2} \end{gathered}\]

where \(\beta^{\ast }_{0} ,\beta^{\ast }_{1} ,\sigma^{\ast 2}\) are estimated parameter values.

The objective function is then:

\[\hat{\theta } =\text{arg min}_{\theta } -l\left( y|\theta \right) =\left\{ \hat{\beta }_{0} ,\hat{\beta }_{1} ,\hat{\sigma }^{2} \right\}\]

Let’s try to compute \(\theta\) by taking first-order partial derivatives:


Get \(\hat{\beta }_{0}\):

\[\begin{gathered}\frac{\partial \left( -l\right) }{\partial \beta^{\ast }_{0} } =\frac{1}{2\sigma^{\ast 2} } \sum^{n}_{i=1} -2\left( y_{i}-\left( \beta^{\ast }_{0} +\beta^{\ast }_{1} x_{i}\right) \right) =0\\ \Rightarrow \sum^{n}_{i=1} y_{i}-\left( \beta^{\ast }_{0} +\beta^{\ast }_{1} x_{i}\right) =0\\ \Rightarrow \hat{\beta }_{0} =\bar{y} -\hat{\beta }_{1} \bar{x} \end{gathered}\]

Get \(\hat{\beta }_{1}\):

\[\begin{gathered}\frac{\partial \left( -l\right) }{\partial \beta^{\ast }_{1} } =\frac{1}{2\sigma^{\ast 2} } \sum^{n}_{i=1} 2\left( y_{i}-\left( \beta^{\ast }_{0} +\beta^{\ast }_{1} x_{i}\right) \right) \left( -x_{i}\right) =0\\ \Rightarrow \sum^{n}_{i=1} x_{i}y_{i}-\left( \beta^{\ast }_{0} x_{i}+\beta^{\ast }_{1} x^{2}_{i}\right) =0\\ \Rightarrow \hat{\beta }_{1} =\frac{\sum^{n}_{i=1} \left( x_{i}-\bar{x} \right) \left( y_{i}-\bar{y} \right) }{\sum^{n}_{i=1} \left( x_{i}-\bar{x} \right)^{2} } =\frac{\text{Cov} \left( x,y\right) }{\text{Var} \left( x\right) } \end{gathered}\]

where the last step substitutes \(\hat{\beta }_{0} =\bar{y} -\hat{\beta }_{1} \bar{x}\) from above.

Get \(\hat{\sigma }^{2}\):

\[\begin{gathered}\frac{\partial \left( -l\right) }{\partial \sigma^{\ast } } =\frac{n}{\sigma^{\ast } } -\frac{1}{\sigma^{\ast 3} } \sum^{n}_{i=1} \left( y_{i}-\left( \beta^{\ast }_{0} +\beta^{\ast }_{1} x_{i}\right) \right)^{2} =0\\ \Rightarrow \hat{\sigma }^{2} =\frac{1}{n} \sum^{n}_{i=1} \left( y_{i}-\left( \hat{\beta }_{0} +\hat{\beta }_{1} x_{i}\right) \right)^{2} \end{gathered}\]

We have retrieved a closed-form solution of all the parameters in a Gaussian noise linear regression by MLE.

Dr. Ziegler
Given a supervised sample dataset X with labels Y, we can now fit the estimator using this closed-form solution!
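To make this concrete, here is a minimal sketch (assuming NumPy; the “true” parameter values and the distribution of X are hypothetical choices for illustration) that simulates data under the assumptions above and recovers \(\hat{\beta }_{0} ,\hat{\beta }_{1} ,\hat{\sigma }^{2}\) with the closed-form formulas we just derived:

    import numpy as np

    rng = np.random.default_rng(42)

    # Simulate data under the assumptions: y_i = beta_0 + beta_1 * x_i + eps_i,
    # with eps_i ~ N(0, sigma^2) independent across observations.
    n = 1000
    beta_0_true, beta_1_true, sigma_true = 1.0, 2.5, 0.5   # hypothetical values
    x = rng.uniform(-3, 3, size=n)                         # arbitrary distribution of X
    y = beta_0_true + beta_1_true * x + rng.normal(0.0, sigma_true, size=n)

    # Closed-form MLE estimates derived above.
    x_bar, y_bar = x.mean(), y.mean()
    beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta_0_hat = y_bar - beta_1_hat * x_bar
    residuals = y - (beta_0_hat + beta_1_hat * x)
    sigma2_hat = np.mean(residuals ** 2)    # note: the MLE divides by n, not n - 2

    print(beta_0_hat, beta_1_hat, sigma2_hat)  # close to 1.0, 2.5, 0.25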

Relation to Mean Square Error

Another common way of building a linear regressor is to optimise on the minimisation of mean square error (MSE). The objective of MSE is given as:

\[\text{arg min}_{\theta } \frac{1}{n} \sum^{n}_{i=1} \left\Vert y_{i}-\left( \hat{\beta_{0} } +\hat{\beta_{1} } x_{i}\right) \right\Vert^{2}\]

This is exactly the same objective we used to retrieve \(\hat{\beta }_{0} ,\hat{\beta }_{1}\) in the MLE linear regression above (the constant terms and the \(\frac{1}{2\sigma^{2} }\) scaling do not affect the arg min). This implies that when we optimise \(\beta\) with MLE under Gaussian noise, we are essentially performing a least-squares optimisation.
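As a quick sanity check of this equivalence, the sketch below (reusing x, y and the MLE estimates from the previous snippet) fits the same line by ordinary least squares via NumPy’s lstsq; the coefficients should match the MLE formulas up to numerical precision.

    # Ordinary least squares on the same data, using the design matrix [1, x_i].
    A = np.column_stack([np.ones_like(x), x])
    (beta_0_ls, beta_1_ls), *_ = np.linalg.lstsq(A, y, rcond=None)

    print(beta_0_ls - beta_0_hat, beta_1_ls - beta_1_hat)  # both ~0: same estimates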

Barbarossa
Fun fact: this is no coincidence! Gauss was actually thinking about “what noise should be used so that least squares can be explained as MLE” when he built the Gaussian model.

Relation to Bayesian Inference (MAP)

MLE is usually regarded as a frequentist inference tool, which is natural since it treats \(\theta\) as a single fixed value estimated from sample data. However, we can also explain MLE’s logic from a Bayesian perspective.

Bayes’ theorem is given as:

\[P\left( \theta |x\right) =\frac{P\left( x|\theta \right) P\left( \theta \right) }{P\left( x\right) }\]

The maximum a posteriori (MAP) estimate aims to find the \(\theta\) that maximises the posterior distribution \(P\left( \theta \vert x\right)\). Suppose we assume the prior \(P\left( \theta \right)\) to be a uniform distribution. Also, since the marginal likelihood \(P\left( x\right)\) does not depend on \(\theta\), we can drop it. We are finally left with:

\[\theta =\text{arg max}_{\theta } P\left( x|\theta \right)\]

This is essentially MLE. Thus, we can conclude that MLE is a special case of MAP in Bayesian inference, where the prior distribution is assumed to be uniform.

Nonetheless, it should be noted that MLE is not very representative of Bayesian methods in general: Bayesian methods are characterised by the use of distributions to summarise data and draw inferences, and thus tend to report the posterior mean or median, together with credible intervals, rather than a single point estimate.3

Summary

In the sections above we have briefly introduced the concept of MLE and shown how it is applied to derive estimates. MLE is a straightforward frequentist tool, and in our example it even yields a closed-form solution. It has been widely adopted in many estimation procedures.

Although a closed-form solution can be derived for the simple linear regression model shown, we often cannot derive explicit solutions: they may not exist, may be hard to describe, or the dataset may be too large to compute them directly. In these cases we usually employ iterative methods to solve the likelihood equation, such as gradient descent or Newton’s method.
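As a rough illustration of the iterative route, here is a minimal gradient-descent sketch (reusing x, y from the regression example; the learning rate and iteration count are arbitrary choices) that minimises the mean squared residual, which, as shown above, is equivalent to maximising the likelihood with respect to the coefficients:

    # Gradient descent on the coefficient-dependent part of the negative log
    # likelihood, i.e. the (mean) squared residuals.
    b0, b1 = 0.0, 0.0                          # initial guesses
    lr = 0.01                                  # hypothetical learning rate
    for _ in range(5000):
        resid = y - (b0 + b1 * x)
        grad_b0 = -2.0 * resid.mean()          # d/db0 of the mean squared residual
        grad_b1 = -2.0 * (resid * x).mean()    # d/db1 of the mean squared residual
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1

    print(b0, b1)  # approaches the closed-form beta_0_hat, beta_1_hat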

This concludes the post. I might add more information on MLE in the future that might be useful. Stay tuned if you are interested! Thanks for reading.

  1. I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. Cambridge, Massachusetts: The MIT Press, 2016. 

  2. C. Shalizi, “Lecture Notes, CMU 36-401, Fall 2015,” Lecture 6: The Method of Maximum Likelihood for Simple Linear Regression, Sep. 30, 2015. https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf (accessed Dec. 19, 2021). 

  3. “Maximum a posteriori estimation,” Wikipedia. Mar. 17, 2021. Accessed: Dec. 19, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Maximum_a_posteriori_estimation&oldid=1012559771 
