We'll be exploring gradient descent and maximum likelihood estimation (and possibly least squares).

Mathematica Notebooks.

Least Squares Regression

Let's start by assuming we have a collection of $n$ data points $(x_i,y_i)$ for $i$ from 1 to $n$.

f[x_] := 2 x + 3;  (* the "true" model used to simulate data *)
sigma = 1;
(* draw 10 normal errors via the inverse CDF method, rounded to 2 decimal places *)
residuals = Round[Table[InverseCDF[NormalDistribution[0, sigma]][RandomReal[]], {i, 1, 10}], 0.01];
(* the simulated data points (x_i, y_i) *)
data = Table[{i, f[i] + residuals[[i]]}, {i, 1, 10}]
(* scatterplot of the data, with the true line shown dashed *)
Show[
 ListPlot[data],
 Plot[f[x], {x, 1, 10}, PlotStyle -> Dashed]
 ]

After creating a scatterplot of the data, we observe that the data seems to follow a linear trend $f(x,m,b) = mx+b$. Note that this function depends on $x$, $m$, and $b$. A residual $r_i = y_i-f(x_i,m,b)$ is the difference between an observed value $y_i$ and a predicted value $f(x_i,m,b)$. The least squares problem asks us to find the coefficients $m$ and $b$ so that the sum of the squared residuals is minimized, so we want to minimize the function $$S(m,b) = \sum_{i=1}^{n}(y_i-f(x_i,m,b))^2.$$ We need to do the following:

  • Find the critical points of $S$. (Set $\vec \nabla S(m,b) = (0,0)$ and solve for $m$ and $b$.)
  • Determine if the critical point is a maximum or minimum of $S$. (See the Mathematica sketch after this list.)
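
Here is a minimal sketch of those two steps in Mathematica, using the simulated data from above (the names S, crit, and hess are ours):

(* sum of squared residuals as a function of the slope m and intercept b *)
S[m_, b_] := Sum[(data[[i, 2]] - (m data[[i, 1]] + b))^2, {i, Length[data]}]
(* critical points: solve grad S = (0,0) *)
crit = Solve[{D[S[m, b], m] == 0, D[S[m, b], b] == 0}, {m, b}]
(* second-derivative test: S is quadratic, so the Hessian is constant;
   two positive eigenvalues means the critical point is a minimum *)
hess = {{D[S[m, b], m, m], D[S[m, b], m, b]},
        {D[S[m, b], b, m], D[S[m, b], b, b]}};
Eigenvalues[hess]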

Suppose instead that after plotting the data, we noticed that the data appears to follow a parabolic trend, so $f(x,a_0,a_1,a_2) = a_2 x^2+a_1 x+a_0$. Repeat the above and find the coefficients $a_0$, $a_1$, $a_2$ which provide a minimum for $$S(a_0,a_1,a_2) = \sum_{i=1}^{n}(y_i-f(x_i,a_0,a_1,a_2))^2.$$

  • What would our steps be? (One possible sketch follows this list.)
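
The steps are the same: set all three partial derivatives of $S$ to zero and solve. A minimal sketch in Mathematica, again using our simulated data (the quadratic is fit to linear data here, so this only shows the mechanics):

(* sum of squared residuals for the quadratic model *)
S2[a0_, a1_, a2_] := Sum[(data[[i, 2]] - (a2 data[[i, 1]]^2 + a1 data[[i, 1]] + a0))^2, {i, Length[data]}]
Solve[{D[S2[a0, a1, a2], a0] == 0,
       D[S2[a0, a1, a2], a1] == 0,
       D[S2[a0, a1, a2], a2] == 0}, {a0, a1, a2}]
(* Mathematica's built-in Fit solves the same normal equations *)
Fit[data, {1, x, x^2}, x]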

Probability Density Functions and Maximum Likelihood Estimation

Another way to fit data to a model is to utilize probabilities. There are many probability models used to explore data, but we'll examine two of them today.

Exponential Distribution

When measuring how long one must wait between two events (how long between two customer service calls, how long between clicks on a Geiger counter, etc.; see Wikipedia for more examples), it's common to use the exponential distribution. The probability density function for the exponential distribution is given by

  • $f(x) = \lambda e^{-\lambda x}$ for $x\geq 0$, for some constant $\lambda>0$.

Here is a Desmos app to see what the parameter $\lambda$ signifies in the distribution. The word "density" above indicates that $f(x)$ is a probability per something, in this case length (or time, or whatever unit of measurement we attach to the $x$-axis). The number $\lambda$ is a parameter for the exponential model, which can be determined from data.

Suppose we are studying the time between incoming calls to a call center.

  • To compute the probability that the next call will happen in less than 2 minutes, we compute $\int_0^2 \lambda e^{-\lambda x}\,dx$.
  • The probability that the next call will happen sometime in the future is $\int_0^{\infty} \lambda e^{-\lambda x}\,dx$. This should equal 1, and indeed every probability density function integrates to 1 over its entire domain: $\int_{-\infty}^{\infty} f(x)\, dx = 1$.
  • The average time between calls is $\bar x = \frac{\int_0^{\infty} x\lambda e^{-\lambda x}\,dx}{\int_0^{\infty} \lambda e^{-\lambda x}\,dx} = \frac{\int_0^{\infty} x\lambda e^{-\lambda x}\,dx}{1}$, and integration by parts shows this equals $\frac{1}{\lambda}$. (A Mathematica check follows this list.)
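
As promised, here is a quick Mathematica check of these three integrals (a minimal verification; the only assumption is $\lambda > 0$):

Assuming[\[Lambda] > 0, Integrate[\[Lambda] Exp[-\[Lambda] x], {x, 0, 2}]]          (* 1 - E^(-2 \[Lambda]) *)
Assuming[\[Lambda] > 0, Integrate[\[Lambda] Exp[-\[Lambda] x], {x, 0, Infinity}]]   (* 1 *)
Assuming[\[Lambda] > 0, Integrate[x \[Lambda] Exp[-\[Lambda] x], {x, 0, Infinity}]] (* 1/\[Lambda] *)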

Over the course of a day at a call center, we kept track of how long we needed to wait between each call. The average wait time of these 100 calls was 2.5 minutes. How can we use this information to determine the parameter $\lambda$?

  • There are two ways. The method of moments (an integration technique) says to compare the observed average (here 2.5 minutes) with the theoretical average (here $\int_0^{\infty} x\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}$), and then solve for the parameters. This gives $2.5 = \frac{1}{\lambda}$, which means $\lambda = \frac{1}{2.5} = 0.4$. We're done.
  • Another method looks at probabilities, and uses derivatives to address the question. If each wait time is independent of the other wait times, then we can multiply probability densities together to obtain a joint probability density. This gives a joint density of $\prod_{i=1}^{100}\lambda e^{-\lambda x_i}$. If we think of $\lambda$ as a variable, fixing the data $x_i$, then we obtain what's called the likelihood function (what's the likelihood of observing $x_1, x_2, \ldots, x_{100}$ if $\lambda$ is the parameter?).
    • The question now boils down to maximizing the likelihood, by treating $\lambda$ as the variable.
    • Note that taking the derivative of a product can be quite daunting, so instead we maximize the log-likelihood function $\ln (\prod_{i=1}^{100}\lambda e^{-\lambda x_i})$; since $\ln$ is increasing, this has the same maximizer. Let's do so now (a sketch of the computation follows).
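
Here is a sketch of that computation, using only the setup above. The logarithm turns the product into a sum: $$\ln\left(\prod_{i=1}^{100}\lambda e^{-\lambda x_i}\right) = \sum_{i=1}^{100}\left(\ln\lambda - \lambda x_i\right) = 100\ln\lambda - \lambda\sum_{i=1}^{100}x_i.$$ Differentiating with respect to $\lambda$ and setting the result equal to zero gives $$\frac{100}{\lambda} - \sum_{i=1}^{100}x_i = 0, \qquad\text{so}\qquad \lambda = \frac{100}{\sum_{i=1}^{100}x_i} = \frac{1}{\bar x} = \frac{1}{2.5} = 0.4,$$ which agrees with the method of moments. (The second derivative, $-100/\lambda^2$, is negative, so this critical point is indeed a maximum.)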

Normal Distribution

Perhaps the most important probability distribution is the normal distribution. This distribution's shape is often called a "bell curve." The formula for the distribution is

  • $f(x) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ for some constants $\mu$ and $\sigma>0$.

Here is a Desmos app to see what the parameters $\mu$ and $\sigma$ signify in the function above.

One important thing to note is that for $f(x)$ to be a probability density function, we must have $\int_{-\infty}^{\infty}f(x)dx=1$. Proving this is true is no small task, one that we now have the tools to do.

  • Let's show that $\int_{-\infty}^{\infty}\frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}dx = 1$.
    • First use the substitution $z = \frac{x-\mu}{\sigma}$.
    • Then square the integral, combine it into a single double integral, and use polar coordinates to finish. (A sketch of this argument follows.)
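
Here is a sketch of that argument. After substituting $z = \frac{x-\mu}{\sigma}$ (so $dx = \sigma\,dz$), the integral becomes $I = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz$. Squaring and switching to polar coordinates, $$I^2 = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(z^2+w^2)/2}\,dz\,dw = \frac{1}{2\pi}\int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\,r\,dr\,d\theta = \frac{1}{2\pi}(2\pi)(1) = 1.$$ Since $I>0$, we conclude $I = 1$.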

Mathematica has the integral above built in.

g[x_] := 1/(\[Sigma] Sqrt[2 Pi]) Exp[-((x - \[Mu])/\[Sigma])^2/2]
Integrate[g[x], {x, -Infinity, Infinity}]  (* without assumptions on \[Sigma], the answer is not simply 1 *)
Assuming[\[Sigma] > 0, Integrate[g[x], {x, -Infinity, Infinity}]]  (* with \[Sigma] > 0, this returns 1 *)

We now return to the original problem, that of fitting data to a model. Let's use maximum likelihood estimation to determine the parameters $m$ and $b$ in the model $f(x) = mx+b$.

(* regenerate the simulated data as before *)
f[x_] := 2 x + 3;
sigma = 1;
residuals = Round[Table[InverseCDF[NormalDistribution[0, sigma]][RandomReal[]], {i, 1, 10}], 0.01];
data = Table[{i, f[i] + residuals[[i]]}, {i, 1, 10}]
Show[
 ListPlot[data],
 Plot[f[x], {x, 1, 10}, PlotStyle -> Dashed]
 ]
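
To connect this back to least squares, here is a sketch of the idea, under the assumption that each residual $y_i - (mx_i + b)$ is drawn from a normal distribution with mean 0 and standard deviation $\sigma$ (exactly how the data above was simulated). The log-likelihood of the data is $$\ln\prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{y_i-(mx_i+b)}{\sigma}\right)^2} = -n\ln(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i-(mx_i+b)\right)^2.$$ The first term does not involve $m$ or $b$, so maximizing the likelihood over $m$ and $b$ is the same as minimizing $\sum_{i=1}^{n}(y_i-(mx_i+b))^2$, which is exactly the least squares problem $S(m,b)$ from the start of these notes.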

