Linear Regression

Linear regression tries to find the best-fitting linear function that maps the input to the output. It is used when the output is a continuous value.

Variable Naming:

  • $m$: Number of training examples
  • $x$: Feature (input variable)
  • $x^{(i)}$: Feature of the $i$-th training example (one-based index)
  • $y$: Target (output variable)
  • $y^{(i)}$: Target of the $i$-th training example (one-based index)
  • $\hat{y}$: Prediction
  • $w$: Weight
  • $b$: Bias

Model

The model of a linear regression (with one feature) is represented by a mathematical function: $$ f_{w,b}(x) = w \cdot x + b $$

Cost

To measure the performance of the model, a cost function is used. For linear regression, the most commonly used cost function is the mean squared error: $$ J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - f_{w,b}(x^{(i)}))^2 $$
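A minimal NumPy sketch of this cost computation (the function name compute_cost and the array arguments are illustrative, not code from the course):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Mean squared error cost J(w,b) for single-feature linear regression."""
    m = x.shape[0]
    errors = (w * x + b) - y          # sign is flipped vs. the formula, but it is squared anyway
    return np.sum(errors ** 2) / (2 * m)
```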

The goal is to find the values of $w$ and $b$ that minimize the cost function $J(w,b)$. Gradient descent is an optimization algorithm used to minimize the cost function.

Prediction

Predictions are made using the learned values of $w$ and $b$: $$ \hat{y} = f_{w,b}(x) = w \cdot x + b $$

Gradient Descent

Gradient descent is an iterative optimization algorithm used to minimize a function. It works by moving in the direction of the steepest descent of the cost function. This means that if its step size has been chosen correctly, it will converge to a local minimum.

The update for $w$ is: $$ w := w - \alpha \frac{\partial}{\partial w}J(w,b) $$ where $\alpha$ is the learning rate (the “step” size) and the partial derivative is: $$ \frac{\partial}{\partial w}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) \cdot x^{(i)} $$

For $b$, the update is similar: $$ b := b - \alpha \frac{\partial}{\partial b}J(w,b) $$ with the partial derivative: $$ \frac{\partial}{\partial b}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) $$

Important for implementation: The updates for $w$ and $b$ are done simultaneously!
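As a sketch, one gradient descent step for the single-feature case might look like this, assuming NumPy arrays x and y (the function names are illustrative):

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Partial derivatives of J(w,b) with respect to w and b."""
    m = x.shape[0]
    errors = (w * x + b) - y              # f_{w,b}(x^(i)) - y^(i)
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent_step(x, y, w, b, alpha):
    """One simultaneous update of w and b."""
    dj_dw, dj_db = compute_gradient(x, y, w, b)
    # Both gradients are computed before either parameter changes:
    # this is what the "simultaneous update" means in practice.
    w = w - alpha * dj_dw
    b = b - alpha * dj_db
    return w, b
```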

The learning rate $\alpha$ is a hyperparameter that needs to be chosen carefully. If it is too small, the algorithm will take a long time to converge. If it is too large, the algorithm might repeatedly overshoot the minimum and never converge (it might even diverge!).

Multiple Linear Regression

Multiple linear regression is an extension of linear regression to multiple input variables. Important: Multivariate linear regression is not the same thing as multiple linear regression! Multivariate regression refers to models with multiple output variables, whereas multiple linear regression refers to multiple input variables (features).

Variable Naming:

  • $n$: Number of features
  • $x_j$: Feature $j$
  • $\vec{x}^{(i)}$: Feature vector of the $i$-th training example (one-based index)
  • $x_j^{(i)}$: Value of feature $j$ of the $i$-th training example (one-based index)

Model

The model of a multiple linear regression is represented by a mathematical function: $$ f_{\vec{w},b}(\vec{x}) = w_1 \cdot x_1 + w_2 \cdot x_2 + \ldots + w_n \cdot x_n + b $$ where $x_1, x_2, \ldots, x_n$ are the features, $w_1, w_2, \ldots, w_n$ are the weights, and $b$ is the bias.

Written as a dot product, which makes it easier to implement in code: $$ f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b $$ where $\vec{w}$ and $\vec{x}$ are vectors. In Python, using NumPy, this is np.dot(w, x) + b.
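For example, with made-up numbers:

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])   # illustrative weights
b = 4.0                          # illustrative bias
x = np.array([2.0, 1.5, 0.3])    # one feature vector x^(i)

y_hat = np.dot(w, x) + b         # f_{w,b}(x) = w · x + b
```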

Gradient Descent for Multiple Linear Regression

The updates for $w$ and $b$ are similar to the updates for linear regression, but the partial derivatives are different.

The update for $w$ is: $$ w_j := w_j - \alpha \frac{\partial}{\partial w_j}J(w,b) $$ where the partial derivative is: $$ \frac{\partial}{\partial w_j}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \cdot x_j^{(i)} $$

For $b$, the update is similar: $$ b := b - \alpha \frac{\partial}{\partial b}J(w,b) $$ with the partial derivative: $$ \frac{\partial}{\partial b}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) $$
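A vectorized sketch of these updates, assuming X is an (m, n) NumPy array with one training example per row (the names are illustrative):

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradients of J(w,b) for multiple linear regression."""
    m = X.shape[0]
    errors = X @ w + b - y          # f_{w,b}(x^(i)) - y^(i) for all examples
    dj_dw = X.T @ errors / m        # shape (n,): one partial derivative per w_j
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Run num_iters simultaneous updates of w and b."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```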

Normal Equation

The normal equation is an analytical approach to finding the values of $w$ and $b$ that minimize the cost function. It is an alternative to using an iterative optimization algorithm like gradient descent. Some machine learning libraries use the normal equation to solve for $w$ and $b$ instead of gradient descent.
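A sketch of the normal equation $\theta = (X^T X)^{-1} X^T y$, assuming a design matrix X with a column of ones prepended so the bias is learned as part of the parameter vector (this setup is an assumption, not from the notes above):

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y analytically instead of iterating."""
    m = X.shape[0]
    X_aug = np.hstack([np.ones((m, 1)), X])   # leading column of ones for the bias
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    b, w = theta[0], theta[1:]
    return w, b
```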

Feature Scaling

Feature scaling is a technique used to standardize the range of independent variables or features of the data. It can help the gradient descent algorithm converge more quickly. It is performed during the data preprocessing step.

Typical target ranges for scaled feature values are roughly -1 to 1 or 0 to 1. If values fall slightly outside these ranges, it is usually not a problem. If in doubt, it is always a good idea to scale the features.

Mean Normalization

Mean normalization is a feature scaling technique that centers each feature around 0, scaling its values to lie roughly between -1 and 1. It is done by subtracting the mean of the feature and dividing by the range of the feature. To normalize a feature $x$: $x_{\text{normalized}} = \frac{x - \mu}{\text{max}(x) - \text{min}(x)}$ where $\mu$ is the mean of the feature.
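A sketch in NumPy, applied column-wise to an (m, n) feature matrix:

```python
import numpy as np

def mean_normalize(X):
    """Center each feature on its mean and divide by its range."""
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range
```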

Z-Score Normalization

Z-score normalization is a feature scaling technique that scales the values of the features to have a mean of 0 and a standard deviation of 1. It is done by subtracting the mean of the feature and dividing by the standard deviation of the feature. To normalize a feature $x$: $x_{\text{normalized}} = \frac{x - \mu}{\sigma}$
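A sketch in NumPy, again applied column-wise:

```python
import numpy as np

def zscore_normalize(X):
    """Scale each feature to mean 0 and standard deviation 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```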

Learning Curve

The learning curve is a plot of the cost function over the number of training iterations. It is used to find the optimal learning rate and to detect when gradient descent has converged. If the learning curve ever increases, it is a sign that:

  • either the learning rate is too large
  • or the implementation is incorrect

An alternative to the learning curve is an automatic convergence test. This test checks if the cost function has decreased by less than a certain threshold over the last $n$ iterations. If this is the case, the algorithm has likely converged and training can be stopped.
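A sketch of such a test, with illustrative defaults for the threshold and window:

```python
def has_converged(cost_history, epsilon=1e-3, n=1):
    """True if the cost decreased by less than epsilon over the last n iterations."""
    if len(cost_history) < n + 1:
        return False
    return cost_history[-(n + 1)] - cost_history[-1] < epsilon
```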

Regularization

Regularization is a technique used to prevent overfitting by penalizing large weights. It is done by adding a term to the cost function that depends on the weights.

In the course, the following regularization term is used: $$ \text{regularization term} = \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 $$ where $\lambda$ is the regularization parameter and $n$ is the number of features.
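A sketch of the regularized cost for multiple linear regression (names are illustrative; note that the bias $b$ is not included in the penalty):

```python
import numpy as np

def compute_cost_regularized(X, y, w, b, lambda_):
    """Mean squared error cost plus the L2 regularization term."""
    m = X.shape[0]
    errors = X @ w + b - y
    mse = np.sum(errors ** 2) / (2 * m)
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)   # b is not regularized
    return mse + reg
```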