Logistic Regression

Logistic regression is a classification algorithm used to predict the probability of a binary outcome. It is used when the output is a discrete binary value (one of two possible values). Note: Despite its name, logistic regression is used for classification, not regression!

Variable Naming:

  • $m$: Number of training examples
  • $x$: Feature (input variable)
  • $x^{(i)}$: Feature of the $i$-th training example (one-based index)
  • $y$: Target (output variable)
  • $y^{(i)}$: Target of the $i$-th training example (one-based index)
  • $\hat{y}$: Prediction
  • $w$: Weight
  • $b$: Bias
  • $\alpha$: Learning rate

Model

The model of logistic regression is represented by the function: $$ f_{w,b}(x) = \frac{1}{1 + e^{-(w \cdot x + b)}} $$ The function $$ g(z) = \frac{1}{1 + e^{-z}} $$ is called the sigmoid function or logistic function.
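A minimal sketch of this model in NumPy (the function names `sigmoid` and `predict_proba` are illustrative, not taken from the text above):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # f_{w,b}(x) = g(w · x + b); works for a single example or a feature matrix
    return sigmoid(np.dot(x, w) + b)
```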

Cost

The squared error cost function used in linear regression is not suitable for logistic regression because it yields a non-convex cost surface. Instead, the loss for a single example (with $\hat{y}^{(i)} = f_{w,b}(x^{(i)})$) is the log loss, defined as: $$ L(f_{w,b}(x^{(i)}), y^{(i)}) = -\log(\hat{y}^{(i)}) $$ if $y^{(i)} = 1$ and $$ L(f_{w,b}(x^{(i)}), y^{(i)}) = -\log(1 - \hat{y}^{(i)}) $$ if $y^{(i)} = 0$.

The cost function is then defined as: $$ J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$
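A possible vectorized implementation of this cost, assuming `X` is an $(m, n)$ NumPy array and `y` a vector of 0/1 labels (the clipping of the predictions, to avoid $\log(0)$, is a practical addition not discussed above):

```python
import numpy as np

def compute_cost(X, y, w, b):
    # y_hat[i] = f_{w,b}(x^{(i)})
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)   # avoid log(0)
    # Log loss averaged over all m examples: J(w, b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```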

Prediction

Predictions are made using the learned values of $w$ and $b$: $$ \hat{y} = \frac{1}{1 + e^{-(w \cdot x + b)}} $$ The prediction is then rounded to 1 or 0 based on the threshold (usually 0.5, but not necessarily!).
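A small sketch of the prediction step under the same assumptions (the `threshold` parameter is illustrative):

```python
import numpy as np

def predict(X, w, b, threshold=0.5):
    # Probability that y = 1 for each example
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Threshold the probabilities to obtain class labels 0 or 1
    return (y_hat >= threshold).astype(int)
```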

Gradient Descent

To run gradient descent, the partial derivatives of the cost function with respect to $w$ and $b$ are needed. The partial derivative of the cost function with respect to $w$ is: $$ \frac{\partial}{\partial w}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) \cdot x^{(i)} $$ The partial derivative of the cost function with respect to $b$ is: $$ \frac{\partial}{\partial b}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) $$
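A sketch of these gradients together with a plain gradient descent loop, again assuming an $(m, n)$ feature matrix `X` (function names are illustrative):

```python
import numpy as np

def compute_gradients(X, y, w, b):
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    err = y_hat - y                  # f_{w,b}(x^{(i)}) - y^{(i)}
    dj_dw = (X.T @ err) / m          # partial derivative w.r.t. w, shape (n,)
    dj_db = np.mean(err)             # partial derivative w.r.t. b
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    # Repeatedly take a step of size alpha in the direction of the negative gradient
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```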

Additional

The same concepts of learning curves, vectorized implementation and feature scaling as in linear regression can be applied to logistic regression.
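For example, z-score feature scaling carries over unchanged; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def zscore_normalize(X):
    # Rescale each feature to zero mean and unit standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```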

Overfitting and Underfitting

Overfitting and underfitting are problems in machine learning where a model fails to generalize well to new, unseen data: an underfit model is too simple to capture even the training data, while an overfit model fits the training data too closely. The concepts of bias and variance are used to describe these problems.

Bias

Bias describes the simplifying assumptions the model makes about the data. High bias can cause underfitting because the model is too simple to capture the underlying structure of the data. Too few features or too simple a model can lead to high bias.

Reduce Bias

There are several ways to reduce bias.

  1. More Features: Adding more features (by engineering or selection) can help reduce bias.
  2. More Complex Model: Using a more complex model can help reduce bias.
  3. Less Regularization: Decreasing the regularization strength gives the model more flexibility and can help reduce bias.

Variance

Variance describes the amount by which the model would change if it were estimated using a different training data set. High variance is a sign of an overfit model. Too many features or too complex a model can lead to high variance.

Reduce Variance

There are several ways to reduce variance.

  1. More Data: More training data can help reduce variance.
  2. Feature Selection: Excluding features that carry little information can help reduce variance.
  3. Regularization: Regularization can help reduce variance by penalizing large weights (see the formula below).
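
One common choice (assuming L2 regularization with parameter $\lambda$, which is not introduced above) adds a penalty on the weights to the cost: $$ J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 $$ A larger $\lambda$ shrinks the weights and reduces variance, while a smaller $\lambda$ gives the model more flexibility and reduces bias.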