04 Logistic Regression
Logistic Regression
Logistic regression is a classification algorithm used to predict the probability of a binary outcome. It is used when the output is a discrete binary value (one of two possible values). Note: Despite its name, logistic regression is used for classification, not regression!
Variable Naming:
- $m$: Number of training examples
- $x$: Feature (input variable)
- $x^{(i)}$: Feature of the $i$-th training example (one-based index)
- $y$: Target (output variable)
- $y^{(i)}$: Target of the $i$-th training example (one-based index)
- $\hat{y}$: Prediction
- $w$: Weight
- $b$: Bias
- $\alpha$: Learning rate
Model
The model of logistic regression is represented by the function: $$ f_{w,b}(x) = \frac{1}{1 + e^{-(w \cdot x + b)}} $$ The function $$ g(z) = \frac{1}{1 + e^{-z}} $$ is called the sigmoid function or logistic function.
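A minimal sketch of the model in Python/NumPy, assuming a feature vector `x`, weight vector `w`, and scalar bias `b` (the helper names are illustrative):
```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # Model f_{w,b}(x): estimated probability that y = 1
    return sigmoid(np.dot(x, w) + b)
```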
Cost
The squared error cost function used in linear regression is not suitable for logistic regression because it would make $J(w,b)$ non-convex. Instead, logistic regression uses the log loss, defined as: $$ L(f_{w,b}(x^{(i)}), y^{(i)}) = -\log(\hat{y}^{(i)}) $$ if $y^{(i)} = 1$ and $$ L(f_{w,b}(x^{(i)}), y^{(i)}) = -\log(1 - \hat{y}^{(i)}) $$ if $y^{(i)} = 0$, where $\hat{y}^{(i)} = f_{w,b}(x^{(i)})$.
The cost function is then defined as: $$ J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$
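A sketch of the cost computation, reusing the `sigmoid` helper above and assuming `X` is an $m \times n$ feature matrix and `y` a vector of 0/1 labels:
```python
def compute_cost(X, y, w, b):
    # Average log loss J(w, b) over all m training examples
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss.sum() / m
```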
Prediction
Predictions are made using the learned values of $w$ and $b$: $$ \hat{y} = \frac{1}{1 + e^{-(w \cdot x + b)}} $$ The predicted probability is then thresholded to 1 or 0 (the threshold is usually 0.5, but not necessarily!).
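A possible thresholding step, with the cutoff exposed as a parameter (names are illustrative):
```python
def predict(X, w, b, threshold=0.5):
    # Convert predicted probabilities into hard 0/1 class labels
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```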
Gradient Descent
To run gradient descent, the partial derivatives of the cost function with respect to $w$ and $b$ are needed. The partial derivative of the cost function with respect to $w$ is: $$ \frac{\partial}{\partial w}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) \cdot x^{(i)} $$ The partial derivative of the cost function with respect to $b$ is: $$ \frac{\partial}{\partial b}J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) $$
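A minimal gradient descent sketch based on these derivatives, reusing the helpers above; the learning rate and iteration count are illustrative choices, not values fixed by this note:
```python
def compute_gradients(X, y, w, b):
    # Partial derivatives of J(w, b) with respect to w and b
    m = X.shape[0]
    error = sigmoid(X @ w + b) - y      # f_{w,b}(x^(i)) - y^(i)
    dj_dw = (X.T @ error) / m
    dj_db = error.sum() / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha=0.1, iterations=1000):
    # Simultaneously update w and b for a fixed number of iterations
    for _ in range(iterations):
        dj_dw, dj_db = compute_gradients(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```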
Additional
The same concepts of learning curves, vectorized implementation and feature scaling as in linear regression can be applied to logistic regression.
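For example, z-score feature scaling carries over unchanged; a sketch (the helper name is an assumption):
```python
def zscore_normalize(X):
    # Scale each feature to zero mean and unit standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Possible usage before training:
# X_norm, mu, sigma = zscore_normalize(X_train)
# w, b = gradient_descent(X_norm, y_train, np.zeros(X_norm.shape[1]), 0.0)
```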
Overfitting and Underfitting
Over- and underfitting are problems in machine learning where the model does not generalize well to new, unseen data. The concepts of bias and variance are used to describe these problems.
Bias
Bias describes the simplifying assumptions a model makes about the data. High bias can cause underfitting because the model is too simple to capture the underlying structure of the data. Too few features or too simple a model can lead to high bias.
Reduce Bias
There are several ways to reduce bias.
- More Features: Adding more features (by engineering or selection) can help reduce bias.
- More Complex Model: Using a more complex model can help reduce bias.
- Less Regularization: Decreasing the regularization parameter lets the model fit the training data more closely, which reduces bias.
Variance
Variance describes the amount by which the model would change if it were estimated using a different training data set. High variance is a sign of an overfit model. Too many features or too complex a model can lead to high variance.
Reduce Variance
There are several ways to reduce variance.
- More Data: More training data can help reduce variance.
- Feature Selection: Using fewer, more carefully chosen features can help reduce variance.
- Regularization: Regularization can help reduce variance by penalizing large weights; see the sketch below.
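As a sketch of how such a penalty can be added in code, an $L_2$ term can be appended to the cost from above (the parameter name `lambda_` is an assumption):
```python
def compute_cost_regularized(X, y, w, b, lambda_=1.0):
    # Log loss plus an L2 penalty (lambda_ / (2m)) * sum(w_j^2) on the weights
    m = X.shape[0]
    return compute_cost(X, y, w, b) + (lambda_ / (2 * m)) * np.sum(w ** 2)
```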