Concept
In binary classification where $y \in \{0, 1\}$, logistic regression outputs the probability that $y = 1$ given input $x$, written $h_\theta(x) = P(y = 1 \mid x; \theta)$.

We treat it as $y = 1$ if $h_\theta(x) \ge 0.5$, and as $y = 0$ otherwise.

If we want to be conservative in predicting positive, we can raise the threshold, for example predicting $y = 1$ only when $h_\theta(x) \ge 0.7$.

In statistics, these probabilities are called conditional probabilities, expressed as $P(y = 1 \mid x)$.
We can use logistic regression as a production model, or as an inference tool to understand the role of the input variables in explaining the outcome, because logistic regression produces an interpretable model.
Logistic regression outputs

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

This function is called the sigmoid function.
Sigmoid Function
Use $z = \theta^T x$, so the output can be written as

$$g(z) = \frac{1}{1 + e^{-z}}$$

To understand the sigmoid function, start from the exponential function in the denominator. The exponential function with a negative input behaves as described below.

When the input of the exponential function is the negative of $z$, as in $e^{-z}$:
- When $z > 0$, $e^{-z} < 1$. As $z \to \infty$, $e^{-z} \to 0$.
- When $z < 0$, $e^{-z} > 1$. As $z \to -\infty$, $e^{-z} \to \infty$.
The behavior of the sigmoid function is visualized below.
Applying our findings about the exponential function to the sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$:
- When $z > 0$, $e^{-z} < 1$, so $1 + e^{-z} < 2$ and $g(z) > 0.5$. As $z \to \infty$, $e^{-z} \to 0$, $1 + e^{-z} \to 1$, and $g(z) \to 1$.
- When $z < 0$, $e^{-z} > 1$, so $1 + e^{-z} > 2$ and $g(z) < 0.5$. As $z \to -\infty$, $e^{-z} \to \infty$, $1 + e^{-z} \to \infty$, and $g(z) \to 0$.
We sometimes see the term logistic function instead of sigmoid function, but we treat them as the same thing.
For the sigmoid function, we sometimes see the following equation instead of $\frac{1}{1 + e^{-z}}$:

$$g(z) = \frac{e^z}{1 + e^z}$$

They are the same, because multiplying the numerator and denominator of $\frac{1}{1 + e^{-z}}$ by $e^z$ gives $\frac{e^z}{e^z + 1}$.
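As a quick check, here is a minimal NumPy sketch (the function names `sigmoid` and `sigmoid_alt` are mine, not from the references) that confirms the two forms agree and that $g(z)$ approaches $0$ and $1$ at the extremes:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid in the 1 / (1 + e^{-z}) form."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_alt(z):
    """Equivalent e^{z} / (1 + e^{z}) form."""
    return np.exp(z) / (1.0 + np.exp(z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))                               # near 0 for z << 0, 0.5 at z = 0, near 1 for z >> 0
print(np.allclose(sigmoid(z), sigmoid_alt(z)))  # True: the two forms are the same function
```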
Optimization Objective
To use logistic regression, we need to compute the parameters $\theta$ that minimize the cost function:

$$\min_\theta J(\theta)$$

The above means that we want to try different values for the parameters $\theta$ and find the ones that make the cost $J(\theta)$ smallest.
This math takes the form of

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log \left(1 - h_\theta(x^{(i)})\right) \right]$$

The math can also be expressed per example as

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$
The behavior of this cost function (loss function) is visualized below.
In either case:

- When $y = 1$ and $h_\theta(x) = 1$, the cost is $0$. But as $h_\theta(x)$ gets smaller (toward $0$), the cost gets larger.
- When $y = 0$ and $h_\theta(x) = 0$, the cost is $0$. But as $h_\theta(x)$ gets larger (toward $1$), the cost gets larger.

It makes sense because, when the prediction $h_\theta(x)$ disagrees with the true label $y$, the model should be penalized with a large cost.
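A small sketch (the helper name `cost` is mine) of the per-example cost, reproducing the behavior in the list above:

```python
import numpy as np

def cost(h, y):
    """Per-example cost: -[y*log(h) + (1-y)*log(1-h)]."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(cost(0.99, 1))  # ~0.01: confident and correct, near-zero cost
print(cost(0.01, 1))  # ~4.61: confident and wrong, large cost
print(cost(0.01, 0))  # ~0.01
print(cost(0.99, 0))  # ~4.61
```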
Gradient Descent
One of many ways to compute the parameters $\theta$ that minimize $J(\theta)$ is gradient descent:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

What we need to know is the derivative of the loss function of logistic regression with respect to the parameter $\theta_j$.
The derivative of the log function is

$$\frac{d}{dx} \log x = \frac{1}{x}$$

The derivative of the exponential function is

$$\frac{d}{dx} e^x = e^x$$

The chain rule is

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$$
Because the loss function of logistic regression uses the sigmoid function, we find the derivative of the sigmoid function in advance. Use $g(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$.

The derivative of the sigmoid function is

$$\frac{d}{dz} g(z) = \frac{d}{dz} (1 + e^{-z})^{-1}$$

Because of the chain rule,

$$= -(1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z})$$

By the derivative of the exponential function and the chain rule,

$$= -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

Add 1 and subtract 1 in the numerator:

$$= \frac{1 + e^{-z} - 1}{(1 + e^{-z})^2} = \frac{1 + e^{-z}}{(1 + e^{-z})^2} - \frac{1}{(1 + e^{-z})^2}$$

Taking $\frac{1}{1 + e^{-z}}$ out,

$$= \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right)$$

Because $g(z) = \frac{1}{1 + e^{-z}}$, the derivative of the sigmoid function is

$$g'(z) = g(z)\left(1 - g(z)\right)$$
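A quick numerical sanity check of $g'(z) = g(z)(1 - g(z))$ against a central finite difference (a sketch of mine, not from the references; the test point is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference estimate
analytic = sigmoid(z) * (1 - sigmoid(z))                     # g(z) * (1 - g(z))
print(numeric, analytic)  # the two values agree to many decimal places
```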
We finished with the derivative of the sigmoid function. Going back to the loss function (cost function) of logistic regression, for simplicity ignore $\frac{1}{m} \sum_{i=1}^{m}$ and the superscripts $(i)$, and consider a single training example:

$$J(\theta) = -\left[ y \log h_\theta(x) + (1 - y) \log \left( 1 - h_\theta(x) \right) \right]$$

Simplifying it with sigmoid notation $h_\theta(x) = g(\theta^T x)$,

$$J(\theta) = -y \log g(\theta^T x) - (1 - y) \log \left( 1 - g(\theta^T x) \right)$$

The derivative of the cost function with respect to $\theta_j$ is

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left[ -y \log g(\theta^T x) - (1 - y) \log \left( 1 - g(\theta^T x) \right) \right]$$

By the derivative of the log function and the chain rule,

$$= -y \frac{1}{g(\theta^T x)} \frac{\partial g(\theta^T x)}{\partial \theta_j} - (1 - y) \frac{1}{1 - g(\theta^T x)} \left( -\frac{\partial g(\theta^T x)}{\partial \theta_j} \right)$$

Because we found that the derivative of the sigmoid function is $g'(z) = g(z)(1 - g(z))$, and $\frac{\partial}{\partial \theta_j} \theta^T x = x_j$,

$$= -y \frac{g(\theta^T x)\left(1 - g(\theta^T x)\right) x_j}{g(\theta^T x)} + (1 - y) \frac{g(\theta^T x)\left(1 - g(\theta^T x)\right) x_j}{1 - g(\theta^T x)}$$

By offsetting (cancelling the common factors),

$$= -y \left( 1 - g(\theta^T x) \right) x_j + (1 - y)\, g(\theta^T x)\, x_j$$

Taking $x_j$ out and expanding,

$$= \left( -y + y\, g(\theta^T x) + g(\theta^T x) - y\, g(\theta^T x) \right) x_j = \left( g(\theta^T x) - y \right) x_j$$

Notice that $g(\theta^T x) = h_\theta(x)$, so the derivative of the loss is $\left( h_\theta(x) - y \right) x_j$.
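To sanity-check the result, here is a small finite-difference comparison of the analytic gradient $\left(g(\theta^T x) - y\right) x_j$ for a single example (a sketch of mine; all values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y):
    """Per-example loss -[y*log(h) + (1-y)*log(1-h)] with h = g(theta^T x)."""
    h = sigmoid(theta @ x)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.8])  # x[0] = 1 acts as the intercept term
y = 1.0

analytic = (sigmoid(theta @ x) - y) * x  # (h - y) * x_j for every j
numeric = np.array([
    (loss(theta + e, x, y) - loss(theta - e, x, y)) / 2e-6
    for e in 1e-6 * np.eye(3)
])
print(np.allclose(analytic, numeric))  # True
```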
We found the derivative of the loss function of logistic regression. Going back to gradient descent, this time we use the full cost with $\frac{1}{m} \sum_{i=1}^{m}$:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The above is gradient descent for logistic regression without regularization.

We can also include regularization in gradient descent; for example, with L2 regularization the update becomes

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \quad (j \ge 1)$$

where $\theta_0$ is updated without the regularization term.
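Below is a minimal batch gradient descent sketch using the derivative above, with optional L2 regularization. The function name and the synthetic data are my own, not from the references:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, lam=0.0, iters=1000):
    """Batch gradient descent for logistic regression.

    X is (m, n) with a leading column of ones for the intercept;
    lam is the L2 regularization strength (the intercept is not regularized).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)             # predictions, shape (m,)
        grad = X.T @ (h - y) / m           # (1/m) * sum over i of (h - y) * x_j
        grad[1:] += (lam / m) * theta[1:]  # regularize every theta_j except theta_0
        theta -= alpha * grad
    return theta

# Made-up data: an intercept column plus two features.
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = (X @ np.array([0.5, 2.0, -1.0]) + rng.normal(size=200) > 0).astype(float)
print(gradient_descent(X, y, alpha=0.5, lam=1.0, iters=5000))
```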
Parameter Interpretation
- If $\theta_j$ is positive, increasing $x_j$ increases the probability of $y = 1$.
- If $\theta_j$ is negative, increasing $x_j$ decreases the probability of $y = 1$.
A one-unit increase in $x_j$ changes the log-odds by $\theta_j$; equivalently, it multiplies the odds by $e^{\theta_j}$.

The odds are the following:

$$\frac{p(x)}{1 - p(x)} = e^{\theta^T x}$$

In logistic regression, you cannot interpret a one-unit increase in $x_j$ as a fixed change in the probability $p(x)$ itself, because how much $p(x)$ changes depends on the current value of $x$; only the change in the log-odds is constant.
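A small numerical illustration of this point (the coefficient values and points are made up):

```python
import numpy as np

def prob(theta, x):
    """p(x) = 1 / (1 + e^{-theta^T x})."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def odds(p):
    return p / (1 - p)

theta = np.array([-1.0, 0.8])   # intercept and one coefficient (made-up values)
x_low = np.array([1.0, 1.0])
x_high = np.array([1.0, 2.0])   # one-unit increase in x_1

# The odds ratio for a one-unit increase is e^{theta_1}, no matter where x starts ...
print(odds(prob(theta, x_high)) / odds(prob(theta, x_low)), np.exp(theta[1]))
# ... but the change in probability itself depends on the current x.
print(prob(theta, x_high) - prob(theta, x_low))
```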
Maximum Likelihood
Maximum likelihood is a way to compute the parameters $\theta$ of logistic regression: choose the $\theta$ under which the observed data are most probable.

In The Elements of Statistical Learning (ESL), for binary classification where $y_i \in \{0, 1\}$ and $p(x_i; \theta) = P(y_i = 1 \mid x_i; \theta)$, the log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \theta) + (1 - y_i) \log \left( 1 - p(x_i; \theta) \right) \right]$$
Maximum likelihood sets the derivative of the log-likelihood to zero to maximize it. The derivative is the same (up to sign and the $\frac{1}{m}$ factor) as the one we computed for gradient descent.
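A short sketch (helper names are mine) showing that the ESL log-likelihood is the negative of the Coursera-style cost scaled by $m$, so maximizing one is the same as minimizing the other:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ESL-style log-likelihood: sum of y*log(p) + (1-y)*log(1-p)."""
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cost(theta, X, y):
    """Coursera-style cost: -(1/m) * sum of y*log(h) + (1-y)*log(1-h)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Made-up data and parameters just to show the identity numerically.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = (rng.random(100) < 0.5).astype(float)
theta = rng.normal(size=3)
print(np.isclose(log_likelihood(theta, X, y), -100 * cost(theta, X, y)))  # True
```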
Logistic regression vs. LDA
In multi-class classification, discriminant analysis is more popular, although logistic regression can also handle it.

Both logistic regression and LDA produce the same form of linear decision boundary, with the log-odds linear in $x$, like $c_0 + c_1 x_1$.
- Logistic regression gets its parameters by maximum likelihood. Logistic regression is better than LDA if the Gaussian assumptions are not met.
- LDA gets its parameters from the mean and variance computed under a normal (Gaussian) distribution. LDA is better than logistic regression if the Gaussian assumption is met and the dataset is small.
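As a rough illustration (not from the references), here is a quick scikit-learn comparison on synthetic Gaussian data, where both models end up with similar linear boundaries:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data whose class-conditional distributions are Gaussian,
# so the LDA assumption holds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]

for model in (LogisticRegression(), LinearDiscriminantAnalysis()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # similar accuracy from similar linear boundaries
```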
Reference
- Machine Learning by Stanford University | Coursera
- An Introduction to Statistical Learning, 4.3 Logistic Regression
- The Elements of Statistical Learning, 4.4 Logistic Regression
- sklearn.linear_model.LogisticRegression
- The Derivative of Cost Function for Logistic Regression