Yuki's App
Logistic Regression
Statistics | 2022-04-07

Concept

In binary classification where $y$ is labelled as 0 for negative and 1 for positive, logistic regression outputs a floating-point number between 0 and 1. Let $h(x)$ be the output of logistic regression. Use $w$ and $b$ for the parameters of logistic regression.

And we treat it as the probability that $y = 1$, predicting

$$\hat{y} = \begin{cases} 1 & \text{if } h(x) \geq 0.5 \\ 0 & \text{if } h(x) < 0.5 \end{cases}$$

If we are conservative in predicting positive, we can lower the threshold, such as to 0.1. For example, suppose $y = 1$ means someone defaults. We don't have to wait until the model outputs something more than 0.5; if the output is as low as 0.1, we can already judge it to be a default, if we are conservative.

In statistics, those probabilities are called conditional probabilities, expressed as $P(y = 1 \mid x; w, b)$ and $P(y = 0 \mid x; w, b)$. For example, we interpret $P(y = 1 \mid x; w, b)$ as the probability that $y = 1$ when we have the data $x$ and when we use the parameters $w$ and $b$.

We can use logistic regression as a production model, or use it as an inference tool to understand the role of the input variables in explaining the outcome, because logistic regression produces an interpretable model.

The logistic regression output is computed by,

$$h(x) = \frac{1}{1 + e^{-(w^T x + b)}}$$

This function is called the sigmoid function.
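As a small sketch of how the output and the threshold fit together (the helper names and the parameter and data values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # h(x) = sigmoid(w^T x + b), read as the probability that y = 1.
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    # Label 1 when the probability reaches the threshold, otherwise 0.
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Made-up parameters and two example rows, just to show the mechanics.
w = np.array([1.5, -0.8])
b = -0.2
X = np.array([[2.0, 1.0],
              [0.5, 2.5]])

print(predict_proba(X, w, b))           # about [0.88, 0.19]
print(predict(X, w, b))                 # [1, 0] with the default 0.5 threshold
print(predict(X, w, b, threshold=0.1))  # [1, 1]: the conservative threshold flips the second row
```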

Sigmoid Function

Use $\sigma(z)$ for the sigmoid function below, where $z$ is the input (for logistic regression, $z = w^T x + b$),

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

To understand the sigmoid function, start from the exponential function in the denominator. The exponential function with a negated input produces values like below.

Exponential function

When the input of the exponential function is the negative of the input, like $e^{-z}$,

  • When $z > 0$, $e^{-z} < 1$. As $z \to \infty$, $e^{-z} \to 0$.
  • When $z < 0$, $e^{-z} > 1$. As $z \to -\infty$, $e^{-z} \to \infty$.

The behavior of the sigmoid function is visualized below.

Sigmoid function

Applying our findings from the exponential function to the sigmoid function,

  • When $z > 0$, $e^{-z} < 1$, $1 + e^{-z} < 2$, $\sigma(z) > 0.5$. As $z \to \infty$, $e^{-z} \to 0$, $1 + e^{-z} \to 1$, $\sigma(z) \to 1$.
  • When $z < 0$, $e^{-z} > 1$, $1 + e^{-z} > 2$, $\sigma(z) < 0.5$. As $z \to -\infty$, $e^{-z} \to \infty$, $1 + e^{-z} \to \infty$, $\sigma(z) \to 0$.

We sometimes see the term logistic function instead of sigmoid function, but we treat them as the same.

For the sigmoid function, we sometimes see the following equation instead of $\frac{1}{1 + e^{-z}}$, but it's the same as the equation we have been using so far.

$$\sigma(z) = \frac{e^{z}}{1 + e^{z}}$$

They are the same because, multiplying $e^{z}$ to both the numerator and denominator of $\frac{1}{1 + e^{-z}}$,

$$\frac{1}{1 + e^{-z}} \cdot \frac{e^{z}}{e^{z}} = \frac{e^{z}}{e^{z} + 1}$$
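A quick numerical check of this equivalence, assuming NumPy is available:

```python
import numpy as np

z = np.linspace(-10, 10, 101)
form1 = 1.0 / (1.0 + np.exp(-z))        # 1 / (1 + e^(-z))
form2 = np.exp(z) / (1.0 + np.exp(z))   # e^z / (1 + e^z)

# The two expressions match to floating point precision.
print(np.allclose(form1, form2))  # True
```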

Optimization Objective

To use logistic regression, we need to compute the parameters $w$ and $b$. To achieve that, we set the following optimization objective. $m$ is the number of data. $n$ is the number of features. $h(x)$ is the logistic regression model output (the sigmoid function). $w$ and $b$ are the parameters. $\lambda$ is the regularization parameter.

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

The above equation means that we want to try different numbers for the parameters $w$ and $b$ to minimize the above value.

This math takes the form of,

$$\mathrm{Cost}(h(x), y) = -y \log(h(x)) - (1 - y) \log(1 - h(x))$$

The math also can be expressed as,

$$\mathrm{Cost}(h(x), y) = \begin{cases} -\log(h(x)) & \text{if } y = 1 \\ -\log(1 - h(x)) & \text{if } y = 0 \end{cases}$$

The behavior of this cost function (loss function) is visualized below.

Cost function

Whether $y = 1$ or $y = 0$, the sigmoid function can output any number between 0 and 1. But because of the cost function, the computed cost differs depending on the label $y$.

  • When $y = 1$, if $h(x) = 1$, the cost is $0$. But as $h(x)$ gets smaller, the cost gets larger.
  • When $y = 0$, if $h(x) = 0$, the cost is $0$. But as $h(x)$ gets larger, the cost gets larger.

It makes sense because, when $y = 1$, logistic regression needs to predict a number close to 1, and when $y = 0$, we want logistic regression to predict a number close to 0. In such cases the cost is smaller, and we want to minimize the cost.
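A minimal sketch of computing this cost, assuming `h` holds model outputs and `y` the labels (the numbers are made up to show that outputs close to the labels give a small cost):

```python
import numpy as np

def cost(h, y):
    # Unregularized cost: -1/m * sum( y*log(h) + (1-y)*log(1-h) ).
    m = len(y)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

y = np.array([1, 1, 0, 0])
good_h = np.array([0.9, 0.8, 0.1, 0.2])  # outputs close to the labels
bad_h = np.array([0.2, 0.3, 0.9, 0.8])   # outputs far from the labels

print(cost(good_h, y))  # small, about 0.16
print(cost(bad_h, y))   # large, about 1.68
```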

Gradient descent

One of many ways to compute the parameters $w$ and $b$ (or $\theta$) is gradient descent. Gradient descent is to take the derivative of the loss function (cost function) with respect to the parameters and to iteratively update the parameters. $w_j$ is a current parameter. $w_j'$ is the updated parameter. $\alpha$ is the step size. $J$ is the loss function.

$$w_j' = w_j - \alpha \frac{\partial J(w, b)}{\partial w_j}$$

What we need to know is the derivative of the loss function in logistic regression with respect to the parameter $w_j$. The following are the necessary derivative rules.

The derivative of the log function is,

$$\frac{d}{dx} \log(x) = \frac{1}{x}$$

The derivative of the exponential function is,

$$\frac{d}{dx} e^{x} = e^{x}$$

The chain rule is,

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

Because the loss function of logistic regression uses the sigmoid function, we find the derivative of the sigmoid function in advance. Use $\sigma(z) = \frac{1}{1 + e^{-z}}$ for the sigmoid function here.

The derivative of the sigmoid function is,

$$\frac{d\sigma(z)}{dz} = \frac{d}{dz} \left( \frac{1}{1 + e^{-z}} \right) = \frac{d}{dz} \left( 1 + e^{-z} \right)^{-1}$$

Because of the chain rule,

$$= -\left( 1 + e^{-z} \right)^{-2} \cdot \frac{d}{dz} \left( 1 + e^{-z} \right)$$

By the derivative of the exponential function and the chain rule,

$$= -\left( 1 + e^{-z} \right)^{-2} \cdot \left( -e^{-z} \right) = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2}$$

Adding 1 and subtracting 1 in the numerator,

$$= \frac{1 + e^{-z} - 1}{\left( 1 + e^{-z} \right)^2}$$

Taking $\frac{1}{1 + e^{-z}}$ outside,

$$= \frac{1}{1 + e^{-z}} \left( \frac{1 + e^{-z}}{1 + e^{-z}} - \frac{1}{1 + e^{-z}} \right) = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right)$$

Because $\frac{1}{1 + e^{-z}}$ is the sigmoid function,

$$\frac{d\sigma(z)}{dz} = \sigma(z) \left( 1 - \sigma(z) \right)$$
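A quick finite-difference check of this result (the test point $z = 0.7$ is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7      # arbitrary test point
eps = 1e-6

numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))                     # sigma(z) * (1 - sigma(z))

print(numeric, analytic)              # both about 0.2217
print(np.isclose(numeric, analytic))  # True
```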

We finished with the derivative of the sigmoid function. Going back to the loss function (cost function) of logistic regression. For simplicity, ignore $\frac{1}{m} \sum_{i=1}^{m}$ and the negative signs for $y \log(h(x))$ and $(1 - y) \log(1 - h(x))$.

$$J = y \log(h(x)) + (1 - y) \log(1 - h(x))$$

Simplifying it with the sigmoid notation, where $z = w^T x + b$ and $h(x) = \sigma(z)$,

$$J = y \log(\sigma(z)) + (1 - y) \log(1 - \sigma(z))$$

The derivative of the cost function with respect to $w_j$ is,

$$\frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \left[ y \log(\sigma(z)) + (1 - y) \log(1 - \sigma(z)) \right]$$

By the derivative of the log function and the chain rule,

$$= \frac{y}{\sigma(z)} \frac{\partial \sigma(z)}{\partial w_j} - \frac{1 - y}{1 - \sigma(z)} \frac{\partial \sigma(z)}{\partial w_j}$$

Because we found that the derivative of the sigmoid function is $\sigma(z)(1 - \sigma(z))$, and by the chain rule $\frac{\partial \sigma(z)}{\partial w_j} = \sigma(z)(1 - \sigma(z))\, x_j$,

$$= \frac{y}{\sigma(z)} \sigma(z) (1 - \sigma(z)) x_j - \frac{1 - y}{1 - \sigma(z)} \sigma(z) (1 - \sigma(z)) x_j$$

By offsetting,

$$= y (1 - \sigma(z)) x_j - (1 - y) \sigma(z) x_j$$

Taking $x_j$ outside,

$$= \left( y - y\sigma(z) - \sigma(z) + y\sigma(z) \right) x_j$$

Notice that $-y\sigma(z)$ and $+y\sigma(z)$ also offset.

$$\frac{\partial J}{\partial w_j} = \left( y - \sigma(z) \right) x_j$$

We found the derivative of the loss function in logistic regression. Going back to the gradient descent. This time we use $\frac{1}{m} \sum_{i=1}^{m}$ and the negative sign.

$$w_j' = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The above is gradient descent of logistic regression without regularization.

We can include regularization in gradient descent. $j$ is the index for features. $n$ is the number of features. $b$ is the intercept, which is not regularized.

$$w_j' = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right] \quad (j = 1, \dots, n)$$

$$b' = b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)$$
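A minimal NumPy sketch of this gradient descent loop with L2 regularization, following the notation above; the synthetic data, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.5, lam=0.1, n_iter=5000):
    # Gradient descent with L2 regularization; the intercept b is not regularized.
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        h = sigmoid(X @ w + b)                         # model output h(x) for all rows
        grad_w = (X.T @ (h - y)) / m + (lam / m) * w   # d(cost)/dw_j
        grad_b = np.sum(h - y) / m                     # d(cost)/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Synthetic data generated from known parameters, just to show the loop working.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = (sigmoid(X @ true_w + true_b) > rng.uniform(size=200)).astype(float)

w, b = fit_logistic_regression(X, y)
print(w, b)  # roughly recovers true_w and true_b, up to noise and regularization shrinkage
```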

Parameter Interpretation

  • If $w_j$ is positive, increasing $x_j$ increases the probability of $y = 1$.
  • If $w_j$ is negative, increasing $x_j$ decreases the probability of $y = 1$.

A one-unit increase in $x_j$ is associated with an increase in the log odds (logit) of $y = 1$ by $w_j$ units. The log odds is the following.

$$\log \left( \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} \right) = w^T x + b$$

The odds is the following.

$$\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = e^{w^T x + b}$$

In logistic regression, you cannot interpret a one-unit increase in $x_j$ as a fixed change in the probability itself; the change in probability depends on the current value of $x$.
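A small numerical illustration of this point (the coefficient value 0.7 is made up): a one-unit increase in $x_j$ always multiplies the odds by $e^{w_j}$, but the resulting change in probability depends on where you start.

```python
import numpy as np

w_j = 0.7  # made-up coefficient for feature x_j

# A one-unit increase in x_j adds w_j to the log odds,
# i.e. multiplies the odds by exp(w_j), regardless of the starting point.
print(np.exp(w_j))  # odds ratio, about 2.01

def prob_from_log_odds(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

# The same +0.7 in log odds moves the probability by different amounts
# depending on the current log odds, so there is no fixed probability change per unit.
for log_odds in [-3.0, 0.0, 3.0]:
    before = prob_from_log_odds(log_odds)
    after = prob_from_log_odds(log_odds + w_j)
    print(round(before, 3), round(after, 3), round(after - before, 3))
```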

Maximum Likelihood

Maximum likelihood is a way to compute the parameters $w$ and $b$. The goal of maximum likelihood is to find parameters such that the likelihood function output is maximized.

In the Elements of Statistical Learning (ESL) binary classification where $y \in \{0, 1\}$, $p(x; \beta)$ is the probability that the label is positive, and $1 - p(x; \beta)$ is the probability that the label is negative. $N$ is the number of data. $\beta$ is the parameters. ESL uses the following equation as the log-likelihood for maximum likelihood.

The log-likelihood is,

$$\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log \left( 1 - p(x_i; \beta) \right) \right]$$

Maximum likelihood sets the derivative of the log-likelihood to zero to maximize the log-likelihood. The derivative has the same form as the one we computed in gradient descent, so maximizing the log-likelihood corresponds to minimizing the (unregularized) cost function.
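As a sketch of maximum likelihood in practice, we can minimize the negative log-likelihood with a generic optimizer (scipy.optimize.minimize here; the synthetic data and true parameter values are made up):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(params, X, y):
    # params = [b, w_1, ..., w_n]; negative of the log-likelihood above.
    b, w = params[0], params[1:]
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Synthetic data from known parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (sigmoid(X @ np.array([1.0, -2.0]) + 0.3) > rng.uniform(size=300)).astype(float)

result = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y), method="BFGS")
print(result.x)  # [b, w_1, w_2], roughly near [0.3, 1.0, -2.0] up to sampling noise
```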

Logistic regression vs. LDA

In multiple-class classification, discriminant analysis is more popular, although logistic regression can also do it.

Both logistic regression and LDA produce a linear decision boundary, with log odds that are linear in $x$, like $w^T x + b$. But the ways to estimate the parameters are different. Also, the resulting decision boundaries could be different (see the comparison sketch after the list below).

  • Logistic regression gets parameters by maximum likelihood. Logistic regression is better than LDA if the Gaussian assumptions are not met.
  • LDA gets parameters from the mean and variance computed under a normal (Gaussian) distribution. LDA is better than logistic regression if the Gaussian assumption is met and the data is small.
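A small comparison sketch with scikit-learn, assuming it is available (the Gaussian synthetic data is made up): both models give linear boundaries, but the estimated coefficients need not agree.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Two Gaussian classes with a shared covariance (the LDA assumption).
rng = np.random.default_rng(2)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

log_reg = LogisticRegression().fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)

# Both boundaries are linear in x, but the coefficients differ because one is
# fit by maximum likelihood and the other from Gaussian means and covariance.
print(log_reg.coef_, log_reg.intercept_)
print(lda.coef_, lda.intercept_)
```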

Reference