We often face the problem of mapping data into predefined classes or groups. This process of predicting the categorical (qualitative) response for an observation is called ‘classification.’ Example: predicting whether a patient is healthy or sick based on their symptoms.
Statistically, in classification models we estimate Pr(Y|X) instead of predicting Y directly. There are two common approaches for modeling Pr(Y|X):
Discriminative: We model Pr(Y|X) directly.
Example: Logistic regression, the K-nearest neighbors method.
Generative: We estimate Pr(X|Y) and Pr(Y) and then use Bayes’ theorem to model Pr(Y|X).
Example: Naive Bayes, Linear Discriminant Analysis.
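For the generative case, Bayes’ theorem turns the two estimated pieces into the quantity we want (written here for a response with classes indexed by k):

$$\Pr(Y = k \mid X) = \frac{\Pr(X \mid Y = k)\,\Pr(Y = k)}{\sum_{j} \Pr(X \mid Y = j)\,\Pr(Y = j)}$$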
In this article, I will provide a brief overview of logistic regression.
Why use logistic regression and not linear regression?
- One of the difficulties in using linear regression for modeling qualitative response variables is that there is no natural way of converting a categorical variable with more than two levels into a continuous variable.
- For binary response variables, it can be shown that linear regression provides an estimate of Pr(Y = 1|X). Though these predictions provide an ordering, some of the probability estimates may fall outside the interval [0, 1] (see the sketch after this list).
- Instead of modeling Y directly, logistic regression models the probability that Y belongs to a particular class/category.
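To make the second point concrete, here is a minimal sketch on a small simulated dataset of my own (the data and coefficients are illustrative assumptions, not from any particular study):

```python
import numpy as np

# Simulate a binary response, then fit ordinary least squares to it.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-3 * x))  # assumed true class probabilities
y = rng.binomial(1, p_true)        # binary (0/1) response

slope, intercept = np.polyfit(x, y, deg=1)  # OLS line: y ≈ intercept + slope * x
fitted = intercept + slope * x
print(fitted.min(), fitted.max())  # extremes typically fall below 0 or above 1
```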
Mathematical form of the logistic regression model:
In logistic regression, we use the logistic function (an example of a sigmoid function), which has the form:
$$p(X) = \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{1}$$
This function ensures that the probability values always lie between 0 and 1. After rearranging the terms in equation (1), we get the following:
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \tag{2}$$
The left-hand side of equation (2) is called the ‘odds’, and it can take any value between 0 and infinity. Taking logs on both sides of equation (2), we get
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \tag{3}$$
The left-hand side of equation (3) is called the ‘log-odds’ or ‘logit’, and it has a linear relationship with the predictor variable X.
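As a quick numerical check of equations (1)–(3), assuming illustrative values β0 = −1 and β1 = 2:

```python
import numpy as np

beta0, beta1 = -1.0, 2.0  # assumed coefficients, for illustration only
x = np.linspace(-3, 3, 7)

p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))  # eq. (1)
odds = p / (1 - p)                                               # eq. (2)
logit = np.log(odds)                                             # eq. (3)

print(np.all((p > 0) & (p < 1)))              # probabilities stay in (0, 1)
print(np.allclose(logit, beta0 + beta1 * x))  # the logit is linear in x
```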
Where is the error term from linear regression?
Notice that,
$$Y = \beta_0 + \beta_1 X + \epsilon, \quad \epsilon \sim N(0, \sigma^2) \quad \Longleftrightarrow \quad Y \mid X \sim N(\beta_0 + \beta_1 X, \, \sigma^2)$$
Since the normal distribution has two parameters, even though our main focus is on estimating the mean, we also need to estimate the second parameter, the variance.
Binary logistic regression, however, uses the Bernoulli distribution, and we only need to estimate that distribution’s single parameter, p(X) = Pr(Y = 1|X), which also happens to be its mean. Hence, equation (3) has no error term.
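Spelling out the binary case:

$$Y \mid X \sim \mathrm{Bernoulli}\big(p(X)\big), \qquad E[Y \mid X] = p(X), \qquad \mathrm{Var}(Y \mid X) = p(X)\big(1 - p(X)\big)$$

The single parameter p(X) determines both the conditional mean and the conditional variance, so there is no separate noise parameter for an additive error term to carry.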
Simulation:
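A minimal sketch of such a simulation, assuming true coefficients β0 = −1 and β1 = 3, a sample of 1,000 points, and statsmodels for the fit (all illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

# Simulate binary data from an assumed logistic model:
# Pr(Y = 1 | X) = exp(-1 + 3x) / (1 + exp(-1 + 3x))
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-1.0 + 3.0 * x)))
y = rng.binomial(1, p)

# Fit logistic regression by maximum likelihood
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
print(fit.params)  # estimates of (beta0, beta1), close to (-1, 3)
```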
Interpreting coefficient estimates in logistic regression:
The standard method for estimating the unknown coefficients of a logistic regression is maximum likelihood. The cost function for logistic regression is the negative logarithm of the likelihood function, and the coefficient estimates correspond to the minimum of this cost function, which can be found using a method called stochastic gradient descent. I will cover this method and the math behind it in one of my upcoming blogs.
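Concretely, for observations (xᵢ, yᵢ) with yᵢ ∈ {0, 1}, the likelihood and the corresponding cost function (the negative log-likelihood, also known as cross-entropy) are:

$$\ell(\beta_0, \beta_1) = \prod_{i=1}^{n} p(x_i)^{y_i}\big(1 - p(x_i)\big)^{1 - y_i}$$

$$J(\beta_0, \beta_1) = -\sum_{i=1}^{n} \Big[y_i \log p(x_i) + (1 - y_i)\log\big(1 - p(x_i)\big)\Big]$$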
In general terms,
$$\log\text{-odds}(x + 1) - \log\text{-odds}(x) = \beta_1 \quad \Longleftrightarrow \quad \mathrm{odds}(x + 1) = e^{\beta_1}\,\mathrm{odds}(x)$$

That is, a one-unit increase in X changes the log odds by β1, or equivalently multiplies the odds by e^{β1}.
Example: The slope coefficient from the logistic fit for the simulated data is:

$$\hat{\beta}_1 = 3.1023$$
Hence, a unit increase in x increases the log odds by 3.1023 or, equivalently, multiplies the odds by e^{3.1023} ≈ 22.
References:
Wikipedia: Sigmoid function (https://en.wikipedia.org/wiki/Sigmoid_function)
James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning with Applications in R (Springer)