Why Logistic?

For many machine learning beginners, the logistic function ${1\over 1 + \exp(-x)}$ seems quite unnatural, with its various usages in machine learning, it is often confusing why we want to use it, here I will give a brief motivation for logistic function.

Nice properties

A good motivation of using some thing in machine learning is good mathematical property that makes our life easier. For example, there is often no reason why we choose certain conjugate priors other than we can do inference easily.

Logisitic function has nice symmetry property that $1-f(x)=f(-x)$, and it is an offset and scaled hyperbolic tangent function that saturates to 0 or 1 when the absolute value goes large, this make it adorable as an activation function for neural networks. Also it has very pretty derivative,

It is extremely easy to get the derivative once we have calculated the funcion, no wander people like to use it!

Modeling log-odds ratio in logistic regression

Logistic regression is a widely used binary classifier. Suppose $\mathrm{x}$ is the input feature vector, and $\mathrm{w}$ is the weight of the model, we can write logistic regression in $\Pr(y=1\mid \mathrm{w, x})={1\over 1+\exp(-\mathrm{w}^T\mathrm{x})}$. The sigmoid function here serves as a probability predictor, it squashes a real value $-\mathrm{w}^T\mathrm{x}$ into $(0, 1)$, hence we can use it to represent probability.

But the real motivation underlying is not that because it is useful for transforming some real value into probability and we don’t have any other choice. Actually we have a more formal and convincing motivation. Considering the log-odds ratio, which is the log of the ratio between the probability of choosing two outcomes for binary classification:

By imaging the log curve, we can tell that this value has no constraint, it is 0 if the two outcome has equal probability and goes positive if the outcome in the nominator has larger probability otherwise it goes negative.

What if we model this value by a linear model $\mathrm{w}^T\mathrm{x}$ ? By simply drawing a equation, and noticing that the probability of two outcomes sum to one, we can derive that

Yes, what we have is exactly the logistic regression. By using the sigmoid function on top of a linear model, we are actually modeling the log-odds ratio. In this case, logistic regression can be treated as a generalized linear model that is well-known in statistic community, where people use a transformation over linear model to model another quantity of interest. 1

  1. A First Course in Machine Learning, by Simon Rogers, Mark Girolami