Introduction
This article is for people who are curious about the math behind multinomial logistic regression (MLR)[1], and also for those curious about maximum likelihood models and the “log trick” that makes them tractable. Understanding all of the parts of this relatively simple ML technique can give you a good mental prototype for understanding much more sophisticated algorithms that rely on similar mathematical machinery.
MLR is a machine learning technique for estimating the probability of a discrete event occurring given a range of factors. A simpler version of the technique, logistic regression, is used in medical research contexts, where it answers questions like, “What is the lung cancer risk for a 42-year-old overweight male who currently smokes 1.5 packs of cigarettes per day and has done so for the last 15 years, lives 1 mile away from a coal plant, and has a 0.5 cm lesion on a chest X-ray, and so on?” Multinomial logistic regression is good for answering questions like, “Given a racehorse’s age, finishing history, and medical history, as well as its jockey’s win rate and weight, how probable is it that the horse will win a race against these other 12 horses–characterized by the same variables–on this track, at this time of day, and under these conditions?” Mathematically, MLR represents the fitness function of the winning competitor divided by the sum of the fitness functions of all the competitors, including the winner.
In case you were starting to nod off: Bill Benter famously combined MLR with custom resource allocation software that used the Kelly Criterion to tell him how much to bet on which horses at race tracks in Hong Kong. He won hundreds of millions of dollars in a little over a decade. The proliferation of machine learning libraries, which has brought modest artificial intelligence programming within the reach of the serious hobbyist, has made implementing this algorithm far easier than when Benter was active. You still have to assemble and clean historical horse racing data, choose which variables to consider, create synthetic columns, NaN-sanitize, remove linearly dependent columns, write your own Kelly Criterion code, and so on. However, actually writing and running the regression with a gradient descent algorithm is now a well-supported standard practice, bordering on a “toy” problem.
In this article, I provide a high-level, implementation-friendly overview of the derivation of MLR, as well as a derivation of one of the main practical ways to solve MLR problems, maximum likelihood estimation (MLE)[2].
There are three main parts to an application of logistic regression: the probability to optimize; the function of the probability to use for the optimization; and the optimization loop. I’ll cover the first two of these, the probability function and the function of the probability to optimize, in this post. The optimization loop, whose complexity I’ve swept under the rug by calling it a “part” of a logistic regression, involves the choice of a gradient descent method, hyperparameter tuning, and other detailed work that warrants a separate post. I also leave out the Extract-Transform-Load aspect of finding, cleaning, and transforming your data. However, our final equation does account for an optimization you can use when storing your data that reduces both its size and the number of calculations deep inside your nested training loops.
The basics: utility, fitness, probability
To get the fitness function for a horse, we write its utility function as a linear combination of weighted variables. The utility function of a horse, \(U_h\), is given by[3]:
- \[ U_h = U_h(\bar{x}_h, \bar{\beta}) = \bar{x}_h \cdot \bar{\beta} = \sum_{i=1}^{N} x_i \beta_i \]
As a typographical convention, quantities with a bar over them are row vectors. More importantly, \(\beta_i\) represents the \(i\)th weight to apply to the \(i\)th variable \(x_i\), which gives us the relative importance of \(x_i\) in the horse’s performance. In words, this says that some notion of the comparative strength of a horse is represented by a sum of variables–think of \(x_i\) as `horse_age`, `rider_weight`, `rider_win_percentage`, `faddah_wuzzah_muddah`, and so on–each multiplied by a corresponding weight, \(\beta_i\). It is these weights that are adjusted, or trained, by the training algorithm, and they comprise the core of our model. Note that the magnitudes of the \(x\)s and \(\beta\)s aren’t constrained, but it’s typical, for reasons of numerical stability, to normalize the \(x\) values to a range that mostly falls within \([-1,1]\) by dividing through by a characteristic maximum value over a specified epoch in the data. Initial random values for \(\beta_i\) are typically also constrained to (approximately) this range, though they may wander farther afield during optimization.
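To make this concrete, here is a minimal NumPy sketch of Eqn. (1) for a single horse. The feature names and values are illustrative placeholders, not real data:

```python
import numpy as np

# Hypothetical, already-normalized features for one horse (roughly in [-1, 1]):
# horse_age, rider_weight, rider_win_percentage.
x_h = np.array([0.42, -0.31, 0.77])

# Initial weights drawn from approximately the same range; training adjusts these.
rng = np.random.default_rng(0)
beta = rng.uniform(-1.0, 1.0, size=x_h.shape)

U_h = x_h @ beta  # Eqn. (1): the utility is a single scalar dot product
print(U_h)
```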
To get the fitness of a given horse, we continue by exponentiating \(U_h\):
- \[ F_h = e^{U_h(\bar{x}_h, \bar{\beta})} = e^{\bar{x}_h \cdot \bar{\beta}} = e^{\sum_{i=1}^{N} x_i \beta_i}\]
Exponentiation does a number of things for us, not least of which is preventing division by zero. It’s easy to see it does so by showing that the range of \(U_h\) is \((-\infty,\infty)\), and that the range of \(F_h\) is therefore \((0,\infty)\), resulting in strictly positive denominators. Keep this in mind as you read Eqn. (3) below, which describes the probability that a given horse, \(h^*\), will win a race against a field of horses, \(H\). We represent this in terms of the ratio of our hero’s fitness function to the sum of all fitness functions in the field:
- \[
\begin{align}
P(h^*) &= \frac{F^*}{\sum_{h}^{H}{F_h}} = \frac{e^{U^*(\bar{x}_{h^*}, \bar{\beta})}}{\sum_{h}^{H}{e^{U_h(\bar{x}_h, \bar{\beta})}}}\\&= \frac{e^{\bar{x}_{h^*} \cdot \bar{\beta}}}{\sum_{h}^{H}{e^{\bar{x}_h \cdot \bar{\beta}}}} = \frac{e^{\sum_{i=1}^{N}x_{h^*,i}\beta_i}}{\sum_h^H{e^{\sum_{i=1}^{N}{x_{h,i}\beta_i}}}}
\end{align}
\]
The bottom-right expression is complex, but it really just says “take the ratio of the exponentiated sum of the weighted factors that contribute to the winning horse’s fitness to the sum of the same exponentiated sums for all horses on the starting line, including the winner.” More simply, it’s the ratio of the winning horse’s fitness to the total fitness of the field. It’s easy to show that if we were to sum the probabilities of every horse, we would get 1, the probability of \(\Omega\), the universe of possible winners for a race that completes. (Presumably, there would be refunds on your bets if the race were called off.)
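In code, Eqn. (3) is exactly a softmax over the field’s utilities. Here is a minimal NumPy sketch, assuming `X` holds one row of features per horse; the shift by `U.max()` is a standard numerical-stability trick rather than part of Eqn. (3) itself:

```python
import numpy as np

def win_probabilities(X, beta):
    """Eqn. (3): the probability that each horse wins, as a softmax over utilities.

    X    : (H, N) array, one row of normalized features per horse
    beta : (N,) weight vector shared by all horses
    """
    U = X @ beta             # (H,) utilities, Eqn. (1)
    F = np.exp(U - U.max())  # fitnesses, Eqn. (2); shifting by max(U) avoids overflow
    return F / F.sum()       # probabilities over the field; they sum to 1
```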
Finally, given that the result of this equation is likely to be used in a gradient descent algorithm, and that there could be hundreds of weights to consider for each of a typical field of a dozen horses, we can streamline this for the computer with a little algebra. By taking advantage of the facts that \(e^0=1\) and \(\frac{e^a}{e^b}=e^{a-b}\), we can divide the numerator and denominator through beforehand by the winning horse’s fitness function, giving us a numerically more efficient expression. This results in our first concession to implementation details:
- \[
\begin{align}
P_{h^*} &= \frac{1}{1 + \sum_{h \neq h^*}^{H}{F_h/F_{h^*}}} = \frac{1}{1 + \sum_{h\neq h^*}^{H}e^{(\bar{x}_h-\bar{x}_{h^*}) \cdot \bar{\beta}}}\\
&= \frac{1}{1 + \sum_{h\neq h^*}^H{e^{\sum_{i=1}^{N}{(x_{h,i}-x_{h^*,i})\beta_i}}}}
\end{align}
\]
Here, the \(1\)s in the numerator and denominator of (4) are just the result of dividing the fitness of the winning horse by itself. We account for this with “\(h\neq h^*\)” in the summation expression, reminding us to remove the winning horse from the sum in the denominator. The expressions inside the summation expression in the basement,
\[e^{(\bar{x}_h-\bar{x}_{h^*})\cdot\bar{\beta}}\] –or– \[e^{\sum_{i=1}^N(x_{h,i}-x_{h^*,i})\beta_i}\]
represent the relative fitness of each horse in the field compared to the fitness of the winning horse. Precomputing and storing the difference vectors \(\bar{x}_h-\bar{x}_{h^*}\) saves tens or hundreds of thousands of arithmetic operations per training step.
Equation (4) represents the probability that a given horse will win a race, in a form that is optimized for use in bespoke maximum likelihood or cross-entropy gradient descent algorithms. For example, if you store your raw historical race data in matrices of size \((|H|,N)\), where \(|H|\) is the number of horses and \(N\) is the number of features per horse, ordered by finish position, you can generate precomputed arrays by simply subtracting the top row from each successive row and discarding the top row, resulting in \(((|H|-1),N)\) matrices. Rather than recomputing them in each loop, you can generate them once and reuse the result millions or billions of times in your training loop.
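A minimal sketch of that precomputation, and of Eqn. (4) evaluated from it, assuming each race’s matrix is ordered by finish position with the winner in row 0:

```python
import numpy as np

def precompute_diffs(race_X):
    """Build the ((H-1), N) matrix of (x_h - x_{h*}) rows used in Eqn. (4).

    race_X : (H, N) array of features, ordered by finish, winner in row 0.
    """
    return race_X[1:] - race_X[0]  # subtract the winner's row from every other row

def winner_probability(diffs, beta):
    """Eqn. (4): P(h*) from a precomputed difference matrix."""
    return 1.0 / (1.0 + np.exp(diffs @ beta).sum())
```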
Maximum likelihood
In practical terms, we can always find a combination of weights that comes close to maximizing \(P_{h^*}\), the probability that the winning horse wins, for one specific race, with simple gradient descent or laborious direct differentiation. This presents us with the problem of overfitting: we’ve accounted only for the particular race at hand and ignored every other horse, every other race, and even every other outcome in other races for that horse. This is a problem, for example, if `days_since_last_win` seemed like the determining factor in this race, but `number_of_previous_races` seemed the likeliest candidate in another, and so on. What we need is a way to combine or average out the individual incidental correlations between specific variables and the outcome of a race by considering more races simultaneously. The quantity of interest is called the likelihood, which is simply the product of the probabilities of a set of outcomes. In simplified form for the purposes of training a multinomial logistic regression model:
- \[L(\boldsymbol{\beta})=\prod_{r=1}^R{P_r}\]
We can use this to motivate a discussion of the maximum likelihood estimate (MLE). The MLE is just the argmax of the product of the probabilities, or the weights that maximize the product of the probabilities from multiple trials, subject to the constraints of our model:
- \[\hat{\boldsymbol{\beta}}=argmax(L(\boldsymbol{\beta}))=argmax\left(\prod_r^R{P_{h^*}(\mathbf{x},\boldsymbol{\beta})}\right)\]
Here, \(r\) indicates an individual race, and \(R\) indicates all the races involved in this likelihood. In machine learning terms, the size of \(R\) is the batch size, the number of items that you operate on in a stretch of gradient descent steps within an iteration of your training loop. You can think of \(\hat{\boldsymbol{\beta}}\) as either the optimized weights themselves–which they are–or as the resulting model with the \(\beta\)s substituted in, depending on context.
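As a sketch, the quantity inside the argmax of Eqn. (6) is then just a loop over the precomputed difference matrices, one per race in the batch:

```python
import numpy as np

def batch_likelihood(diffs_per_race, beta):
    """Eqn. (6): the product over R races of the winner probability from Eqn. (4)."""
    L = 1.0
    for D in diffs_per_race:  # D is one precomputed ((H-1), N) difference matrix
        L *= 1.0 / (1.0 + np.exp(D @ beta).sum())
    return L
```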
It might not be immediately clear how maximizing the likelihood gives us the best estimator subject to the constraints of our model. It helps to first consider the case of trying to determine the bias in a coin from looking at a string of tosses. We might represent these, for example, as a string such as `HTHHHTTTTT`. In this case, heads came up four times and tails came up six times. Each flip, however, represented two possible outcomes, heads or tails, of which one actually happened. That is, there is a full binary tree of possible outcomes for, in this case, 10 flips of a coin: the first flip could’ve been `H` or `T`; the second flip, in either case, could’ve been `H` or `T`; and so on. Our `HTHHHTTTTT` string represents a path through this tree that records the results that actually happened. The likelihood of this path is just the product of the conditional probabilities of the actual outcomes along it.
Given that each result actually happened, it intuitively makes sense that we want to maximize the probability that each of them happened, and this is exactly what maximizing the likelihood does: It maximizes the joint probability for the path that the actual outcomes trace through the probability tree. Since every event in the data set happened, the model is pulled in the direction of \(P=1\) for all of the recorded events, hence the maximum likelihood, subject to the constraints of the model.
Example: Further examination of a possibly unfair coin
Consider again the case of successive coin tosses. If we were looking for the probability that a coin results in heads when tossed, \(P(H)\), and we obtained the sequence \(HHHTH\) from a series of 5 tosses, the corresponding product of the event probabilities would be \(P(H)^4(1-P(H))^1\). This expression is differentiable, and after taking the derivative, simplifying, and identifying \(P(H)=0\) and \(P(H)=1\) as minima over the domain, we find that the maximum likelihood occurs when \(P(H)=0.8\). The corresponding likelihood is \((0.8^4)(0.2)\approx 0.082\). If we instead assumed this was a fair coin, for example, then the likelihood would be \(0.5^5\approx0.031\), a much lower value. If we assumed that actually \(P(H)=0.9\), then the corresponding likelihood is \((0.9^4)(0.1)\approx0.066\), another value lower than our maximum. That is, in this case where our model is just \(P(T)=1-P(H)\), straying from the predicted \(\beta\) lowers the likelihood.
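For completeness, the derivative step is short. Writing \(p=P(H)\):
- \[\frac{d}{dp}\left[p^4(1-p)\right]=4p^3(1-p)-p^4=p^3(4-5p)\]
The derivative vanishes at \(p=0\) and \(p=4/5\), and since the likelihood is zero at both endpoints of \([0,1]\), \(p=0.8\) is the maximum.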
Four heads and one tails comes up all the time when flipping fair coins in batches of five, and our estimate could very likely be wrong for this particular coin because of our small sample size. However, maximum likelihood estimators respect the data: They choose the model that best explains the data that you have. Choosing a larger batch size (assuming we can generate a larger set of coin flip results) would take advantage of the Law of Large Numbers. Repeating for multiple batches would allow us to invoke the Central Limit Theorem to further increase our confidence in our result. As hinted at before, both the batch size and number of batches over which to optimize arise as hyperparameters in actual machine-learning algorithms.
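A quick numerical sanity check of those figures, as a sketch that scans a grid of candidate values for \(P(H)\) rather than solving analytically:

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)  # candidate values of P(H), step 0.01
likelihood = p**4 * (1 - p)      # likelihood of HHHTH as a function of P(H)
print(p[likelihood.argmax()])    # ~0.8, matching the analytic maximum
print(likelihood.max())          # ~0.08192
```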
You might have noticed that Eqns. (3) and (4) don’t involve expressions of the form \((1-P)\), but the coin model involves that quantity explicitly. We’ve chosen to train on winners only, since every outcome yields some information either way. We could, for example, choose a mix of races where \(h^*\) didn’t win, in which case those terms would indeed involve a \((1-P)\) factor, where \(P\) has the same form as Eqn. (4). The rule is that we consider products of the probabilities of outcomes. It just turns out that \(P(T)=1-P(H)\), whereas \(P_{h^*}\) is a complex expression involving hundreds of interrelated variables that implicitly account for the strengths of the non-winning horses.
Log likelihood
Eqn. (5) certainly represents a truth about the weights that result in the most likely distribution of probabilities between our race winners. The trouble is that it’s a product, and this has two drawbacks. First, multiplication is slightly slower than addition and subtraction. Second, and far more importantly, is the successive application of the product rule when calculating derivatives; the number of terms that must be calculated grows quadratically with the batch size. Given that we’ve hinted that larger batch sizes might result in more accurate estimators, we would like to do something about this. That something is the log trick.
We make use of three properties of the logarithm to lessen our model’s appetite for expensive and time-consuming GPU cycles. First, logarithmic functions are monotonically increasing, so larger likelihoods always correspond to larger log likelihoods; the \(argmax\) of the log likelihood is therefore the same as the \(argmax\) of the likelihood itself. Second, logs transform products into sums. This allows us to dispense with calculating the rapidly growing raft of derivative terms that we were faced with when we were thinking about operating directly on the products of the probabilities. Finally, successive multiplications by numbers less than one drive the result closer to zero. By transforming multiplication into summation, we keep our result away from the zero end of the floating-point range, a danger zone where we would be likely to suffer underflow and large relative errors.
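That last point is easy to demonstrate with a toy example: the product of a thousand small probabilities underflows double precision entirely, while the sum of their logs is perfectly representable:

```python
import numpy as np

probs = np.full(1000, 0.1)   # a thousand probabilities of 0.1
print(np.prod(probs))        # 0.0 -- 1e-1000 underflows double precision
print(np.log(probs).sum())   # ~-2302.6 -- the log likelihood is well-behaved
```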
Thus:
- \[\ell(\boldsymbol{\beta})=\log(L(\boldsymbol{\beta}))=\log\left(\prod_{r}^{R}P_{h^*}\right)=\sum_{r}^{R}\log(P_{h^*})\]
where \(\ell(\cdot)\) is the log likelihood. Our estimator becomes:
- \[\hat{\boldsymbol{\beta}}=argmax\left(\log\left(\prod_{r}^{R}{P_{h^*}}\right)\right)=argmax\left(\sum_{r}^R{\log(P_{h^*})}\right)\]
Substituting in for \(P_{h^*}\) yields the dot product formulation of an MLE estimator for an MLR:
- \[\hat{\boldsymbol{\beta}}=argmax\left(\sum_{r}^R{\log\left(\frac{1}{1 + \sum_{h\neq h^*}^H e^{(\bar{x}_h - \bar{x}_{h^*})\cdot\bar{\beta}}}\right)}\right)\]
This is of course equivalent to the summation notation version:
- \[\hat{\boldsymbol{\beta}}=argmax\left(\sum_{r}^R{\log\left(\frac{1}{1 + \sum_{h\neq h^*}^H e^{\sum_{i=1}^N(x_{h,i}-x_{h^*,i})\beta_i}}\right)}\right)\]
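Putting the pieces together, Eqn. (10) translates almost line for line into code. The following is a minimal sketch using plain NumPy and naive gradient ascent; in practice you’d hand the same objective to an out-of-the-box optimizer, as discussed below. Note there’s no overflow guard on the exponentials here; a production implementation would want one:

```python
import numpy as np

def log_likelihood(diffs_per_race, beta):
    """Eqn. (10): sum over races of log(1 / (1 + sum_h exp(diffs . beta)))."""
    ll = 0.0
    for D in diffs_per_race:                    # precomputed ((H-1), N) matrices
        ll -= np.log1p(np.exp(D @ beta).sum())  # log(1/(1+s)) = -log(1+s)
    return ll

def grad_log_likelihood(diffs_per_race, beta):
    """Gradient of the log likelihood with respect to beta."""
    g = np.zeros_like(beta)
    for D in diffs_per_race:
        w = np.exp(D @ beta)            # each loser's fitness relative to the winner
        g -= (w @ D) / (1.0 + w.sum())  # chain rule through -log(1 + sum)
    return g

def fit(diffs_per_race, n_features, lr=0.1, steps=1000, seed=0):
    """Naive gradient ascent on the log likelihood."""
    beta = np.random.default_rng(seed).uniform(-1.0, 1.0, n_features)
    for _ in range(steps):
        beta += lr * grad_log_likelihood(diffs_per_race, beta)
    return beta
```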
Conclusion
Eqns. (9) and (10) are real accomplishments. Similarly constructed equations are at the heart of machine learning and artificial intelligence algorithms that are vastly more capable and/or general than MLR. Yet, in spite of its specialized application range, Eqns. (9) and (10) represent as good a prototype as any upon which to build your intuition for the techniques used to bring these other intractable computations down into the reach of our CPUs and GPUs and up from the noise floor of floating-point errors. The trick of precomputing relative fitnesses, Eqn. (4), offers comparatively modest, though still significant, gains.
Of more immediate importance for you is that, with a tutorial in TensorFlow, PyTorch, or another GPU-enabled machine learning library–and armed with knowledge of techniques like hyperparameter tuning, L2 regularization, dropout, and so on–you should now be able to try your hand at beating the races by translating Eqns. (9) and (10) into a model that can be directly and efficiently trained with out-of-the-box optimizers. Good hunting!
Notes:
1. Or at least what horse race mathematicians call the technique when they apply it. True multinomial logistic regression predicts unorderable categories, such as whether certain people are more likely to own a dog, cat, or lizard.
2. One fortunate little oddity that nobody really explains to you when you’re starting out is that MLE is actually just about equivalent to the much-more-intimidating-sounding cross-entropy techniques. They differ mainly by sign: You maximize likelihood and minimize cross-entropy, but you’re operating on expressions that are simply negative multiples of each other. The math is interesting, and the insights gained from understanding cross-entropy are profound, but you can take it on faith that you’re getting to the same exact place, no matter which of the two methods you choose.
3. Don’t be intimidated by the string of equal signs. I just want to give you a list of definitions that tend from more “name-like” to more “implementation-like.” If you’re interested in the theory, you should consider absorbing the various ways in which the gritty details are gradually substituted in. If you’re concerned with making your millions at the track, the rightmost quantities are generally easiest to translate into code. All readers should puzzle out for themselves the equivalence between the dot product formulations and the summation formulations, if only to develop a facility with notation and to get an idea of when to rely on optimized APIs available in various machine learning libraries. You’ll typically use the dot product, rather than iterating over the variables and weights, when you use libraries like TensorFlow.
© 2024 Michael Norman. All rights reserved.