In one of his books, Isaac Asimov envisions a future where computers have become so intelligent and powerful that they are able to answer any question. In that future, Asimov postulates, scientists don’t become unnecessary. Instead, they’re left with a difficult task: figuring out how to ask the computers the right questions, those that yield an insightful, useful answer.

We’re not quite there yet, but in some sense we are.

In times of old, a lot of the effort in machine learning went into implementing its mechanics. With the advent of popular machine learning frameworks such as TensorFlow, PyTorch and the like, we can happily shrug off that burden. The mechanics of tweaking a model’s weights to achieve a certain goal are abstracted away. Defining a custom optimization goal is as easy as writing it down, and your favorite deep learning framework chugs away and minimizes the error (or loss) accordingly. But this freedom brings about an all-important question:

What goal should you optimize for?

### The two types of Losses (and which is more important)

We often find ourselves measuring 2 types of losses:

1. Training loss, which we actively optimize the model on. At every iteration, we take the derivative of the training loss with respect to all of the model’s degrees of freedom, and nudge them to lower the training loss directly.

2. Validation loss, by which we measure our model’s performance. Here we make a bunch of predictions on some holdout set that the model didn’t train on, and score how happy we are with the result, with no derivative or optimization involved. When we try a different model architecture, data augmentation, or any other change to our model, validation loss ultimately decides whether we’re happy with the change or go back to the drawing board.

Most of us have been subject to “validation loss” for a good chunk of our lives. But I would argue the education system is very susceptible to reward hacking.

If we’re lucky, the training loss and the validation loss might be the same thing. Very often, they’re not. Take binary classification as an example. It’s very easy to measure accuracy on the validation set: you predict a probability for every item, threshold the predictions at 50% probability (for example), and measure how often you were right. That’s a great, well-defined validation loss, but it doesn’t satisfy the requirement we have in training, that the loss should be differentiable. In other words, if you change the model’s weights very slightly, the accuracy is going to stay the same, so its gradient is zero almost everywhere and gives the optimizer nothing to follow. That means you can’t optimize for accuracy directly, even though it might be what you really care about.
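To make the point about accuracy concrete, here is a minimal sketch in plain Python. The one-feature logistic model, the toy data, and the helper names (`predict`, `accuracy`, `log_loss`) are all made up for illustration. It nudges the weight by a tiny amount and compares how the two measures react.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, xs):
    """Predicted probability of the positive class for each input."""
    return [sigmoid(w * x + b) for x in xs]

def accuracy(probs, labels, threshold=0.5):
    """Fraction of thresholded predictions that match the labels."""
    hits = sum((p > threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

def log_loss(probs, labels):
    """Mean negative log of the probability assigned to the true label."""
    return -sum(math.log(p if y else 1 - p) for p, y in zip(probs, labels)) / len(labels)

# Toy data, invented for this example.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
labels = [0, 0, 1, 0, 1]

w, b, eps = 1.0, 0.0, 1e-3
before = predict(w, b, xs)
after = predict(w + eps, b, xs)   # a very slight change to the weight

acc_before, acc_after = accuracy(before, labels), accuracy(after, labels)
loss_before, loss_after = log_loss(before, labels), log_loss(after, labels)

print("accuracy:", acc_before, "->", acc_after)   # does not move
print("log loss:", loss_before, "->", loss_after) # moves a little
```

No prediction crosses the 50% threshold, so accuracy is identical before and after, while log loss shifts slightly. That small shift is exactly the signal gradient-based training needs.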

So, of the 2 losses, which is more important? I would argue the validation loss is. Validation loss is how we decide “model A is better than model B”. This is our lodestar, and it will guide every modeling decision we make. Training loss is a tool, a tactical necessity to lower the validation loss.

So we’d better be damned right in choosing the right goal, right?

### Classification Losses — 3 Common Questions

Let’s stick to binary classification for now, just to have a motivated discussion. You’ve decided you’re trying to recommend Medium posts to people. You’re only going to show the user one recommendation, and they’ll either accept it or ignore you and move on to the next website.

For that, you’re building a classifier:

Classifier(person, article) -> click probability

What should you optimize for?

### Log Loss

The king of classification. This is the loss we usually optimize for in classification model training. As a performance metric, log-loss is a measure of how well calibrated you are in predicting probabilities of a class. In our example, the metric measures how good we are at predicting the click likelihood. If you said that something has a 0% chance of happening and it actually did happen, then you’re doing a terrible job of estimating probabilities, and the log loss will be infinity.

Intuition: a measure of how good you are at predicting probabilities.

Edge case: the loss goes to infinity if the model predicted probability 0.0 and the label was 1 (or vice versa).

Making sense of the number: in binary prediction with a 50–50 prior and a clueless classifier, you should see loss = -ln(0.5) = 0.693. With N classes and a flat prior, you should see loss = -ln(1/N) = ln(N). Log loss is notoriously hard to get an intuition for. A useful trick is taking e^(-loss). The number you get is the geometric mean of the probability you assigned to the right class, which you can read as roughly your typical probability of predicting the right class.

Definition: loss = -sum(y_i * log(p_i)), averaged over samples, where p_i is your predicted probability for a certain class i, and y_i is the label for that class (1 for the true class, 0 otherwise).
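As a quick sanity check on those numbers, here is a small sketch in plain Python. The labels and predicted probabilities are invented for illustration, and `log_loss` is a hand-rolled helper, not a library function.

```python
import math

def log_loss(probs, labels):
    """Mean binary log loss; probs are predicted P(label == 1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p   # probability given to the true class
        total += -math.log(p_true)
    return total / len(labels)

labels = [1, 0, 1, 1, 0, 0]

# A clueless classifier that always says 50-50.
clueless = log_loss([0.5] * len(labels), labels)

# A hypothetical classifier that leans the right way on every sample.
decent = log_loss([0.9, 0.2, 0.8, 0.7, 0.1, 0.3], labels)

# exp(-loss) is the geometric mean of the probability given to the true class.
clueless_p = math.exp(-clueless)
decent_p = math.exp(-decent)

print(f"clueless: loss={clueless:.3f}, exp(-loss)={clueless_p:.3f}")
print(f"decent:   loss={decent:.3f}, exp(-loss)={decent_p:.3f}")
```

The clueless classifier lands on ln(2), about 0.693, and exp(-loss) recovers 0.5. Note that `math.log(0.0)` raises a `ValueError` in Python rather than returning infinity, which is one reason implementations commonly clip probabilities away from exactly 0 and 1.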

Is this a good measure for a recommendation engine?

Not really. We don’t really care how likely the user is to click on our returned result. We want to put the likeliest article at the top. Predicting the click probability is a related problem, but it’s not the same problem. As an example, maybe we have the signal that a certain user is a click-fiend, and across the board she’s 10x likelier to click anything compared to other users. This information is useless for returning the best result for that user — increasing our prediction 10x for all recommended items doesn’t change their order, but lowers our log-loss, since it impacts our click probability by a lot.
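The click-fiend argument can be sketched in a few lines of Python. The click rates below are invented, and the "expected log loss" is computed analytically from the assumed true rates rather than from sampled clicks, so the example stays deterministic.

```python
import math

def expected_log_loss(predicted, true_rates):
    """Expected binary log loss when clicks actually happen at true_rates."""
    total = 0.0
    for p, q in zip(predicted, true_rates):
        total += -(q * math.log(p) + (1 - q) * math.log(1 - p))
    return total / len(predicted)

def ranking(scores):
    """Article indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical click rates of a typical user on four articles.
typical_rates = [0.010, 0.030, 0.005, 0.020]
# The click-fiend clicks everything 10x more often.
fiend_rates = [10 * r for r in typical_rates]

unaware = typical_rates   # model that ignores the user-level signal
aware = fiend_rates       # model that scales every prediction by 10

same_order = ranking(unaware) == ranking(aware)
loss_unaware = expected_log_loss(unaware, fiend_rates)
loss_aware = expected_log_loss(aware, fiend_rates)

print("same ranking:", same_order)
print("log loss, unaware vs aware:", loss_unaware, loss_aware)
```

Both models recommend the same article first, yet the model that knows about the click-fiend gets a lower log loss. The metric rewards an improvement that makes no difference to the recommendation.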

When is this a useful metric?

When you care about the probability of an event happening, as opposed to when you’re ordering recommendations. As an example, if I’m trying to predict the chance of rain, log-loss would be a very useful metric because it quantifies how good a job we’re doing in predicting the probability that it rains.

### Accuracy

This one is quite intuitive. We threshold our results in the way we intend to use our model (in our case, we’ll take the top-scoring Medium article amongst all candidates rather than apply a threshold), and ask whether that top result got clicked. A more accurate name for the metric in this context would be top-1 click-through rate, which sets the stage for us revisiting our product and recommending the top K results instead of just 1.

Intuition: this is a direct measure of “how often you made the right guess”.

Making sense of the number: an accuracy of 99% might sound like amazing performance, until you consider that the flat prediction of “the user isn’t going to click this article” is already correct 99.9% of the time. A baseline for accuracy is the frequency of the most common class. In a 50–50 binary problem, this comes out to 50%. But in guessing whether there will be a hurricane today, 99.99% is the absolute worst you can do, so this number must always be compared against a baseline.
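Here is a small sketch of that baseline comparison in plain Python. The data is made up to mirror the 99% vs 99.9% example, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def majority_baseline(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical data: 1 click in 1000 impressions.
labels = [0] * 999 + [1]

# A "99% accurate" model: it raises 9 false alarms and misses the one click.
predictions = [1] * 9 + [0] * 991

model_acc = accuracy(predictions, labels)
baseline_acc = majority_baseline(labels)

print("model accuracy:   ", model_acc)
print("baseline accuracy:", baseline_acc)
```

The model looks impressive in isolation but loses to the do-nothing baseline, which is why the raw number means little until you put the two side by side.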

Is this a good measure for a recommendation engine?

Yeah, this loss measures the exact use case — whether the top recommendation got picked.

But notice the weakness of this metric. It is quantized, which means that even if the model gets better at its job, the accuracy might not move at all. Let’s say our model improved at ranking article recommendations on Medium, but the chance of being exactly right and guessing the one article the user ended up clicking out of millions of candidates is still very close to 0. In that case, accuracy as a metric wouldn’t move at all even as we improved our model and made better recommendations. Which brings us to the next loss definition.

### AUC

It boggles my mind why people define AUC in a way that’s actively hostile for any human to understand. But just for completeness we’ll start with the dry definition. AUC stands for Area Under the Curve: the integral of the curve you get by plotting true-positive rate against false-positive rate (the ROC curve) as you sweep the classification threshold. Here’s a typical visualization:

If you ever want to confuse someone, go for the classic AUC definition.

I believe the above definition is useless, because it gives you no understanding of what AUC of 0.9 means. So let’s try another definition, which is mathematically equivalent but I find much more relatable:

In binary classification, AUC of 0.9 means that given a negative sample and a positive sample, 90% of the time your classifier would predict a higher score for the positive sample than it would for the negative sample.

Intuition: a measure of how good you are at ordering positive samples above negative samples.

Making sense of the number: AUC is the likelihood that your classifier will give a higher prediction to a random positive sample than to a random negative sample. 0.5 is as bad as it gets! That is what random scoring earns, and a model scoring below it would do better with its predictions flipped.

Definition: here
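The pairwise reading of AUC translates almost word for word into code. The sketch below is a brute-force version in plain Python, with invented scores and labels. It compares every positive against every negative and counts ties as half a win, which is a common convention. It is quadratic in the number of samples, so treat it as an illustration rather than something to run on a large validation set.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data, invented for this example.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = pairwise_auc(scores, labels)

# A model that gives every sample the same score carries no ranking information.
constant_auc = pairwise_auc([0.5] * len(labels), labels)

print("AUC:", auc)
print("AUC of a constant model:", constant_auc)
```

Of the nine positive-negative pairs here, eight are ordered correctly, so the AUC is 8/9. The constant model ties on every pair and lands on 0.5.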

Is this a good measure for a recommendation engine?

Yeah, but let’s think about what this is missing. We said at the outset that we only show the top article to the user, so this doesn’t really capture the way our model is going to get used in this context. On the other hand, unlike top-1 accuracy, this metric is sensitive to minor improvements in the model, and isn’t plagued by the same pathology as accuracy metrics.
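A minimal sketch of that trade-off, again in plain Python with invented scores: two hypothetical versions of a model score six candidate articles, only one of which the user clicked. Neither version puts the clicked article first, so top-1 accuracy cannot tell them apart, but AUC can.

```python
def top1_hit(scores, labels):
    """1 if the highest-scoring candidate was the clicked one, else 0."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 0, 1, 0]                   # the user clicked article 4
old_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]   # clicked article ranked 5th
new_scores = [0.9, 0.3, 0.4, 0.5, 0.8, 0.1]   # clicked article ranked 2nd

old_top1, new_top1 = top1_hit(old_scores, labels), top1_hit(new_scores, labels)
old_auc, new_auc = pairwise_auc(old_scores, labels), pairwise_auc(new_scores, labels)

print("top-1 hit:", old_top1, "->", new_top1)   # stuck at 0
print("AUC:      ", old_auc, "->", new_auc)     # clearly improves
```

The newer model is obviously better at ranking, and only AUC notices. The flip side is the one raised above: a model could gain AUC by reshuffling the bottom of the list, which the user never sees.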

### And many more

For the specific case of recommendation problems, there’s a vast list of metrics that are specifically designed for that case: NDCG, GMAP, MRR and the list goes on. But the purpose of this article isn’t to go deep into the specifics of recommendation engine metrics, but rather to discuss the most common metrics that are useful across the board, and hopefully give a bit of intuition in how to approach the problem of what we measure.

And lastly, a personal observation about measuring progress and outcomes:

As scientists we hate to move the goalposts. It’s much cleaner to have a single test set, with a single metric, and progressively get better at it. That rarely happens.

Reality is more interesting than that. Expect to change your test set, redefine your validation metric, exclude outliers and add new observations to your test set. Expect to move your goalposts until they actually reflect what you want to accomplish.