As you may know, we offer real-time conversational insights for inbound calls through our Signal AI technology. When you provide us with labeled training data (call recordings), we build an AI model that learns the patterns in your data and uses them to make predictions about future calls. Part of this process is testing the accuracy of the model, which we call performance scoring. We recently made significant improvements to our performance scoring, and we want to share why we did it and how it works.

## The Challenge

When you provide us with labeled training data to build an AI model, we use a machine learning algorithm to identify the patterns in your data. We then apply these patterns to future, unlabeled examples in order to make predictions about them. Since we can’t know a priori how accurate this machine learning model will be, we withhold a fraction of the data for testing it: we train the model on 80 percent of the data and then test its performance on the remaining 20 percent. Since we know the correct labels for this last 20 percent, we can compare our predictions with the known labels to assess our model’s accuracy and to estimate how well it will perform in the wild.
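The 80/20 split described above can be sketched in a few lines. This is an illustrative sketch, not our actual pipeline; the function name and the toy "call" tuples are made up for the example:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle labeled examples and withhold a fraction for testing."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# 100 labeled calls: (call_id, purchase_made)
calls = [(i, i % 10 == 0) for i in range(100)]
train, test = train_test_split(calls)
print(len(train), len(test))  # 80 20
```

The key point is that the test examples are never shown to the learning algorithm, so their known labels give an honest check on the model's predictions.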

Until recently, we quoted three numbers for model performance: the *true accuracy*, defined as the percentage of true calls which we correctly predicted as true, the *false accuracy*, defined as the percentage of false calls which we correctly predicted as false, and the *overall accuracy*, which is the total fraction of calls we correctly predicted, whether true or false.

To understand why these three numbers differ, imagine a very crude “purchase made” model which simply predicts *false* every time: it would have a terrible true accuracy of 0 percent, but a seemingly excellent false accuracy of 100 percent. Moreover, if purchases occur on only 10 percent of the calls, the overall accuracy would be 90 percent! Though this hypothetical model is arguably not very useful — it would *never* correctly identify a purchase — its performance might look good according to these metrics. This represents a key problem with our old accuracy scores: while they’re technically accurate, they can be hard to interpret.

To provide another example: suppose we have a new (and much improved) “purchase made” model for which the true accuracy, false accuracy, and overall accuracy are all 90 percent — that is, whenever a call comes in, whether a purchase was made or not, our model has a 90 percent chance of correctly predicting its outcome. If you were to listen to 10 calls which the model flagged as purchases, you might therefore expect 9 of them to be correct…but unfortunately this isn’t the case! If the conversion rate is 10 percent, then only about *half* of this sample would be correct!

This unexpected result happens because purchases in our imagined example are inherently rare: for every purchase made, there would be many more calls which didn’t result in a purchase. Even though we have a 90 percent false accuracy in this example, since there are so many non-purchases we’d nonetheless expect a number of false positives. If you run through the math, you’ll see that for every true purchase we correctly identified, we’d also expect roughly one call which was falsely flagged as a purchase. For this reason, *rare events are inherently difficult to identify*, and you need an extremely accurate model to find rare signals.
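Here is that math run through explicitly, as a sketch: 1,000 calls, a 10 percent conversion rate, and 90 percent accuracy on both classes (all numbers are from the hypothetical example above):

```python
total_calls = 1000
conversion_rate = 0.10
true_accuracy = 0.90   # fraction of purchases predicted as purchases
false_accuracy = 0.90  # fraction of non-purchases predicted as non-purchases

purchases = int(total_calls * conversion_rate)       # 100
non_purchases = total_calls - purchases              # 900

true_positives = round(purchases * true_accuracy)            # 90 correctly flagged
false_positives = round(non_purchases * (1 - false_accuracy))  # 90 wrongly flagged

# Of all calls flagged as purchases, what fraction really were purchases?
precision = true_positives / (true_positives + false_positives)
print(precision)  # 0.5
```

For every real purchase the model catches, it also flags roughly one non-purchase, so only about half of the flagged calls are correct — exactly the surprise described above.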

To look for a signal that only occurs 10 percent of the time, you need a model that’s *better* than 90 percent accurate. If purchases only occur 5 percent of the time, you’d need a model with 95 percent accuracy, and so on. Our old accuracy score did not account for this added difficulty.

Again, our old scores were not incorrect — in fact they are standard accuracy metrics for machine-learning models — but they can be confusing because they differ from what we colloquially think of as “accuracy.” We have developed an improved accuracy score which is a more intuitive representation of our models’ performance.

Another confusing aspect of our old accuracy score is that it can be noisy. Recall that we can’t know the performance of our models a priori: we must estimate it using a fraction of the data withheld for testing. As a result, our accuracy estimates come with some uncertainty, which can be considerable for small datasets.

For example, imagine a customer has a “purchase made” model with an accuracy of 91 percent. Hoping to improve this model, she uploads more training data and is disappointed to see its accuracy *drop* to 89 percent. This might seem surprising, but in fact, her model went from (91±15) percent to (89±7) percent…it very likely *improved* with the added data, but there was no way for her to infer that from our accuracy scores!
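The size of that uncertainty is driven by the size of the test set. A rough sketch using the normal approximation to a binomial proportion (the intervals we actually quote may be computed differently; the numbers here are illustrative):

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Approximate 95% confidence interval for an accuracy estimate.

    Normal approximation: can spill slightly outside [0, 1] for tiny samples.
    """
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p - margin, p + margin

# A small test set: ~91% accuracy measured on just 11 calls
print(accuracy_interval(10, 11))
# A larger test set: 89% accuracy measured on 100 calls
print(accuracy_interval(89, 100))
```

The second interval is much narrower than the first: adding training data shrinks the error bars even when the point estimate happens to tick downward.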

## The new performance score in our UI

Rather than quoting a single number for the accuracy, we now quote a range which brackets our uncertainty about the model’s accuracy. This provides a much clearer method to compare two models: if one model’s range overlaps with another’s, we cannot definitively say whether one model is better than the other. Our new interface shows the uncertainty in a model’s accuracy and should make clear whether a change in the score is genuine or whether it’s a statistical fluctuation.

## Formula for the new score

When testing a machine learning model against a test set, there are four possible outcomes: there are *true positives*, or true labels which were correctly predicted as true, *true negatives*, or false labels correctly predicted as false, *false positives*, or false labels incorrectly predicted as true, and *false negatives*, or true labels incorrectly predicted as false.

In our old system, we defined the true accuracy *a* as the fraction of true labels we correctly predicted:

$$a = \frac{TP}{TP + FN}$$

Earlier, we discussed a surprising example where we had a model with a 90 percent true accuracy, but for which only about half of the true predictions turned out to be correct. The reason for this discrepancy is that when we spot-check calls predicted to be true, we’re in fact measuring the fraction of true *predictions* which are correct, not the fraction of true *sales* we correctly predicted. Symbolically, we’re measuring the quantity:

$$p = \frac{TP}{TP + FP}$$

This is commonly known as the *precision*. Since we assumed sales were rare in this example, there are more falses than trues, and consequently the number of false positives *FP* is greater than the number of false negatives *FN*. As a result, the true precision *p* is lower than the true accuracy *a*, and the model’s performance on this task is lower than we might have anticipated at first.

Since accuracy and precision are both likely to be important for our customers, we can define a special average of the two:

$$F_1 = \frac{2ap}{a + p}$$

This metric, commonly called the *F1* score, will be low if either the accuracy *a* or the precision *p* is low, and thus should prevent surprises like the one in our example. (This type of average is known as a *harmonic mean*, and is typically the correct method for averaging rates.)

While the *F1* score is a very useful metric, it only measures the model’s performance on the *true* class. Since we expect that correctly predicting both true and false classes is important for our customers, we therefore quote a new performance metric defined as:

$$S = \frac{2R}{2R + W}$$

where *R* is the total number of correct predictions and *W* is the total number of incorrect predictions. This is a generalization of the *F1* score to include the performance of the false class, and we believe it provides a good, conservative estimate of a model’s performance, regardless of how you intend to use it.
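Putting the pieces together, every quantity above can be computed from the four confusion-matrix counts. A small sketch (variable names are illustrative; the counts plugged in are the rare-purchase example from earlier, with 1,000 calls and a 10 percent conversion rate):

```python
def performance_scores(tp, tn, fp, fn):
    """Compute the old and new metrics from confusion-matrix counts."""
    true_accuracy = tp / (tp + fn)        # a: true calls predicted true
    false_accuracy = tn / (tn + fp)       # false calls predicted false
    precision = tp / (tp + fp)            # p: true predictions that are correct
    f1 = 2 * true_accuracy * precision / (true_accuracy + precision)
    right, wrong = tp + tn, fp + fn       # R and W
    score = 2 * right / (2 * right + wrong)  # generalized F1 over both classes
    return true_accuracy, false_accuracy, precision, f1, score

a, b, p, f1, s = performance_scores(tp=90, tn=810, fp=90, fn=10)
print(round(p, 2), round(f1, 2), round(s, 2))  # 0.5 0.64 0.95
```

Notice that the precision and *F1* score expose the rare-event problem (both well below the 90 percent accuracies), which is exactly why the old single-number accuracy could mislead.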
