Model Evaluation

An interactive explainer for how to measure model performance.


Created by Lucas Gover

The Premise

Alex, a Democratic candidate, is deciding how to target her door-to-door canvassing effort. She surveyed 100 people in her district and is now using that information to evaluate three models:

  • Climate Score
  • Cat Favorability Score
  • Party Score

Click on a score button to change which score voters are arranged by.

Alex's poll asked each person whether they supported her or her opponent. Blue dots represent Alex's supporters; red dots represent people who support her opponent.

Continue clicking on different score buttons to compare each model to the others.

Distribution

The distribution chart shows how many individuals are at each point in the score range.

Validation

Validation charts show how well a model measures the probability that a person is a supporter. In a perfect validation, 50% of the people with a score of 50 are supporters, 25% of the people with a score of 25 are supporters, and so on.


We want the validation to show a stairstep pattern, where the % of individuals in the positive class aligns with the score range.
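One way to sketch this check in code: bin voters by score, then compare each bin's observed share of supporters to the scores in that bin. The function name and the toy data below are illustrative, not from the article.

```python
# Hypothetical validation check: within each score bin, compute the
# observed share of supporters. A well-calibrated score produces shares
# that climb in step with the bins (the "stairstep" pattern).
def validation_bins(scores, is_supporter, bin_width=25):
    """Return {bin_start: observed share of supporters} per score bin."""
    bins = {}
    for s, y in zip(scores, is_supporter):
        start = min(int(s // bin_width) * bin_width, 100 - bin_width)
        bins.setdefault(start, []).append(y)
    return {start: sum(ys) / len(ys) for start, ys in sorted(bins.items())}

# Toy data: 8 voters with scores and supporter labels (1 = supporter).
scores       = [10, 20, 30, 40, 60, 70, 80, 90]
is_supporter = [ 0,  0,  0,  1,  1,  0,  1,  1]
print(validation_bins(scores, is_supporter))
# → {0: 0.0, 25: 0.5, 50: 0.5, 75: 1.0}
```

With this toy data the shares rise from 0.0 to 1.0 across the bins, roughly the stairstep a well-validated score should show.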

Classification

For Alex, this model is primarily used for classification. Everyone to the left of the line is classified as a supporter; everyone to the right is classified as supporting her opponent.

Alex can make the classification cut at different points. This universe contains 0% of the population.

0% of this universe is classified incorrectly

Move the cursor left to right across the chart to set the classification cut at different points.
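A minimal sketch of what a cut does, assuming higher scores predict "supporter" (the article's chart may orient the cut the other way); all names and data are made up for illustration:

```python
# Classify everyone with a score at or above `cut` as a supporter,
# then measure the share of voters the cut classifies incorrectly.
def misclassification_rate(scores, is_supporter, cut):
    errors = 0
    for s, y in zip(scores, is_supporter):
        predicted = 1 if s >= cut else 0
        errors += (predicted != y)
    return errors / len(scores)

scores       = [10, 20, 30, 40, 60, 70, 80, 90]
is_supporter = [ 0,  0,  0,  1,  1,  0,  1,  1]
print(misclassification_rate(scores, is_supporter, cut=50))  # → 0.25
```

Moving the cut is a trade-off: a lower cut catches more supporters but misclassifies more opponents, and vice versa.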

Confusion Matrix

To get a complete picture of the model's performance, Alex can look at a confusion matrix, which breaks down how the model performs at a given cut.

[Interactive chart: confusion matrix counts — True Positive, False Positive, False Negative, True Negative]

Click around on the graph to view the confusion matrix at different cuts.
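The four cells of the matrix can be computed directly from the data. This is an illustrative sketch, again assuming higher scores predict the positive class (supporter):

```python
# Count true/false positives and negatives at a single classification cut.
def confusion_matrix(scores, is_supporter, cut):
    tp = fp = fn = tn = 0
    for s, y in zip(scores, is_supporter):
        predicted = s >= cut
        if predicted and y:        tp += 1  # correctly called a supporter
        elif predicted and not y:  fp += 1  # opponent mislabeled a supporter
        elif not predicted and y:  fn += 1  # supporter mislabeled an opponent
        else:                      tn += 1  # correctly called an opponent
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

scores       = [10, 20, 30, 40, 60, 70, 80, 90]
is_supporter = [ 0,  0,  0,  1,  1,  0,  1,  1]
print(confusion_matrix(scores, is_supporter, cut=50))
# → {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```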

Another useful way to look at this is the True Positive Rate and the False Positive Rate. The True Positive Rate is defined as TP / (TP + FN): out of all the supporters, how many are currently classified as supporters.

The False Positive Rate measures, out of all the non-supporters, how many are classified as supporters, or FP / (FP + TN).

True Positive Rate = 0.10%

False Positive Rate = 0.01%

Click around on the chart to measure the TPR and the FPR at different cuts.
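The two rates follow directly from the confusion-matrix cells. A small sketch (the counts below are illustrative):

```python
# Compute TPR and FPR from the four confusion-matrix counts.
def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)  # of all supporters, share classified as supporters
    fpr = fp / (fp + tn)  # of all non-supporters, share classified as supporters
    return tpr, fpr

# E.g. a cut that yields TP=3, FP=1, FN=1, TN=3:
print(rates(tp=3, fp=1, fn=1, tn=3))  # → (0.75, 0.25)
```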

The ROC

Plotting these numbers against each other creates a Receiver Operating Characteristic (ROC) curve. At each cut, plot the false positive rate on the X axis and the true positive rate on the Y axis. This measures how well a score rank-orders predictions.

True Positive Rate = 0.10%

False Positive Rate = 0.01%

[Interactive ROC chart: False Positive Rate (X axis) vs. True Positive Rate (Y axis)]

Click around on the chart to plot the TPR and the FPR at different cuts.
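Tracing the whole curve amounts to repeating the TPR/FPR calculation at every possible cut. A self-contained sketch with made-up data:

```python
# Sweep the cut from the highest score down, collecting one
# (FPR, TPR) point per cut; together the points trace the ROC curve.
def roc_points(scores, is_supporter):
    p = sum(is_supporter)            # total positives (supporters)
    n = len(is_supporter) - p        # total negatives (non-supporters)
    points = []
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_supporter) if s >= cut and y)
        fp = sum(1 for s, y in zip(scores, is_supporter) if s >= cut and not y)
        points.append((fp / n, tp / p))
    return points

scores       = [10, 20, 30, 40, 60, 70, 80, 90]
is_supporter = [ 0,  0,  0,  1,  1,  0,  1,  1]
print(roc_points(scores, is_supporter))
```

The first point (highest cut) sits near the origin, and the last point (lowest cut, everyone classified a supporter) is always (1.0, 1.0).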

Area Under Curve

To measure how well a model rank orders voters, we look at the Area Under the Curve (AUC). For a very accurate model, the ROC curve will fill nearly the entire graph, scoring an AUC close to 1. A model no better than random guessing hugs the diagonal, covering about half the chart, with an AUC around 0.5.

[Interactive ROC chart with AUC readout]

[Interactive comparison: AUC for Climate Score, Cat Favorability, and Party Score]
Thanks for reading!
If you have any questions or comments about this visualization, get in touch at lucasgover@gmail.com. Check out my other projects here.