In my last post, I set the stage for this series on using Machine Learning to optimize predictions for the NHL Draft. Specifically, I’ll be concentrating on a dataset of 559 CHL forwards from 1996-2010, measured across 57 different variables. We ended here:
I created the above plot using t-SNE to reduce my dimensions from 57 down to 2 — a number small enough to actually be visualized along the 2 axes in the above graphic. It gives you a sense of where NHL players can be found among the generic masses of CHL forwards. The axes’ numbering isn’t important — it’s the clustering of NHL forwards that fascinates me. If your prospective player is down and to the right, he’ll be in better and better company.
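For anyone who wants to reproduce something like that plot, here’s a minimal sketch of the t-SNE step in scikit-learn. The data below is a random stand-in for the real 559 × 57 matrix of CHL forward stats, and the perplexity is just the library default — not necessarily the settings I used.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Placeholder stand-in for the real 559 x 57 matrix of CHL forward stats
rng = np.random.default_rng(0)
X = rng.normal(size=(559, 57))
is_nhl = rng.binomial(1, 0.18, size=559)  # 1 = played in the NHL, 0 = did not

# Scale the variables, then project the 57 dimensions down to 2 with t-SNE
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    StandardScaler().fit_transform(X)
)

# Scatter the 2-D embedding, colouring NHL players differently from the rest
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=is_nhl, cmap="coolwarm", s=15)
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()
```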
From this point on, I’ll be trying a series of supervised machine learning models and selecting the one that best predicts unseen data. But first, some general ground rules:
- I’m going to separate my 559 data points into two sample sets: an 80% training set and a 20% test set. The training set will be used to build and select my model, while the test set will be used to confirm its performance on entirely unseen data.
- For my 80% training set, I’ll be using 4-fold cross-validation to select the best model parameters. Essentially, this breaks my 80% training set into 4 different blocks, each holding 20% of my overall sample. We’ll train a model on three of these blocks and apply it to the last (held-out) block, rotating four times so that each block is predicted at some point as unseen data. We’ll select the best model by picking the one with the highest ‘score’.
- The general idea is seen in the graphic above. The green block that’s held out will rotate across all four positions. Using this kind of cross-validation approach helps ensure that the model we select will generalize well on unseen data. We’ll be judging overall performance, and ultimately selecting our model, on this cross-validated performance on the 80% training data. The final performance on our 20% test set is just to confirm that nothing went wildly wrong on that set of players.
- So what do we mean by ‘performance’? In this context, my models are classification models: they try to bucket predictions into discrete categories instead of predicting some kind of quantity. I’ll be predicting “NHL player” and “Non-NHL player” as 1’s and 0’s — this makes it a classification task. I could have structured this as a regression task by trying to estimate career NHL games played or something similar, but I prefer the cleanliness of just two categories — it makes the results easier to communicate and visualize.
- Traditionally, one would use accuracy as the major performance metric, ie how accurate my predictions are that someone will be an NHL Player or Non-NHL Player. In the graphic below, you would calculate accuracy by adding up the green blocks and dividing by the total number of samples. But this dataset is imbalanced, meaning one label heavily outweighs the other — there are waaaayyyy more non-NHL players than NHL players. I could have a very accurate model simply by guessing that all CHL forwards are non-NHL players (with 82% accuracy!!).
- So, what will we do in this case? I’ll use something called the F1 score. It’s the harmonic mean of a model’s precision and recall — both concepts are covered in the above graphic. Basically, recall measures how many actual NHL players my model accounts for (eg there were 100 NHL players and I found 50 of them), while precision measures how accurate my guesses are (eg I guessed that there were 75 NHL players and, of these, 60 ended up actually being NHL players). The F1 score is simply the harmonic mean of these two percentages. Using F1 instead of accuracy forces my model to concentrate on the positive class labels — in this case, actual NHL players. Yes, avoiding Non-NHL players is important, but at the end of the day, I want this model to suggest who to TAKE, and not who NOT to take. I want it to be a bit aggressive and take some risks, basically. (A rough code sketch of this whole setup — the split, the folds, and the F1 scoring — follows this list.)
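To make those ground rules concrete, here’s a rough scikit-learn sketch of the 80/20 split, the 4-fold cross-validation, and the F1 scoring. The data is a random placeholder standing in for the real 559 forwards, and the bare DecisionTreeClassifier is just an example model, not the tuned one described later.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: in the real project this is 559 CHL forwards x 57 variables,
# with y = 1 for NHL players and 0 for everyone else (~18% positive class)
rng = np.random.default_rng(42)
X = rng.normal(size=(559, 57))
y = rng.binomial(1, 0.18, size=559)

# 80/20 split, stratified so the NHL / non-NHL imbalance is preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 4-fold cross-validation on the training portion, scored with F1 rather than accuracy
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_train, y_train,
                         cv=cv, scoring="f1")
print("Mean cross-validated F1:", scores.mean())
```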
Classification Tree 1: 2-dimensional
In the very first image of this post, you can see the results of a dimensionality reduction from my original 57 variables down to just 2 using t-SNE. The simplest way I can think of to attack this problem with classification trees is to use just these 2 dimensions. Can you look at the graph and pick out areas of it as “NHL Player” and “Non-NHL Player”? If a new dot was dropped into it, could you be confident that your zones would correctly classify that player into either bucket? How would you know? How would you know it’s not overfitting?
I’ll use a Decision Tree Classifier to decide which portions of that 2-dimensional space should be “NHL Player” and which should be “Non-NHL Player”. How will I optimize this? I’ll tune the hyperparameters of my model using a cross-validated grid search. Essentially, this is a brute-force approach to finding the hyperparameters that result in the best performance on unseen data: things like tree depth, the minimum number of samples in each leaf node, how many features are considered at each node split, etc. I’ll do an exhaustive search over all these model parameters to find the combination that maximizes the cross-validated F1 score across my 4 training folds. Here’s the resulting classification tree, built with the tuned parameters and fit on all 447 training examples:
The optimal model hyperparameters found through model tuning were:
- Max tree depth = 4
- Min % of samples in each leaf node = 15%
- Splitting criterion used = entropy gain
- Class weighting on negative / positive labels = 25% / 75%
These parameters gave the highest cross-validated performance on unseen data, helping the tree avoid overfitting by pruning it back to what you see above.
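For the curious, a cross-validated grid search like the one described above could be set up in scikit-learn roughly like this. The parameter ranges are illustrative guesses rather than the exact grid I searched, and the 2-dimensional training data is again a random placeholder.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Placeholder training data: the two t-SNE coordinates for the 447 training examples
rng = np.random.default_rng(1)
X_train_2d = rng.normal(size=(447, 2))
y_train = rng.binomial(1, 0.18, size=447)

# Candidate hyperparameter grid -- illustrative ranges, not the exact grid searched here
param_grid = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_leaf": [0.05, 0.10, 0.15, 0.20],   # as a fraction of training samples
    "criterion": ["gini", "entropy"],
    "class_weight": [None, {0: 0.25, 1: 0.75}, "balanced"],
}

# Exhaustive search over the grid, scoring each combination with 4-fold cross-validated F1
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=42),
)
search.fit(X_train_2d, y_train)
print(search.best_params_, search.best_score_)
```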
The Y-value and X-value you see in this table correspond directly to the gobbledy-gook x and y axes seen in the first image of this post. This tree is essentially drawing a shape on that graph space that says “pick players in this shape” and “ignore players everywhere else”. What do those shapes look like?
In the above graphic, the shaded pink area is where the tree says you should select players; any player that lands outside the pink area should be ignored. You can read this shape directly from the tree: Y-values less than 5, X-values greater than 10. In my training set of 447 examples, it would have suggested taking 70 of them. Building trees with the optimized hyperparameters resulted in a cross-validated F1 score of 59.4% — meaning that the quality of its guesses and its ability to find all the NHL players in the data averaged out to about 59.4%.
Applying this model to the 20% test set gives the results seen above. It made 12 predictions that CHL players would be NHL Players, and it got 10 of them correct (83.3% precision). Out of 20 actual NHL Players, it found 10 of them (50% recall). The harmonic mean of these two scores is 62.5% — the F1 score of this model on the test set. This test-set performance of 62.5% is very close to the cross-validated training score of 59.4%, suggesting that the model generalizes well to completely unseen data.
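To show exactly where those numbers come from, here’s a small worked example using scikit-learn’s metrics. The label vectors below are toy reconstructions built to match the test-set counts above (12 positive predictions, 10 of them correct, 20 actual NHL players, roughly 112 test players), not the real test data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels matching the counts quoted above:
# 20 actual NHL players (10 found, 10 missed), then 92 non-NHL players (2 false positives)
y_test = [1] * 20 + [0] * 92
y_pred = [1] * 10 + [0] * 10 + [1] * 2 + [0] * 90

print(precision_score(y_test, y_pred))  # 10 / 12  ~ 0.833
print(recall_score(y_test, y_pred))     # 10 / 20  = 0.500
print(f1_score(y_test, y_pred))         # harmonic mean ~ 0.625
```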
For one last exercise, I’d take this model one step further before actually using it on new players — I’d retrain the model on ALL of my 559 data points using the same model parameters found above. This is the decision space it finds:
You can see that it uses the extra 20% of samples to learn a new shape for NHL players in pink. I’ve put in a few reference points so you can see where some famous NHL players would have been in this 2-dimensional space. A secondary box has been added in the middle — I think of this almost as the runner-up space — “if there’s nothing available in the far right box, you might want to check out this value bin over here!”.
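If you want to draw decision regions like these yourself, one way to do it is to refit a tree with the tuned hyperparameters, score a dense grid of points covering the plot, and shade the areas the tree predicts as NHL players. The 2-D data here is again a random placeholder rather than the real t-SNE coordinates.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Placeholder: the full 559 examples in the 2-D t-SNE space
rng = np.random.default_rng(2)
X_2d = rng.normal(scale=5, size=(559, 2))
y = rng.binomial(1, 0.18, size=559)

# Refit the tree on all 559 points using the tuned hyperparameters listed above
tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=0.15, criterion="entropy",
    class_weight={0: 0.25, 1: 0.75}, random_state=42,
).fit(X_2d, y)

# Score a grid covering the plot, then shade the regions predicted as "NHL player"
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 300),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 300))
zz = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, levels=[-0.5, 0.5, 1.5], colors=["white", "pink"], alpha=0.6)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", s=15)
plt.show()
```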
In any case, this was a bit of a toy exercise just to introduce classification trees and show how powerfully they can be visualized with decision regions. I’ve constrained this model to just 2 dimensions when I actually have 57 to choose from. Next time, I’ll be exploring those extra dimensions.