In my last post, I introduced the concept of classification trees to predict NHL players in the entry draft. In that post, I built a simple tree that explored a 2-dimensional space and used held-out data to select the decision shape that predicted best. In this post, I'll allow a classification tree to explore all 57 variables in my dataset.
Let’s start with a high-level explanation of what classification trees actually do. How do they decide what to split on? And where? Here’s a diagram to start wrapping your head around the concept:
Our response variable is a chaotic collection of negative and positive class labels. In the context of the image, NHL players might be the green pluses, while non-NHL players might be the red minuses. At each node split, the tree is looking for the single variable that best sorts the NHL players and non-NHL players into like buckets. After the split, I want each child node to hold a more uneven (more one-sided) collection of these labels than the parent node did.
What do we mean by 'best sorts'? How can I measure this? Luckily, some very smart people worked out this logic over a half-century ago. It typically uses a concept called entropy. In this context, entropy is simply a measure of how much 'information' a collection of binary items holds. The more intermingled the positive and negative class labels are, the more information they carry. Think of it as flipping coins: with a fair coin, there's no shorter way for me to tell you the results than reporting every single flip. With a double-headed coin, we both know the result will be heads, so there's no need to tell you anything. Rule of thumb: the more uneven the class labels are, the less information they hold.
So when I'm looking for the best variable to split on, I'm looking for the one that reduces the entropy of the child nodes versus the parent node the most. Said another way, this is the variable that creates the biggest reduction in randomness. Where to split on that variable is simply a matter of finding the point that maximizes that overall information gain.
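To make that concrete, here's a minimal sketch in plain NumPy (not the actual code behind this series) of how entropy and information gain would be computed for one candidate split:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a binary label array."""
    p = np.mean(labels)  # proportion of positives (NHL players)
    if p == 0 or p == 1:
        return 0.0       # a perfectly one-sided node holds no information
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(labels, split_mask):
    """Parent entropy minus the weighted entropy of the two child nodes."""
    left, right = labels[split_mask], labels[~split_mask]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))

# Toy example: 10 prospects, 1 = made the NHL, 0 = did not
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
led_team_in_scoring = np.array([True, True, True, False, False,
                                False, False, False, True, False])
print(information_gain(y, led_team_in_scoring))  # ~0.97 bits: a very clean split
```

The tree evaluates this gain for every variable and every candidate split point, and greedily keeps the best one at each node.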
If left unfettered, a classification tree will simply keep on splitting until it perfectly classifies every observation. In this sense, trees are highly expressive functions, drawing a massive amount of tiny little decision zones in multi-dimensional space. Our trees need to be ‘pruned back’ through methods like cross-validation to ensure they generalize well on new data and aren’t overfitting to our training data.
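As a sketch of what that pruning step can look like in practice, here's one common recipe using scikit-learn's cost-complexity pruning with cross-validation; it illustrates the idea rather than the exact procedure behind this series, and `X`/`y` are placeholders for the training features and labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# An unconstrained tree grown with the entropy criterion described above.
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)

# Candidate pruning strengths: higher ccp_alpha means more aggressive pruning.
param_grid = {"ccp_alpha": np.linspace(0.0, 0.05, 26)}

# 5-fold cross-validation keeps the alpha that generalizes best on held-out folds,
# scored with F1 since NHL players are the rare class.
search = GridSearchCV(tree, param_grid, cv=5, scoring="f1")
search.fit(X, y)  # X, y = the 57-variable training set and NHL/non-NHL labels
pruned_tree = search.best_estimator_
```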
Classification Tree 2: Full Data
With that little intro out of the way, let’s explore which tree best classifies our data across the 57 original variables! (I hope you’re expecting something incredibly complex and exciting!) Wah wah:
So out of those 57 variables, our tree was only allowed to keep splits involving 4 of them after pruning. The very first split, meaning the variable the model found best at reducing randomness, was a custom variable I've created to reflect whether someone led their team in offense. Of course, I'm judicious enough to give someone points if they, say, were damn close to leading their team and spent a decent amount of time away from the top scorer, etc., etc. It's not a variable you can just download from a website and cram into your dataframe. You need pretty thorough domain knowledge to understand why a feature should be engineered this way. If it were a substanceless variable, a model like this would ignore it. It just so happens that this variable best reduces randomness in our dataset and holds up to strict cross-validation. Yay us.
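Purely to give a flavour of that kind of engineered feature, here's a hypothetical sketch; the column names, the points-per-game normalization, the 10% threshold, and the partial-credit scheme are all invented placeholders, not the actual definition of my variable:

```python
import numpy as np
import pandas as pd

def add_led_team_offense(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical 'led their team in offense' feature.
    Assumes placeholder columns: points, games_played, team."""
    df = df.copy()
    # Points per game, so a prospect who missed time isn't punished for it.
    df["ppg"] = df["points"] / df["games_played"]
    team_best_ppg = df.groupby("team")["ppg"].transform("max")
    # Full credit for leading the team, partial credit for being close.
    df["led_team_offense"] = np.select(
        [df["ppg"] >= team_best_ppg, df["ppg"] >= 0.9 * team_best_ppg],
        [1.0, 0.5],
        default=0.0,
    )
    return df
```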
If you do approach leading your team in scoring, the tree next wants to split on normalized powerplay goals, funnily enough. It doesn't really matter in this case, as everyone getting to this point is going to be labeled an NHL player regardless of how many PPG they scored. But I suppose someone who has a lot of PPG and leads their team in scoring is a pretty damned good place to start identifying future NHL players.
If you don't lead your team in scoring, you must be hooped, right? Well, it turns out you have one last gasp — I've also created a custom variable called '2nd line scoring' that tries to find players who may have been buried on a 2nd line but did well given their opportunity. Sometimes you see this on very good, very old, or very top-heavy CHL teams. 24 players in our 447-player training set hit this standard. These are the 'value bin' players that you likely have a shot at in later rounds.
So, how did this model do compared to our cute little 2D model from last time? The cross-validated training set had an F1 score of 64.7%, meaning that the harmonic mean of a) the % of real NHLers it found (recall) and b) the % of its guesses that were right (precision) was 64.7%. Remember, this is still on unseen data during cross-validation. The model from last time had a score of 59.4%. So allowing the model to explore the entire dataset made our model about 5 percentage points "better" (better at finding and better at guessing).
Interestingly, though, this model performed worse on the 20% test set, garnering an F1 score of 51.6% versus 62.5% from last time (seen above). Remember, performance on the test set won’t influence the selection of our model, but it will provide some feedback on what kind of performance we might expect on new data. It picked 8 out of 20 actual NHL players, and 8 out of its 11 guesses were correct. Better than a monkey (and several scouting staffs) but this will be (by some margin) one of the weakest performances on our test set throughout the series.
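Those two test-set numbers are exactly where the 51.6% comes from; F1 is the harmonic mean of recall and precision, which you can check in a couple of lines:

```python
precision = 8 / 11  # correct picks out of all the picks the model made
recall = 8 / 20     # actual NHL players the model managed to find
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.516
```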
Just for funsies, let’s see how it trains against the entire dataset (the last step I would take before applying it to new players this year).
I'm using the t-SNE two-dimensional plot (again, purely for visualization purposes), so the axis numbers do not mean anything. The dark pink dots are my true positives (I predicted NHL player and they actually were), the light pink dots are false positives (I predicted NHL player and they were not), the darker blue dots are false negatives (I predicted non-NHL player but they actually were an NHL player), and the light blue dots are true negatives (I predicted non-NHL player and they weren't). You can see that this model is quite shy about taking any guesses in the wide blue section of 'hard' guesses. This is where you'll likely find deeper-round gems still available.
So, this model seems perfectly fine at identifying the obvious players, but less good at finding anything interesting among the 'generic' masses.
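If you want to build that kind of plot yourself, here's a rough sketch using scikit-learn's t-SNE and matplotlib; `X`, `y_true`, and `y_pred` are placeholders for the full feature matrix, the real labels, and the refit tree's predictions, and the colours only approximate the ones above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the full feature matrix down to 2 dimensions for visualization only.
coords = TSNE(n_components=2, random_state=42).fit_transform(X)

# Bucket every player into one of the four confusion-matrix cells.
outcome = np.select(
    [(y_pred == 1) & (y_true == 1),   # true positive
     (y_pred == 1) & (y_true == 0),   # false positive
     (y_pred == 0) & (y_true == 1)],  # false negative
    ["TP", "FP", "FN"],
    default="TN",
)

colors = {"TP": "deeppink", "FP": "lightpink", "FN": "darkblue", "TN": "lightblue"}
for cell, color in colors.items():
    mask = outcome == cell
    plt.scatter(coords[mask, 0], coords[mask, 1], c=color, label=cell, s=20)
plt.legend()
plt.title("t-SNE projection, coloured by prediction outcome")
plt.show()
```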
Next time, I’ll release a classification tree on a wild exploration of re-oriented dimensional space somewhere in-between the first and second models.