In my series of posts on building predictive models for the NHL Entry Draft, I’ve explored using classification trees both on a full set of 57 variables and on a 2D fun-sized version of those 57 variables. In today’s post I’m going to explore using a) feature selection and b) dimensionality reduction to transform my data to juice the empirical performance as much as possible. I’m also going to end with a table that shows all 3 of my major models’ performance so far on the test set of past players.
First, let’s do some feature selection. Why select features? Well, I’ve gathered 57 variables that (on an individual or collective basis) may or may not have any relevance in predicting whether teenagers eventually become NHL players. Imagine I had a variable that tracked how many times a player wore a hat during his 18-year-old season — does this have any relevance in predicting his future career in the NHL? I’d like to think so, but in reality, probably not. We need to give ourselves a chance to kick those useless variables out.
There’s a lot of ways to accomplish this, but I’ll be using Recursive Feature Elimination (RFE) via sklearn’s RFECV function. This function considers smaller and smaller subsets of variables and selects the subset that performs optimally during cross-validation, kicking the rest out. In this way, it whittles down our variables while maintaining the best possible generalization performance. We have to specify the minimum number of variables desired and the number of variables to kick out at each step. You also have to specify a base model to use for the cross-validation exercise — for anyone who cares, I used a generic support vector machine with a linear kernel to keep things easy.
I’ve created two different subsets of variables: I let the first pick a subset of variables by kicking one crappy variable out at a time — this ended up settling on 31 variables out of the original 57 — taking any more away hurt the model’s predictive performance on unseen data. For the second model, I wanted to get a bit more brutal, and asked it to kick out 16 at a time in large steps — it did this twice, ending with 25 variables out of the original 57.
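The two RFE runs described above can be sketched with sklearn’s `RFECV`. This is a sketch on synthetic stand-in data, not my real draft dataset — the variable names and the `make_classification` call are my own placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Stand-in data shaped like the real dataset: 57 candidate features
X, y = make_classification(n_samples=300, n_features=57,
                           n_informative=10, random_state=0)

# Generic linear-kernel SVM as the base estimator
svm = SVC(kernel="linear")

# First run: drop one crappy variable at a time
rfe_slow = RFECV(estimator=svm, step=1, cv=5, scoring="f1")
rfe_slow.fit(X, y)

# Second run: get brutal and drop 16 variables per step
rfe_fast = RFECV(estimator=svm, step=16, cv=5, scoring="f1")
rfe_fast.fit(X, y)

# How many features survived each run
print(rfe_slow.n_features_, rfe_fast.n_features_)
```

After fitting, `support_` gives a boolean mask of the surviving columns, which is handy for slicing the original feature matrix down to the selected subset.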
Now, there’s really no point in feeding these raw subsets to a classification tree, since the trees will only split on the most important variables anyways. To spice things up, I’m going to linearly project both of these subsets into sequentially smaller dimensional space using a technique called Principal Component Analysis (or PCA). There’s no easy way to describe what PCA does in one line: it finds a series of eigenvectors (the principal components) ordered so that each one captures as much of the data’s remaining variance as possible. Each component is simply a linear combination of my original variables.
Basically, I’m going to transform my data into just one dimension (a line) that reproduces as much of the original data’s variance as possible. Then I’ll do this for 2 dimensions, 3 dimensions, and so on, all the way up to as many dimensions as the original dataset had. Along the way, I’ll be testing which number of dimensions created the best-performing classification tree on my training data when predicting unseen data (again, using the F1 score, a combination of the model’s ability to find all actual NHL players and to make accurate guesses about NHL players).
The results of applying this PCA transformation to both my 31 variable subset and 25 variable subset and seeing which created the best-performing Classification Trees were as follows:
The best cross-validated F1 score I found was transforming the 25 best features down to 18 dimensional space (or, principal components).
The fact that some hacked-up linear combinations of my original data that are, in a sense, just doing the best they can to mimic the original data can compete with (and sometimes outperform) that original data in a predictive modeling exercise did and does blow my mind. There’s no other way to find this stuff out than playing around and testing it out.
Using this 18D transformed data of the 25 best features, the classification tree parameters that maximized performance during cross-validation were:
- Max Tree Depth: 3
- Min number of samples per leaf node: 4%
- Negative / Positive class weightings: 35% / 65% –> this just “reaffirms” to the model that finding NHL players is more important than finding non-NHL players
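In sklearn terms, those tuned parameters translate to something like the tree below. The synthetic 18-column data is a stand-in for the real PCA-transformed features, so treat this as a sketch rather than the actual model:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the 18-component PCA-transformed training data
X, y = make_classification(n_samples=300, n_features=18, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=3,                      # max tree depth: 3
    min_samples_leaf=0.04,            # min 4% of samples per leaf node
    class_weight={0: 0.35, 1: 0.65},  # upweight the positive (NHL) class
    random_state=0,
).fit(X, y)

print(tree.get_depth())
```

Note that `min_samples_leaf` accepts a float, which sklearn interprets as a fraction of the training samples rather than an absolute count.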
Using these parameters, the model reached a cross-validated F1 score of 61.6% on the training set and 60.0% on the test set. You can compare all three main models I’ve built in the following table:
So, the best training set F1 performance was found on the tree built using all 57 variables from my last post, while the best test set performance was found using my plucky little 2D model built in [the first post on classification trees](http://boysonthebus.com/index.php/2019/04/06/nhl-draft-series-part-2-classification-trees-1/). However, the middle-performing model on both the training and test sets was today’s crazy interdimensional hodgepodge. And today’s model actually had the best recall on the test set of all three models, meaning that it found 60% of the actual NHL players that could be found in the test set.
I won’t show you the tree today’s model actually builds since the 18 dimensions it’s choosing from don’t really mean much (“it split on principal component 4 at -0.21!”). But if I retrain on all 559 data samples, I get the following visualization:
You can see that this model is a little less shy about reaching into the ‘mushy middle’ of generic CHL players in the middle of this chart space. It predicts that some of these seemingly innocuous players will become NHL players, and actually gets some of these predictions correct! Ballsy, right?
This will be the last classification tree I’ll build for this series. Funnily enough, classification trees are usually among the worst-performing machine learning models during benchmarking, and that will be no different here. They’re cute, easy to visualize and explain, but ultimately they will be outperformed considerably in future posts.
So, of the three models, what actual, specific predictions are they making on our test set? I’ll end today’s post with a table of the members of the test set, along with their actual NHL career (remember, 1=NHL Player, 0=Non-NHL Player) and the predictions made by the three models so far. You’ll notice I count Jason Chimera as a ‘1’ — I decided to prorate the 2004-2005 lockout season for players that get hosed out of NHL points using the average of the two seasons surrounding the lockout. This bumps Mr Chimera (and a few others) into ‘NHL Player’ status in his 7 post-CHL seasons.
# | NHL | Name | 2D t-SNE | All Data DT | 18D PCA DT |
---|---|---|---|---|---|
504 | 0 | Shane Endicott | 0 | 0 | 0 |
408 | 0 | Kent McDonell | 0 | 0 | 0 |
66 | 1 | Jason Chimera | 0 | 0 | 1 |
338 | 0 | Eric Bowen | 0 | 0 | 0 |
233 | 0 | Chris Durand | 0 | 0 | 1 |
440 | 0 | Karel Mosovsky | 0 | 0 | 0 |
498 | 0 | Craig Cunningham | 0 | 0 | 0 |
538 | 0 | Mark Mancari | 0 | 0 | 0 |
355 | 0 | Zac Rinaldo | 0 | 0 | 0 |
507 | 0 | Ladislav Kouba | 0 | 0 | 0 |
211 | 1 | Kris Versteeg | 0 | 0 | 0 |
557 | 0 | Ryan Thorpe | 0 | 0 | 0 |
78 | 0 | Greg Nemisz | 0 | 1 | 0 |
37 | 1 | Andrew Ladd | 1 | 1 | 1 |
103 | 1 | Daniel Paillé | 0 | 0 | 0 |
551 | 0 | Colin Long | 0 | 0 | 0 |
196 | 0 | Dusty Jamieson | 0 | 0 | 0 |
404 | 0 | Colton Gillies | 0 | 0 | 0 |
308 | 0 | Kyle DeCoste | 0 | 0 | 0 |
243 | 0 | Michael Latta | 0 | 0 | 0 |
157 | 0 | Denis Shvidki | 0 | 0 | 1 |
513 | 0 | Josh Beaulieu | 0 | 0 | 0 |
505 | 0 | Anthony Peluso | 0 | 0 | 0 |
534 | 0 | Chris Berti | 0 | 0 | 0 |
249 | 1 | Scott Gomez | 0 | 0 | 0 |
242 | 0 | Brent Gauvreau | 0 | 0 | 0 |
102 | 0 | Spencer Machacek | 0 | 0 | 0 |
389 | 0 | Brandon Segal | 0 | 0 | 0 |
75 | 0 | Quintin Laing | 0 | 0 | 1 |
186 | 0 | Zack Torquato | 0 | 0 | 0 |
346 | 0 | Brett Sonne | 0 | 0 | 0 |
10 | 1 | Bryan Little | 1 | 1 | 1 |
491 | 0 | Brad Voth | 0 | 0 | 0 |
320 | 0 | Ondrej Fiala | 0 | 0 | 0 |
457 | 0 | Shay Stephenson | 0 | 0 | 0 |
512 | 0 | Matt Tassone | 0 | 0 | 0 |
454 | 0 | Kris Hogg | 0 | 0 | 0 |
464 | 0 | Ryan Oulahen | 0 | 0 | 0 |
235 | 0 | Cameron Abney | 0 | 0 | 0 |
96 | 1 | Tyler Ennis | 1 | 0 | 1 |
245 | 0 | Nathan Barrett | 0 | 0 | 0 |
310 | 1 | Tyler Kennedy | 0 | 0 | 0 |
155 | 0 | Anthony Nigro | 0 | 0 | 0 |
401 | 0 | Cody Bass | 0 | 0 | 0 |
533 | 0 | Kris Newbury | 0 | 0 | 0 |
421 | 0 | Jonas Fiedler | 0 | 0 | 0 |
172 | 0 | Alex Hutchings | 0 | 0 | 1 |
333 | 0 | Tom Kostopoulos | 0 | 0 | 0 |
407 | 0 | Adam Berti | 0 | 0 | 0 |
162 | 0 | Chad Hinz | 0 | 0 | 0 |
15 | 1 | Dustin Brown | 1 | 1 | 1 |
508 | 0 | Richard Clune | 0 | 0 | 0 |
490 | 0 | Jordan Nolan | 0 | 0 | 0 |
285 | 0 | James Livingston | 0 | 0 | 0 |
379 | 0 | Shane Willis | 0 | 0 | 0 |
17 | 1 | Evander Kane | 1 | 1 | 1 |
319 | 0 | Kyle Chipchura | 0 | 0 | 0 |
278 | 0 | Brandon Prust | 0 | 0 | 0 |
506 | 0 | Brett Draney | 0 | 0 | 0 |
252 | 0 | Dustin Boyd | 0 | 0 | 0 |
463 | 0 | Andrew Peters | 0 | 0 | 0 |
45 | 1 | Oleg Saprykin | 1 | 0 | 1 |
445 | 0 | Adam Taylor | 0 | 0 | 0 |
519 | 0 | Matt Sommerfeld | 0 | 0 | 0 |
458 | 0 | Petja Pietiläinen | 0 | 0 | 0 |
179 | 0 | Justin McCrae | 0 | 0 | 0 |
236 | 0 | Colt King | 0 | 0 | 0 |
276 | 0 | Shay Stephenson | 0 | 0 | 1 |
175 | 0 | Cory Pecker | 0 | 0 | 0 |
90 | 0 | Corey Durocher | 0 | 0 | 0 |
550 | 0 | Cam Cunning | 0 | 0 | 0 |
46 | 1 | Jordan Staal | 1 | 1 | 1 |
140 | 1 | Eric Fehr | 0 | 0 | 0 |
298 | 0 | Michael Pelech | 0 | 0 | 0 |
21 | 1 | Chris Stewart | 1 | 1 | 1 |
118 | 1 | Kyle Brodziak | 0 | 0 | 0 |
89 | 0 | Sheldon Keefe | 1 | 0 | 0 |
340 | 0 | Eric Beaudoin | 0 | 0 | 1 |
239 | 0 | Stefan Legein | 0 | 0 | 0 |
165 | 0 | Jeff Lucky | 0 | 0 | 0 |
71 | 0 | Rico Fata | 0 | 1 | 1 |
450 | 0 | Ryan Milanovic | 0 | 0 | 1 |
329 | 0 | Garth Murray | 0 | 0 | 0 |
85 | 0 | Kiel McLeod | 0 | 1 | 0 |
250 | 0 | Warren McCutcheon | 0 | 0 | 0 |
555 | 0 | Joey Tenute | 0 | 0 | 0 |
188 | 0 | Cal O'Reilly | 0 | 0 | 0 |
108 | 1 | Jamie McGinn | 0 | 0 | 1 |
393 | 0 | Steven Crampton | 0 | 0 | 0 |
386 | 0 | Devin DiDiomete | 0 | 0 | 0 |
361 | 0 | Brad Twordik | 0 | 0 | 0 |
268 | 0 | Matt Kennedy | 0 | 0 | 0 |
132 | 0 | Derek Dorsett | 0 | 0 | 0 |
311 | 0 | Bryan Cameron | 0 | 0 | 0 |
313 | 0 | Kyle Freadrich | 0 | 0 | 0 |
414 | 0 | Ryan Held | 0 | 0 | 0 |
444 | 0 | Frazer McLaren | 0 | 0 | 0 |
524 | 0 | Jeremy Rondeau | 0 | 0 | 0 |
271 | 0 | Austin Watson | 0 | 0 | 0 |
284 | 0 | Garrett Bembridge | 0 | 0 | 0 |
542 | 0 | Marek Ivan | 0 | 0 | 0 |
312 | 0 | Preston Mizzi | 0 | 0 | 0 |
107 | 0 | Bobby Hughes | 0 | 0 | 0 |
382 | 0 | Geordie Wudrick | 0 | 0 | 0 |
494 | 0 | Scott Cameron | 0 | 0 | 0 |
219 | 1 | Wayne Simmonds | 0 | 0 | 0 |
1 | 1 | Tyler Seguin | 1 | 1 | 1 |
342 | 0 | Anton Borodkin | 0 | 0 | 0 |
159 | 1 | Troy Brouwer | 0 | 0 | 0 |
12 | 1 | Jeff Skinner | 1 | 1 | 1 |
343 | 0 | D.J. King | 0 | 0 | 0 |
142 | 0 | Christian Thomas | 1 | 0 | 0 |