In my previous posts, I’ve been using relatively simple tree-based models to visualize how classification works when sorting teenagers into two piles: “NHLer” and “Non-NHLer”. I went to fairly ridiculous lengths to squeeze as much predictive power out of those admittedly cute trees as I possibly could. In today’s post, I’m moving on to some of the heavier-duty machine learning models, starting with Support Vector Machines.
For now, I’ll concentrate on generating a single straight line in multi-dimensional space that best separates our two piles of players (NHL and non-NHL). Support vector machines work by maximizing the margin between two classes of data: I’m basically trying to clear the widest possible road between NHL players and non-NHL players, keeping as far away as I can from getting any of the predictions wrong.
In the illustrative 2-dimensional example above, the black lines are the furthest-apart parallel lines that can be drawn while still perfectly separating the blue points from the red (think of the colours as NHL and non-NHL). The points the black lines “rest” upon (that is, the points whose Lagrange multipliers are non-zero!) are known as the support vectors. Hence: “Support Vector Machines”. Pushing these black lines as far apart as possible is a way for us to trust our training data *the least*. We want to be as little wrong as we can be, to give us extra space when unseen data arrives (like a 6’10” second-line centre!).
The classification line will be the midpoint between these two lines, shown here as a dotted yellow line. Any new point falling on either side of this dotted line can then be easily classified into one of our two classes (NHLer or non-NHLer).
In a future post I’ll play with non-linear space, but for now we’ll just stick to these straight, linear classifiers.
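The linear setup above can be sketched in just a few lines of scikit-learn. This is a toy 2-D example on made-up blobs (not my hockey data), just to show where the support vectors come from:

```python
# Toy sketch of a linear SVM: two well-separated 2-D blobs standing in
# for the red and blue points in the illustration above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 3,   # "red" blob, class 0
               rng.randn(20, 2) + 3])  # "blue" blob, class 1
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The support vectors are the few points the margin boundaries rest on --
# the only points with non-zero Lagrange multipliers.
print(len(clf.support_vectors_))
print(clf.score(X, y))  # separable blobs, so training accuracy is perfect
```

Only a handful of the 40 points end up as support vectors; the rest could be deleted without moving the classification line at all.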
In my dataset, I had 559 samples with 57 features. What’s the best straight line I can draw in 57-dimensional space to separate our points? It turns out, a pretty good one. Here’s our optimal model applied to all data points:
On our training sample, 4-fold cross-validation gave an F1 score of 65.5% on the held-out folds, meaning that its combined ability to make accurate guesses about NHL players (precision) and to find all the actual NHL players in unseen data (recall) is about 65.5%. Its performance on our 20% test set was about 54.5%, which is a tad lower than I’d hoped. Here’s its performance versus our fun little classification trees:
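Fitting and scoring that 57-dimensional separator looks roughly like this. The `X` and `y` here simulate data of the same shape as my 559-sample feature matrix; the actual features and hyperparameters are of course different:

```python
# Hedged sketch: a linear SVM scored with 4-fold cross-validated F1,
# on simulated data matching my dataset's shape (559 samples, 57 features,
# with the NHL class as the rarer one).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=559, n_features=57,
                           weights=[0.8, 0.2], random_state=0)

# Scaling matters for SVMs: C penalizes margin violations in feature units.
model = make_pipeline(StandardScaler(),
                      LinearSVC(C=1.0, dual=False, max_iter=10000))

scores = cross_val_score(model, X, y, cv=4, scoring="f1")
print(scores.mean())  # mean F1 across the 4 held-out folds
```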
If I were picking a model right now out of what we’ve seen, I’d be selecting this one.
But I’m a curious sort: I’d like to apply the same inter-dimensional exploration I performed in my last post to this model. After selecting the best 25 features and then projecting them into progressively lower-dimensional space using principal component analysis, I’d like to find whichever number of dimensions maximizes this linear support vector machine’s ability to generalize well to unseen data.
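That sweep can be sketched as a scikit-learn pipeline so that the feature selection and PCA are re-fit inside each cross-validation fold (avoiding leakage). Again, the data here is simulated to match my dataset’s shape, and the hyperparameters are placeholders:

```python
# Hedged sketch: select the 25 best features, project to n principal
# components, and find the n that maximizes cross-validated F1.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=559, n_features=57,
                           weights=[0.8, 0.2], random_state=0)

results = {}
for n in range(2, 26):
    pipe = make_pipeline(
        StandardScaler(),
        SelectKBest(f_classif, k=25),   # keep the 25 most informative features
        PCA(n_components=n),            # project down to n components
        LinearSVC(dual=False, max_iter=10000),
    )
    results[n] = cross_val_score(pipe, X, y, cv=4, scoring="f1").mean()

best_n = max(results, key=results.get)
print(best_n, results[best_n])
```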
It turns out the magic number is 17 dimensions: that is, the 17 principal components of the original data that best explain its variance. When I apply this series of feature selection and dimensionality reduction steps, we get the following results (when applied to all samples):
This model’s cross-validated F1 score is up to 68.6%. Again, this combines the accuracy of its predictions (precision) with its ability to find actual NHL players (recall). Compare this to the other models:
So, this new funky linear support vector machine has by far the best cross-validated score on unseen data, though its performance on our final 20% holdout test set is still a bit lower at 58.1%. Its ability to find actual NHL players is about the same as our original support vector machine’s, but its ability to avoid false positives (predicting NHL careers for players who never make it, i.e. its precision) seems superior to the previous model in this post (82% versus 69%). Here is its performance on the test set:
So you can see that it leaves some actual NHL players behind (11 of 20, to be exact) but makes only 2 wrong guesses out of its 11 total guesses of ‘NHL Player’.
I’ve tuned the hyperparameters of this model to ensure the highest performance on unseen data. In this case, only the ‘C’ parameter (basically a penalty applied to data on the wrong side of the classification line) and the class weighting (how much emphasis to put on getting NHL vs. non-NHL predictions correct?) were tuned. After a ton of exploration, I honed both of these down to the following ranges, producing a fairly neat little 3D diagram showing the optimal point (in terms of cross-validated F1 score) for both parameters (a weighting of about 0.45 on the 0 class label, and a C of about 0.0084):
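That kind of two-parameter exploration maps naturally onto scikit-learn’s GridSearchCV. The grid below is purely illustrative (my real search was much wider before honing in on the winning values), and the data is again simulated:

```python
# Hedged sketch: tuning C and the class weighting with GridSearchCV,
# scored by cross-validated F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=559, n_features=57,
                           weights=[0.8, 0.2], random_state=0)

pipe = make_pipeline(StandardScaler(), LinearSVC(dual=False, max_iter=10000))

param_grid = {
    # C: how heavily to penalize points on the wrong side of the line
    "linearsvc__C": np.logspace(-3, -1, 5),
    # class_weight: how much emphasis the 0 (non-NHL) class gets vs. the 1 class
    "linearsvc__class_weight": [{0: w, 1: 1.0 - w}
                                for w in (0.35, 0.45, 0.55)],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=4)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Plotting `search.cv_results_` over the two axes is what produces a surface like the 3D diagram described above.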
To finish this post, here is a table that compares the NHL status of my test set with the labels predicted by this LinearSVC model.
  | NHL | Name | 17D LinearSVC
---|---|---|---
504 | 0 | Shane Endicott | 0 |
408 | 0 | Kent McDonell | 0 |
66 | 1 | Jason Chimera | 1 |
338 | 0 | Eric Bowen | 0 |
233 | 0 | Chris Durand | 0 |
440 | 0 | Karel Mosovsky | 0 |
498 | 0 | Craig Cunningham | 0 |
538 | 0 | Mark Mancari | 0 |
355 | 0 | Zac Rinaldo | 0 |
507 | 0 | Ladislav Kouba | 0 |
211 | 1 | Kris Versteeg | 0 |
557 | 0 | Ryan Thorpe | 0 |
78 | 0 | Greg Nemisz | 0 |
37 | 1 | Andrew Ladd | 1 |
103 | 1 | Daniel Paillé | 0 |
551 | 0 | Colin Long | 0 |
196 | 0 | Dusty Jamieson | 0 |
404 | 0 | Colton Gillies | 0 |
308 | 0 | Kyle DeCoste | 0 |
243 | 0 | Michael Latta | 0 |
157 | 0 | Denis Shvidki | 0 |
513 | 0 | Josh Beaulieu | 0 |
505 | 0 | Anthony Peluso | 0 |
534 | 0 | Chris Berti | 0 |
249 | 1 | Scott Gomez | 0 |
242 | 0 | Brent Gauvreau | 0 |
102 | 0 | Spencer Machacek | 0 |
389 | 0 | Brandon Segal | 0 |
75 | 0 | Quintin Laing | 1 |
186 | 0 | Zack Torquato | 0 |
346 | 0 | Brett Sonne | 0 |
10 | 1 | Bryan Little | 1 |
491 | 0 | Brad Voth | 0 |
320 | 0 | Ondrej Fiala | 0 |
457 | 0 | Shay Stephenson | 0 |
512 | 0 | Matt Tassone | 0 |
454 | 0 | Kris Hogg | 0 |
464 | 0 | Ryan Oulahen | 0 |
235 | 0 | Cameron Abney | 0 |
96 | 1 | Tyler Ennis | 0 |
245 | 0 | Nathan Barrett | 0 |
310 | 1 | Tyler Kennedy | 0 |
155 | 0 | Anthony Nigro | 0 |
401 | 0 | Cody Bass | 0 |
533 | 0 | Kris Newbury | 0 |
421 | 0 | Jonas Fiedler | 0 |
172 | 0 | Alex Hutchings | 0 |
333 | 0 | Tom Kostopoulos | 0 |
407 | 0 | Adam Berti | 0 |
162 | 0 | Chad Hinz | 0 |
15 | 1 | Dustin Brown | 1 |
508 | 0 | Richard Clune | 0 |
490 | 0 | Jordan Nolan | 0 |
285 | 0 | James Livingston | 0 |
379 | 0 | Shane Willis | 0 |
17 | 1 | Evander Kane | 1 |
319 | 0 | Kyle Chipchura | 0 |
278 | 0 | Brandon Prust | 0 |
506 | 0 | Brett Draney | 0 |
252 | 0 | Dustin Boyd | 0 |
463 | 0 | Andrew Peters | 0 |
45 | 1 | Oleg Saprykin | 0 |
445 | 0 | Adam Taylor | 0 |
519 | 0 | Matt Sommerfeld | 0 |
458 | 0 | Petja Pietiläinen | 0 |
179 | 0 | Justin McCrae | 0 |
236 | 0 | Colt King | 0 |
276 | 0 | Shay Stephenson | 0 |
175 | 0 | Cory Pecker | 0 |
90 | 0 | Corey Durocher | 0 |
550 | 0 | Cam Cunning | 0 |
46 | 1 | Jordan Staal | 1 |
140 | 1 | Eric Fehr | 0 |
298 | 0 | Michael Pelech | 0 |
21 | 1 | Chris Stewart | 1 |
118 | 1 | Kyle Brodziak | 0 |
89 | 0 | Sheldon Keefe | 0 |
340 | 0 | Eric Beaudoin | 0 |
239 | 0 | Stefan Legein | 0 |
165 | 0 | Jeff Lucky | 0 |
71 | 0 | Rico Fata | 1 |
450 | 0 | Ryan Milanovic | 0 |
329 | 0 | Garth Murray | 0 |
85 | 0 | Kiel McLeod | 0 |
250 | 0 | Warren McCutcheon | 0 |
555 | 0 | Joey Tenute | 0 |
188 | 0 | Cal O'Reilly | 0 |
108 | 1 | Jamie McGinn | 0 |
393 | 0 | Steven Crampton | 0 |
386 | 0 | Devin DiDiomete | 0 |
361 | 0 | Brad Twordik | 0 |
268 | 0 | Matt Kennedy | 0 |
132 | 0 | Derek Dorsett | 0 |
311 | 0 | Bryan Cameron | 0 |
313 | 0 | Kyle Freadrich | 0 |
414 | 0 | Ryan Held | 0 |
444 | 0 | Frazer McLaren | 0 |
524 | 0 | Jeremy Rondeau | 0 |
271 | 0 | Austin Watson | 0 |
284 | 0 | Garrett Bembridge | 0 |
542 | 0 | Marek Ivan | 0 |
312 | 0 | Preston Mizzi | 0 |
107 | 0 | Bobby Hughes | 0 |
382 | 0 | Geordie Wudrick | 0 |
494 | 0 | Scott Cameron | 0 |
219 | 1 | Wayne Simmonds | 0 |
1 | 1 | Tyler Seguin | 1 |
342 | 0 | Anton Borodkin | 0 |
159 | 1 | Troy Brouwer | 0 |
12 | 1 | Jeff Skinner | 1 |
343 | 0 | D.J. King | 0 |
142 | 0 | Christian Thomas | 0 |