Hi, internet friends. It’s been a while! A lot has happened in Parkatti-land since I last looked at hockey stats a few years ago. I’ve furthered my career in analytics, started a family, and talked the dog out of mounting young children (for the most part).
Honestly, I haven’t watched many Oilers games, or many hockey games of any kind. I paid attention in Spring 2017 during the playoff mini-run. I guess I really had to come to grips with something I loved so much bringing me a lot of stress and bother. But the imprint is still there, burning quietly.
I’ve also spent the last few years getting much, much deeper into data science. It’s not quite to the level of doing Rocky IV dragon-flags in a Siberian cabin, but I’ve certainly grown a lot more confident in the things I’m working on and the tools in my toolbelt.
Recently, I’ve picked up some old datasets I’d worked on for the NHL Draft. I’d spent a decent portion of my life wrapping my head around how you could predict a 17-year-old’s late pubescent development in hockey. It’s kind of fun, looking back at old work with equal amounts of pride and terror. But I’m seeing the world through an entirely different set of eyes these days. I kind of want to see how far I can take it.
So, I’ll be undertaking a series on applying Machine Learning to the NHL Draft. To make this somewhat non-insane given my schedule, I’ll be concentrating solely on CHL forwards for this series. My dataset includes 559 OHL and WHL drafted forwards between 1996 and 2010. During this period, you can find on-ice goal data for junior players. I’ll be drawing on about 6 years of deep thinking and work I’d done (working in tandem with Dark Horse Analytics) to create features for this dataset. In all, there are 57 variables I’d built over time for each of these 559 samples.
The setup is relatively simple — how can we build a prediction machine that uses the careers of past players to predict the future careers of current players? I’ll start simply and, with each article, go deeper and deeper down the rabbit hole. If an NHL team out there would rather keep this kind of work to themselves — my email address is at the top right of the page.
How will I be judging an ‘NHL career’? I’ve thought a lot about this — in the end, I really don’t think the response variable needs to be overly complicated. All we need to know is whether they had an NHL-worthy career or not — a binary 1 or 0 — to classify good from bad. You could use GVT (represent!), games played, points, # of trips over the blueline, etc etc. Honestly, it doesn’t really matter that much. Give me any evidence they were in the NHL! For this analysis, I’ll classify any player that tallied 75 NHL points within 7 years of their draft year as a “1”, and anyone else as a “0”.
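In code, the labeling rule is one line. Here’s a minimal sketch — the DataFrame, column names, and point totals are all hypothetical stand-ins for the real dataset:

```python
import pandas as pd

# Hypothetical per-player NHL point totals through draft year + 7
players = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "nhl_points_within_7": [120, 40, 75],
})

# "1" if the player tallied 75+ NHL points within 7 years of their draft year
players["nhl_career"] = (players["nhl_points_within_7"] >= 75).astype(int)
```

I’m treating “tallied 75 points” as inclusive (75 or more counts as a “1”).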
Mind you, this means players like Boyd Gordon and Jason Chimera will be zeroes. Good players, sure, but they flowered later in their careers, and were available for meagre asset prices on the open market. You can buy a 27-year-old Boyd Gordon. You generally need to draft your young stars. Why time-bound to 7 years? It boils down to the time value of talent. I want to be good sooner rather than later — I could draft 7 Martin St. Louises this year, but approximately 15% of my current fanbase might be dead before we win that glorious championship. In short: we want good players, now.
Using this definition, my dataset includes 104 NHL players out of the 559 CHL forwards drafted during this time period (about 18.6%). Basically, a feral monkey could have picked an NHL player with one out of every 5-6 CHL forwards selected. This is pretty close to the long-run average of about 1 NHL player per team per draft.
Of course, there’s an upper limit to NHL players you can draft, namely there’s only so much turnover in an NHL roster each year. But just imagine what you could do if you averaged even 2 NHL players every, single, draft. Double the cheap talent, one less overpriced aging superstar you need to bid on, one more asset you can use to fill your weaknesses. Goodness begets goodness.
Today, I’ll close off by showing you what this data looks like. If I could see all 559 players across all 57 variables in one simple diagram, what would that look like?
To accomplish this task, I’ll be using t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of my dataset, purely for visualization purposes. This technique lets us see all 57 dimensions in just two — an x- and a y-axis.
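For the curious, the t-SNE step itself is just a few lines with scikit-learn. This is a sketch, not my actual pipeline — random numbers stand in for the real 559×57 feature matrix, and the perplexity value is just scikit-learn’s default:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the real dataset: 559 players x 57 engineered features
rng = np.random.default_rng(0)
X = rng.normal(size=(559, 57))

# Collapse 57 dimensions down to 2, purely for plotting
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (559, 2) — one (x, y) coordinate per player
```

Each row of `embedding` becomes one dot on the scatter plot below.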
Don’t worry about the axis labels — they’re quite useless. What’s important is the shape and relative distances. This is our starting point. When an analyst or a machine begins, they begin here: looking at a bunch of data. Each dot is a player, and each dot’s location is a reflection of the variance contained within each dimension of that player. Similar players clump together, dissimilar players are further apart.
Imagine sitting at a draft table, throwing darts at central scouting’s list. If you’d picked a CHL forward during this time period, you’d have roughly a one-in-six chance that your dot was a real NHL player. Seems like a daunting task, save for the idea that 30 other teams are looking at the same blank slate.
The task is to create colour among the dots — the ability to differentiate between good dots and bad dots separates Stanley Cup winners from perennial losers, quite literally. How do you decide which to pick?
Well, keep in mind that this is past data. We know how the story went.
It went something like the above. Pink dots turned out to be NHL players, blue dots turned out to be nothing.
This is where we’ll start. Next time.