TL;DR Using probability theory, the conclusion that Koskinen was a non-starter-quality NHL goalie could have been reached by game 37 of his career, shortly after he signed a huge deal at game 31.
This week I’ve gotten myself obsessed with Mikko Koskinen and the recent start of his 3-year, $13.5 million contract. I’ve long had a fascination with goalies (being a crude approximation of one myself) and especially with when you can tell that they’re either bad or good. Most goalies are viewed pretty skeptically for the first 3-5 years of their careers. The question of “Yeah, he was good last year, but is he actually *good*?” is (or at least should be) something that fans ask themselves about goalies in the early stages of their NHL careers.
Some history…
One of the early advanced stats axioms in hockey was that you needed about a 2500-shot sample size before getting a sense of a goalie’s actual talent level. I have no idea who came up with that originally, but I’ve always adhered to that general thought. 25-ish shots in a game → about 100 games before putting the thumbs up or down.
But that’s nagged at me a bit. If you wanted to wait 100 games on Koskinen, the player might be 32 before you’re willing to call it either way!
A few years ago I did a Lunchalytics presentation on trying to find a shot limit sample that balanced exploration and exploitation — giving the goalie enough time to prove himself but not too much that you can’t move on without torpedoing an entire season.
It was a surprisingly hard line to find, and the conclusion must have been unsatisfying enough that I can’t remember it off-hand. The 2500-shot barrier is still stuck in my head.
A new (and hilariously simple) approach
I was struck recently with a better/easier way to think about/visualize goalies and shots and goals — basically, thinking of a shot as a coin flip. Or, more accurately, a weighted coin flip. Or, mathematically, a Bernoulli trial. I’m sure someone has had this thought before (and likely written about it), but it was novel to me anyways.
Every time a shot hits the net, imagine a coin being flipped that’s heavily weighted to one side — so much so that it has a ~91% chance of landing on heads (save) and only a 9% chance of landing on tails (goal). Flip, flip, flip, etc. 25 shots in a game, flip, flip, flip. How many of those flips become tails (or goals)? Well, on average, about 2.25 I suppose (25 * 0.09). But some nights a goalie will let in 0 goals on those 25 shots. Some nights, he might let in 10. Hockey is a wild & chaotic game!
And how would you know the coin actually was weighted to a 91% save rate? Probably because I told you, sure, but how would you test whether you *believed* it was weighted at a 91% likelihood of the shot not going in? How do I know a goalie is a true 0.910 goalie?
String together enough coin flips (or Bernoulli trials) and you’ve got yourself a new distribution — what’s known as a binomial distribution. In English, the binomial distribution will tell us how many successes you’d expect after a certain number of trials (E[saves] = 25 shots * 91% chance of save = 22.75 saves) — but it can also tell us about the variability around that expectation (Var[saves] = 25 shots * 91% * 9% ≈ 2.05, which works out to a standard deviation of about 1.43 saves) that shapes the curve around the mean expectation of 22.75 saves.
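That arithmetic is simple enough to sketch in a few lines of Python (the function name is mine):

```python
import math

def binomial_save_stats(shots: int, sv_pct: float):
    """Expected saves, variance, and standard deviation when each
    shot is treated as an independent weighted coin flip."""
    expected = shots * sv_pct                  # E[saves] = n * p
    variance = shots * sv_pct * (1 - sv_pct)   # Var[saves] = n * p * (1 - p)
    return expected, variance, math.sqrt(variance)

e, var, sd = binomial_save_stats(25, 0.91)
print(round(e, 2), round(var, 4), round(sd, 2))  # 22.75 2.0475 1.43
```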
The chart above is the probability mass function for this particular binomial distribution. If a goalie really is a 0.910-talent goalie, the probability of him recording 25 saves in a 25-shot game (better known as a shutout) is actually 9.5%. He’s most likely to give up 2 goals (27.8% of the time), then 1 goal (23.4%), then 3 goals (21.1%), then 4 goals (11.5%), etc. It may look like everything before 17 saves is 0.0%, but I can assure you that there is a tiny probability attached to each number of saves — there is a 0.00000000000000000000000072% chance of a 0.910-talent goalie letting in 25 goals on 25 shots. The 10-goal scenario on 25 shots I mentioned above? It has a 0.0028% chance of happening every game, or about once every 36,105 games played (or once every 28 years given the current NHL schedule and there actually being a 0.910 goalie in net every game [hello Peter Ing]).
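Those percentages fall straight out of the binomial PMF, and you can reproduce them with a quick sketch (plain Python; the function name is mine):

```python
from math import comb

def save_pmf(saves: int, shots: int, sv_pct: float) -> float:
    """P(exactly `saves` saves on `shots` shots) for a true sv_pct
    goalie: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(shots, saves) * sv_pct**saves * (1 - sv_pct)**(shots - saves)

# A true 0.910 goalie facing 25 shots:
print(round(save_pmf(25, 25, 0.91), 3))  # shutout: 0.095
print(round(save_pmf(23, 25, 0.91), 3))  # 2 goals against: 0.278
```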
So what’s the point of all this? Well, theoretically, each experiment like this gives us a chance to set up a hypothesis test. We set the confidence level we want (let’s set alpha = 0.05, or the classic 95% confidence level) and the thing we want to test (null hypothesis: he’s a 0.910 goalie; alternative hypothesis: he’s worse than a 0.910 goalie). Now all I have to do is add up the scenarios that make up about 5% of potential observations on the low end. It turns out that the span of 0 saves – 19 saves encapsulates about 2.1% of all possibilities — the largest low-end tail that stays at or under our desired 5%. And that’s the entire test — if a goalie lets in 6 or more goals in a 25-shot game, we’re 97.9%+ confident that he’s worse than a true 0.910 goalie.
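The tail sum behind that test can be sketched the same way (a plain-Python helper of my own naming):

```python
from math import comb

def low_tail(max_saves: int, shots: int, sv_pct: float) -> float:
    """P(saves <= max_saves): the low-end tail of the binomial —
    the chance a true sv_pct goalie does this badly or worse."""
    return sum(comb(shots, k) * sv_pct**k * (1 - sv_pct)**(shots - k)
               for k in range(max_saves + 1))

# 19 or fewer saves (6+ goals) on 25 shots for a true 0.910 goalie:
print(round(low_tail(19, 25, 0.91), 3))  # 0.021 -- under alpha = 0.05
# Allowing one more save (20) overshoots the 5% threshold:
print(round(low_tail(20, 25, 0.91), 3))  # 0.069
```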
One game can tell us this, in theory. Of course, it’s not that clinical in real life. Goalies get gassed, sustain slight injuries, let embarrassment wash over them, etc and can implode on a 1 game scale. But I’d posit that applying this test to samples a bit larger than 1 game should really start to tell us something interesting.
Get to Koskinen please…
So why was I thinking about this in relation to Koskinen? Well, the Oilers signed him to a 3-year extension worth $13.5 million on Jan 21, 2019. This was after his 27th game of the 2018-19 season and the 31st of his career. That’s a fairly significant commitment to make to someone after 31 career games. And doesn’t that raise the question — was there any way to know after so few games that he was worth it? Or could we have proven that he wasn’t? If you couldn’t prove it either way yet, how long would it have taken?
Establishing a baseline ‘starting goalie’
The first thing we’ll have to do in this series is establish a baseline level of performance for a ‘starting’ NHL goalie. If you’re paying $4.5M per year for a goalie, he better be capable of being a decent starter in the NHL. Why do I need this? To perform the tests above, I need to start with a level I want to test against. In the toy example above I used 0.910 for simplicity — this is likely a level too low for today’s NHL starter. Whatever SV% I select will frame our testing with three conclusions — a) is this goalie better than that level, b) worse than that level, or c) we can’t prove he’s either better or worse than that level.
Picking this desired level of SV% is an inherently subjective exercise, so I’ll just simply state what I did and you can decide to agree or disagree.
Since the 2012-2013 lockout, there have been 62 goalies that have been granted the opportunity to face 2500+ shots in the NHL. These are all either starting goalies at one time or long-time trusted backups. Of these 62, I’m going to strip out the 20 goalies that have the worst SV% over that time — the dividing line is Jimmy Howard & Karri Ramo on the good side, Al Montoya & Jacob Markstrom on the bad side. I assure you that most of the 42 are pretty uncontroversial inclusions. So, I’m left with 42 goalies that constitute some image of what a ‘decent starting goalie’ looks like in the NHL. This cohort ranges between 0.911 and 0.922 since 2012-13.
Next, I’m going to calculate the combined SV% of these 42 goalies since the 2012-13 lockout. Turns out that this level is about 0.9162. This is going to be a pretty key number throughout this series, so let that sink in a bit. I’m implicitly saying that I want to know if a goalie shows evidence of being around that talent level or not. If he’s worse than that, then cut bait and try again with another goalie. If he’s better, lock him up at a reasonable price for a long, long time.
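To be clear about what “combined” means here: it’s total saves over total shots across the cohort (shot-weighted), not a simple average of each goalie’s SV%. A sketch with made-up numbers, not the real 42-goalie cohort:

```python
def combined_sv_pct(goalies):
    """Pool saves and shots across a cohort of (saves, shots) pairs.
    The data below is illustrative, not the actual cohort."""
    total_saves = sum(saves for saves, _ in goalies)
    total_shots = sum(shots for _, shots in goalies)
    return total_saves / total_shots

# A heavily-shelled 0.920 goalie outweighs a lightly-used 0.910 one:
print(round(combined_sv_pct([(2760, 3000), (910, 1000)]), 4))  # 0.9175
```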
The Koskinen Experiment
With that, we have all we need to set up the Koskinen binomial experiment. We’re going to look at his career-to-date SV% by game and test what the probability of a true 0.9162 SV% goalie posting those results (or worse) is — for math nerds: I’ll be finding where each point rests on his cumulative distribution function.
Let’s look at the above chart. In his first career game in Feb 2011, Koskinen posted a 0.808 SV%, or 21 saves on 26 shots — the point on the chart above is 6%. What does the 6% mean? It means that if a true 0.9162 goalie were to face 26 shots a massive number of consecutive times, he would only let in 5 goals or more 6% of the time by mere chance. After his first game, you can see how his 59-game career to date has progressed, applying this test after each game. Koskinen’s results peaked after career game 20, where a true 0.9162 goalie would have posted his career results or worse 60% of the time by mere chance. So when do you think the Oilers signed Koskinen to this contract?
Here’s the timeline: the Oilers signed him after career game 4, when a true 0.9162 goalie would have posted his career results or worse in only 7% of all alternate universes. After career game 31, they signed him to the huge new deal when a true 0.9162 goalie would have been putting up his results or worse in only 14% of all alternate universes. After career game 37, they traded away Cam Talbot, when Koskinen’s career-to-date 0.905 or worse would have been posted by a true 0.9162 goalie only 5% of the time. By season’s end, Koskinen’s career 0.904 SV% would have only been posted by a true 0.9162 goalie by chance 3.6% of the time.
When could we have known what Koskinen was? All we have to do is overlay our hypothesis zones:
- If a goalie gets above 95%, that means that a true 0.9162 goalie would post those results or better 5% (or less) of the time. It’s a rare event — rare enough that we can reject the null hypothesis that he’s a 0.9162 goalie and accept the alternate hypothesis that he’s better.
- If a goalie gets below 5%, the opposite is true — a true 0.9162 goalie would only post those results 5% (or less) of the time. A likewise rare event — rare enough that we have enough proof to suggest he’s worse than a true 0.9162 goalie.
- If he’s between 5% and 95%, we don’t have enough evidence either way, so we fail to reject the null hypothesis: we can’t prove he’s either better or worse than a true 0.9162 goalie. Here that is visualized.
You can see that through Koskinen’s career, he’s never really gotten close to having proof of being better than 0.9162 — he’s far away from the top rejection zone. However, he has hugged the bottom rejection zone for most of his career. He dipped under 10% in his first 5 games, touched 5% probability briefly after his 37th game, and ducked back under 5% for good after career game 56 on March 30, 2019.
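The three zones amount to a simple decision rule — sketched here (the outcome labels are mine):

```python
def classify(cdf_value: float, alpha: float = 0.05) -> str:
    """Map a goalie's position on the binomial CDF to the three
    hypothesis-test outcomes described above."""
    if cdf_value > 1 - alpha:
        return "evidence he's better than the baseline"
    if cdf_value < alpha:
        return "evidence he's worse than the baseline"
    return "not enough evidence either way"

print(classify(0.036))  # Koskinen's season-end 3.6% lands in the bottom zone
```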
Conclusion
We had ample evidence that Koskinen was not a career 0.9162 (or decent starter) by game 37. We had very clear evidence that he wasn’t by the end of the season.
And that’s really the point of this article. When the extension was signed, you at least had enough evidence to doubt that he was a starting-quality goalie. There was no reason to sign that deal in-season. If you feel that he really is a starter — let him prove it over the remaining 28 games he would play that year. If they had waited, they would have been able to witness his performance degrade to where it did by season’s end and made the proper judgement — that Koskinen is not a starting-quality goalie.
It’s a $13.5 million mistake. Well, to be fair, more like a $10.5 million mistake, since you’d have to pay a backup $1M/year anyways, but still.
This is simply one example of where having an analytics department could assist the Oilers. This one example would have saved them from making a $10.5 million mistake. How many more good decisions could be assisted, or bad decisions avoided? It’s not a stretch in the slightest to suggest that tens of millions of dollars are at stake. The knowledge is out there and eager to help.
I’ll be continuing this series soon, looking at cohorts of good and bad goalies and when we could have applied this test to reach those conclusions. I’ll also be looking at Free Agency 2019 and suggest which goalies were the best bets. I’ll also have a look at each team’s starting goalies and see what evidence they have of being a 0.9162 talent goalie.
1. “One of the early advanced stats axioms in hockey was that you needed about a 2500-shot sample size before getting a sense of a goalie’s actual talent level.”
It’s more appropriate to use “trials”; there is no “population” and no “samples”. Probability theory has no predictive pretence and is not strictly speaking a measure of talent; it simply states the probability of whether or not a result produced by a given set of trials has been arrived at by chance. It essentially has no predictive value in and of itself – any predictive value we assign to it is an opinion – a narrative construct that we lay on top of the numbers. “BRODEUR IS A FRAUD”.
2. “I was struck recently with a better/easier way to think about/visualize goalies and shots and goals — basically, thinking of a shot as a coin flip.” A two-sided coin, or a three-sided coin? Or maybe one of those D&D geek dice?
https://www.coppernblue.com/analysis-5/2015/10/5/9452123/ten-years-after-the-shootout
3. “Next, I’m going to calculate the combined SV% of these 42 goalies since the 2012-13 lockout. Turns out that this level is about 0.9162.”
I think you really have to look at the combined SV% of Oilers goalies since the lockout, or go back to 2007. I would not look across the league. Really, we have had sub-standard defence for that long — well, longer. We destroy goalies. Try as he might, Hitch was unable to get the Oilers to change their style of play very much. We play the exciting bad hockey we’ve always played (except for perhaps that magical Pronger year where we weren’t exciting but everyone was excited).
4. Your p-values will differ depending on the event space and the number of trials. What are you considering to be a trial? A Shot? A Fenwick? A Corsi? How many variables in the event space? Two or three or four? What kind of distribution will that give you? Binomial, trinomial?
shot = [save, goal]; Fenwick = [miss, save, goal]; Corsi = [block, miss, save, goal].
Each has a different event space, will produce different SV% and use a different distribution.
5. I believe that Talbot is currently a lower-risk signing than Koskinen, and that Talbot is going to have an explosive year in CGY – opinion based on what I see in the numbers. Koskinen’s results in his first season with the Oilers were, however, slightly better than Talbot’s. And this while he was behind a worse group of defenders (see the above link).
https://www.coppernblue.com/2019/6/29/19303291/ufa-goalies-by-the-numbers-edmonton-must-pursue-talbot
Mikko is an Allaire’s wet dream. He has a serious statistical advantage – or rather would have in the dead puck era where Kiprusoff was elite. But he doesn’t seem to have the mobility of other goalies, like Talbot. Mike Smith has both attributes (big body and very athletic) – and if he can pull it together mentally could still put up some good numbers. But now Mike has to deal with an Oilers D in front of him that makes piles of mistakes, doesn’t read plays well, and is generally playing at least one notch too high. Our defence is going to spend at least 10-20 games trying to get used to Mike Smith’s active puck play, if we get used to it at all.
1. Yep. It’s why we have hypothesis testing — some arbitrary threshold to accept or reject something. Doesn’t mean someone IS DEFINITELY not a starter — just likely not one. There’s a difference there.
3. While I’d agree that not every shot is the same, over longer samples this should even out. The Oilers would test that limit though, you’re right :). Dubnyk had some very reasonable years here, Scrivens sustained some hotness, and on the whole we’ve kind of employed marginal talent in net for a long time. Our poor defence is a factor though, and that context would be an enhancement to this very simple model.
4. For this, a shot on net is a trial. Only two outcomes: save or goal.
5. I agree that Talbot is a better bet. Likely a much better one. I’ll touch on that in this series at some point.
Also… I’m assuming that 0.9162 is based on 5v5 or even-strength?
Actually that 0.9162 is all situations. For this analysis I kept things as simple as possible just to introduce the method and output high-level results. Of course, an inordinate number of team PPs against would affect this assumption, as 4v5 SV% goes down significantly. Over the long run I’d expect that to even out, but in short time-frames concentrating on EVSV% would make sense.
Hey.
Interesting.
Does all this assume that, in the long run, all shots are shots and each team gives up a similar set of high/low quality shots per game?
Erik
Yep — I kept this as simple as possible. Shots are considered generic trials without taking into account any context. That context would be an obvious enhancement to this kind of analysis. Though I would say that the law of large numbers should take over at some point and that danger would even out somewhat.
Koskinen was 11th in the NHL in shutouts. If you ran the analysis with good rather than bad games he’d certainly look like a starter (at least not not like a starter). No?
The thing I’d most worry about with using games as a trial versus shots is that there are so few games in comparison (1/25th-ish). One thing you could do is try to prove that Koskinen had a higher number of shutouts than you’d expect, I suppose. But in the grand scheme of things, is a shutout and a blowout better or worse than simply two average games? From my perspective, I’d only really care about the aggregate number of shots/saves…
Can you please run the Analytics Department for the Oilers?