March is here, and flowers are blooming; so too are the hopes and dreams of fans of the 68 schools participating in this year’s NCAA Tournament: Will this be the year their college basketball team becomes the Cinderella story that dominates water cooler discussions around the country?
It’s March Madness time. For the next month, office productivity will drop as millions of computer monitors are repurposed from examining spreadsheets and building presentations to live streaming games.
The joy of March Madness lies in the nearly limitless possibilities offered by its large-field bracket format, and in the parity its inherent randomness creates between sports nuts and non-fans. Whether or not you know the Spartans reside in East Lansing, Michigan, not ancient Greece, you have an equal shot at correctly picking a crazy upset and earning bragging rights as your friends’ and coworkers’ brackets get busted.
It’s expected that 40 million people will fill out brackets this year. Some will follow expert advice, some will base selections on team mascot cuteness, and others will simply follow the seeds. But for those of us already drawn to sports analytics, March Madness offers an opportunity to double down on analysis, dig into data, and take a quantitative approach to understanding the tournament’s dynamic, which is exactly what we did at Pellucid.
How it was made:
We sifted through a wealth of advanced metrics to specify a statistical model for predicting the outcome of a matchup between any two Division I teams. Using the factors shown in the chart above, measured team by team, we regressed the margin of victory for every Division I game played this season against the differences in these metrics between the two competing teams. The resulting model explains over 50% of the variance in observed margins of victory, as measured by the final regression’s adjusted R-squared.
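To make the regression step concrete, here is a minimal sketch in Python. It assumes NumPy and runs on synthetic data; the function name, the three made-up metrics, and their weights are illustrative assumptions, not the actual model or its inputs.

```python
import numpy as np

def fit_margin_model(metric_diffs, margins):
    """Ordinary least squares of margin of victory on metric differences.

    metric_diffs: (n_games, n_metrics) array, team A's metric minus team B's.
    margins: (n_games,) array, team A's score minus team B's.
    Returns (coefficients with the intercept first, adjusted R-squared).
    """
    X = np.column_stack([np.ones(len(margins)), metric_diffs])
    coefs, *_ = np.linalg.lstsq(X, margins, rcond=None)
    residuals = margins - X @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((margins - margins.mean()) ** 2).sum())
    n, p = metric_diffs.shape
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalizes the fit for the number of predictors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return coefs, adj_r2

# Synthetic stand-in for a season of games (not real results).
rng = np.random.default_rng(0)
diffs = rng.normal(size=(500, 3))
true_weights = np.array([4.0, 2.0, 1.0])  # assumed, for illustration only
margins = diffs @ true_weights + rng.normal(scale=4.0, size=500)

coefs, adj_r2 = fit_margin_model(diffs, margins)
print(coefs, adj_r2)
```

Regressing on differences rather than on each team's raw metrics keeps the model symmetric: swapping the two teams flips the sign of every predictor, so apart from the intercept the predicted margin flips too.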
We then applied the model to this year’s field and every potential matchup, simulating the entire tourney a million times and counting how often each team advanced to each round (the probability of ultimately winning it all is shown on the last axis of our visualization). Our analysis determined the most likely result to be Kansas edging out Michigan State in the finals.
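The simulation step can be sketched along the following lines. This is a simplified illustration under stated assumptions: a predicted margin is turned into a win probability by treating the actual margin as normally distributed around it (the 11-point standard deviation is an assumption), the four team names and ratings are hypothetical, and the run uses 10,000 simulations rather than a million to keep it fast.

```python
import math
import random
from collections import defaultdict

def win_probability(predicted_margin, sigma=11.0):
    """P(team A wins) if the actual margin ~ Normal(predicted_margin, sigma)."""
    return 0.5 * (1.0 + math.erf(predicted_margin / (sigma * math.sqrt(2.0))))

def simulate_bracket(ratings, bracket, n_sims=10000, seed=0):
    """Estimate how often each team wins each round of a single-elimination bracket.

    ratings: dict of team -> rating on a points scale (hypothetical values).
    bracket: teams in bracket order; the length must be a power of two.
    Returns dict of team -> list of fractions, one entry per round won.
    """
    rng = random.Random(seed)
    n_rounds = int(math.log2(len(bracket)))
    wins = defaultdict(lambda: [0] * n_rounds)
    for _ in range(n_sims):
        alive = list(bracket)
        for rnd in range(n_rounds):
            survivors = []
            # Adjacent teams in the current list play each other.
            for a, b in zip(alive[0::2], alive[1::2]):
                p = win_probability(ratings[a] - ratings[b])
                winner = a if rng.random() < p else b
                wins[winner][rnd] += 1
                survivors.append(winner)
            alive = survivors
    return {team: [c / n_sims for c in wins[team]] for team in bracket}

# Hypothetical four-team field: A plays D and B plays C in the first round.
ratings = {"A": 10.0, "B": 6.0, "C": 3.0, "D": 0.0}
odds = simulate_bracket(ratings, ["A", "D", "B", "C"])
for team, by_round in odds.items():
    print(team, by_round)
```

The last entry in each team's list is its estimated chance of winning the whole bracket, which corresponds to the final axis of the visualization described above.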
For juicier upsets and our full set of picks, check out Pellucid’s “maximum-likelihood” bracket.
How it was calculated:
There are only so many ways to quantitatively predict the outcomes of sporting contests, so most team rating systems produce very similar rankings. The significant correlations among these systems make it difficult to include them all in a multivariable regression. As such, we focused on metrics that assess differing components of team success, the combination of which holistically summarizes a team’s strength and likelihood of winning a given game.
Pomeroy College Basketball
Ken Pomeroy, the godfather of NCAA basketball analytics, produces a set of computer ratings that reflect an expected win probability based primarily on scoring differential. While his Pythagorean win expectation model is probably the most widely cited metric associated with March Madness, Pomeroy also decomposes his team rankings into offensive and defensive efficiency metrics. Reflecting the average number of points scored and given up per 100 possessions, these metrics are adjusted for strength of opposition, providing a more accurate picture of the true quality of both the offense and defense of a given team. Finally, Pomeroy also publishes a tempo factor, which indicates how many possessions a team would use in a typical game and so captures each team’s general pace of play.
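For readers who want to see the arithmetic, the two ideas above can be sketched in a few lines of Python. This is only an illustration: the efficiencies here are raw (no adjustment for strength of opposition), the exponent of 10.25 is an assumed value for the Pythagorean formula, and the function names and sample numbers are hypothetical.

```python
def efficiency_per_100(points, possessions):
    """Points scored (or allowed) per 100 possessions."""
    return 100.0 * points / possessions

def pythagorean_win_pct(off_eff, def_eff, exponent=10.25):
    """Expected win fraction from offensive and defensive efficiency.

    The exponent is an assumption chosen for illustration.
    """
    return off_eff ** exponent / (off_eff ** exponent + def_eff ** exponent)

# Hypothetical game: 70 points scored and 63 allowed over 65 possessions.
offense = efficiency_per_100(70, 65)
defense = efficiency_per_100(63, 65)
print(offense, defense, pythagorean_win_pct(offense, defense))
```

A team that scores exactly as efficiently as it defends comes out at 0.5, and the expectation climbs quickly as the gap between the two efficiencies widens.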
Value Add Basketball Domino Ratings
A derivative of the Pomeroy ratings, Value Add ratings look at the value provided by the individual players on the roster. If the Pomeroy rating is a top-down, team-focused calculation, think of this as bottom-up. The individual value contributions of each player, both offensive and defensive, are aggregated to create a team rating.
Coach Historical Win Percentage
The simplest component, this is the career-to-date winning percentage of each team’s head coach. It is intended to measure the effect of coaches on game outcomes, such as the impact of legendary coach Tom Izzo. The full track record of a coach, beyond his or her current job, is included. For example, Roy Williams’ 418 wins and 0.805 winning percentage accumulated with Kansas go into estimating the effect he has on his current North Carolina Tar Heels’ games. The formula for winning percentage is wins divided by total games played.
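As a quick check of that formula, here is a small Python sketch. The function name is ours, and the loss count of 101 is not quoted above; it is the figure implied by 418 wins at a 0.805 clip (418 / 0.805 is roughly 519 games).

```python
def winning_percentage(wins, losses):
    """Wins divided by total games played."""
    return wins / (wins + losses)

# 418 wins is the Kansas figure quoted above; 101 losses is implied by it.
kansas_pct = winning_percentage(418, 101)
print(round(kansas_pct, 3))
```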
Sagarin Ratings Predictor Model
Jeff Sagarin’s predictor model is one of the more commonly cited metrics in basketball analytics. His model relies on margin of victory to determine a pure points estimate for each team and also takes into account the strength of opposition. It also discounts the benefit associated with large margins of victory, reflecting the fact that once a team is blowing out another, there is a negligible difference between a 30-point win and a 25-point win.
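The idea of discounting blowouts can be illustrated with a simple concave transform. To be clear, this is not Sagarin's formula, which is not described here; the tanh curve, the 15-point scale, and the function name are all assumptions chosen only to show the shape of the effect.

```python
import math

def discounted_margin(margin, scale=15.0):
    """Compress a raw margin so extra points in a blowout count for less.

    Near zero the output tracks the raw margin; far from zero it flattens
    toward +/- scale. Both the curve and the scale are illustrative.
    """
    return scale * math.tanh(margin / scale)

for raw in (0, 5, 25, 30):
    print(raw, round(discounted_margin(raw), 2))
```

Under this sketch, the gap between a 25-point and a 30-point win shrinks to a fraction of a point, while the gap between a tie and a 5-point win stays close to the full five points.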
For additional insights from the data that we haven’t shown here (for example, our simulation shows Kentucky getting screwed over by the draw), email me at firstname.lastname@example.org.
Pellucid blends technology and design to create beautiful, client-ready pitchbook content. Take a demo at www.pellucid.com.