15 min read

Attempting to Predict March Madness with Machine Learning

Alright folks, here we are. The first of (hopefully) many blog posts, and this one is on a project that has been an absolute blast. For my Advanced Sports Data course, my professor assigned us to build a model, using our choice of machine learning methods, that predicts the outcome of men’s college basketball games from prior data, and then to use that model to predict the results of this year’s NCAA tournament. We were taught how to create the model using various methods, but it was up to us to choose the statistics we felt had the best chance of determining the outcomes of games. At the time of writing, my bracket is actually doing pretty well, so here’s how I did it.

# Packages used below: tidyverse for read_csv()/dplyr/stringr, zoo for rollmean()
library(tidyverse)
library(zoo)

games <- read_csv("cbblogs1521.csv") %>% 
  mutate(
    # True shooting: points per shooting possession (FGA plus an FTA adjustment)
    TrueShooting = (TeamScore / (2 * (TeamFGA + 0.475 * TeamFTA))),
    # Possessions aren't in the box score, so estimate them from both teams' stats
    Possessions = .5*(TeamFGA - TeamOffRebounds + TeamTurnovers + (.475 * TeamFTA)) + .5*(OpponentFGA - OpponentOffRebounds + OpponentTurnovers + (.475 * OpponentFTA)),
    # Net rating: point differential per 100 possessions
    NetRating = (100 * ((TeamScore - OpponentScore) / Possessions))) %>%
  group_by(Team, Season) %>%
  mutate(
    # Four-game rolling average of true shooting, lagged one game so a
    # game never helps predict itself; cumulative season-long net rating
    Rolling_Mean_TrueShooting = rollmean(lag(TrueShooting, n = 1), k = 4, fill = TrueShooting),
    Cumulative_Mean_NetRating = cummean(NetRating)
    ) %>% ungroup() %>% 
  mutate(
    Location = case_when(
      str_trim(HomeAway) == "@" ~ "A",
      str_trim(HomeAway) == "N" ~ "N",
      TRUE ~ "H"
    ),
    Outcome = case_when(
      grepl("W", W_L) ~ "W", 
      grepl("L", W_L) ~ "L"
    )
  ) %>%
  mutate(Outcome = as.factor(Outcome))

I’m going to walk through some of the code I used. I’ll save you the pain of reading through all of it, but I’ll cover what I feel is worth knowing about. In this block I’m creating the variables that go into my model: you can see the formulas for TrueShooting and Possessions, which aren’t tracked in the data, so we use a formula to create an accurate estimate of each. I also use Net Rating, which builds on that Possessions estimate to create a rating based on the score differential for every game. You’ll notice the code for Rolling_Mean_TrueShooting and Cumulative_Mean_NetRating; all this means is that I’m using the average true shooting percentage from the last four games and the cumulative net rating over the course of the season. I felt rolling means for true shooting better represent hot (or cold) shooting streaks, while a cumulative net rating gives a full season’s worth of evaluation for each team. The rest is just creating metrics we need, like the location of games and the outcome of said games, which is pretty important considering that’s what we’re trying to predict.
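As a quick sanity check on those formulas, here’s the same math on a single made-up box score. All of the numbers below are invented purely for illustration:

```r
# Hypothetical single-game box score, just to walk through the formulas above
team_score <- 75; team_fga <- 55; team_fta <- 20
opp_score  <- 70; opp_fga  <- 58; opp_fta  <- 16
team_oreb  <- 10; team_tov <- 12
opp_oreb   <- 9;  opp_tov  <- 14

# Points per shooting possession (FGA plus the free-throw adjustment)
true_shooting <- team_score / (2 * (team_fga + 0.475 * team_fta))   # ~0.581

# Average the two teams' implied possession counts
possessions <- 0.5 * (team_fga - team_oreb + team_tov + 0.475 * team_fta) +
               0.5 * (opp_fga - opp_oreb + opp_tov + 0.475 * opp_fta)

# Point differential scaled to 100 possessions
net_rating <- 100 * (team_score - opp_score) / possessions
```

A 5-point win over roughly 68 estimated possessions works out to a net rating of a bit over +7 for this hypothetical game.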

rf_split <- initial_split(bothsides, prop = .8)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

Alright, into the fun stuff. Here I’m splitting the data we were provided, the game logs of every D1 college basketball game dating back to 2015, into 80% training and 20% testing data. This means our model will fit itself on 80% of the games, their stats, and their outcomes, and then we can test it on the remaining 20%, which the model has never seen before.
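For intuition, here’s roughly what initial_split() is doing under the hood, sketched in base R on a built-in stand-in dataset (the real rsample function also handles extras like stratified sampling):

```r
# Rough base-R sketch of an 80/20 split like initial_split(prop = .8)
set.seed(42)                                 # make the split reproducible
n <- nrow(mtcars)                            # 32 rows in this stand-in dataset
train_idx <- sample(n, size = floor(0.8 * n))
train <- mtcars[train_idx, ]                 # 25 rows the model fits on
test  <- mtcars[-train_idx, ]                # 7 held-out rows for evaluation
```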

rf_recipe <- 
  recipe(Outcome ~ ., data = rf_train) %>% 
  update_role(Team, Opponent, Date, Season, new_role = "ID") %>%
  step_normalize(all_predictors())

summary(rf_recipe)
## # A tibble: 11 x 4
##    variable                           type    role      source  
##    <chr>                              <chr>   <chr>     <chr>   
##  1 Season                             nominal ID        original
##  2 Team                               nominal ID        original
##  3 Date                               date    ID        original
##  4 Opponent                           nominal ID        original
##  5 Rolling_Mean_TrueShooting          numeric predictor original
##  6 Cumulative_Mean_NetRating          numeric predictor original
##  7 TeamSRS                            numeric predictor original
##  8 Opponent_Rolling_Mean_TrueShooting numeric predictor original
##  9 Opponent_Cumulative_Mean_NetRating numeric predictor original
## 10 OpponentSRS                        numeric predictor original
## 11 Outcome                            nominal outcome   original

Here’s where we can see what variables we’re using in our model, or our “recipe”. ID roles identify who was playing and when the game was played, predictor roles mark the variables used to make predictions, and the sole outcome role goes to the outcome of games. I use the metrics I mentioned earlier as well as Team and Opponent SRS, which are provided in the data.
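One recipe detail worth a note: the step_normalize() line centers and scales every predictor so they sit on comparable scales. In base R terms, it’s doing this to each predictor column (the values below are invented):

```r
# step_normalize() is equivalent to centering and scaling each column
x <- c(95, 102, 88, 110)           # a hypothetical column of predictor values
z <- (x - mean(x)) / sd(x)         # afterwards: mean 0, standard deviation 1
```

Random forests don’t strictly need normalized inputs since tree splits are scale-invariant, but it doesn’t hurt, and it keeps the recipe reusable for model types that do care.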

rf_mod <- 
  rand_forest() %>% 
  set_engine("ranger") %>%
  set_mode("classification")

We had our choice of a couple different methods to use to create our model, and I chose a random forest. A random forest is a large collection of decision trees used to predict the outcome of games based on, essentially, a bunch of if/else statements. Put simply: if this stat is less than a certain amount, this is the result; if the stat is greater than or equal to that amount, that is the result. Each one of these statements is a “branch” on the decision “tree”. By using a large number of these decision trees, we create a large number of predictions, and the most common of those predictions gets used. In our case, the prediction is the outcome of the game.
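To make that voting idea concrete, here’s a toy version with three hand-written “trees”. The rules and thresholds below are invented for illustration; the real forest learns hundreds of trees, each from a random slice of the training data and predictors:

```r
# Hypothetical stat differences for one matchup (our team minus opponent)
net_rating_diff <- 4.2     # cumulative net rating difference
ts_diff         <- -0.010  # rolling true-shooting difference
srs_diff        <- 2.5     # SRS difference

# Each crude "tree" is a single if/else branch on one stat
tree1 <- if (net_rating_diff > 0)  "W" else "L"
tree2 <- if (ts_diff > 0.02)       "W" else "L"
tree3 <- if (srs_diff > -1)        "W" else "L"

# The forest's prediction is the majority vote across trees
votes <- c(tree1, tree2, tree3)
prediction <- names(which.max(table(votes)))   # "W" here, by a 2-to-1 vote
```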

metrics(trainpredict, Outcome, .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.981
## 2 kap      binary         0.962

After the creation and fitting of our model, here’s how it did against the data it used to train itself. Obviously you expect it to be pretty accurate, because it’s predicting games it had the data for, but 98% accuracy is still pretty solid.

trainpredict %>%
  conf_mat(Outcome, .pred_class)
##           Truth
## Prediction     L     W
##          L 18206   362
##          W   349 18232

Here’s a confusion matrix, basically showing the cases where our model predicted the outcomes correctly and incorrectly. In the case of losses, it predicted a loss and was correct 18,206 times, while the actual result was a win only 362 times. Not bad.
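That 98% figure falls straight out of this matrix: accuracy is just the diagonal (the correctly predicted games) divided by the total number of games.

```r
# Rebuild the confusion matrix above and recover the accuracy metric
cm <- matrix(c(18206, 349, 362, 18232), nrow = 2,
             dimnames = list(Prediction = c("L", "W"), Truth = c("L", "W")))
accuracy <- sum(diag(cm)) / sum(cm)   # (18206 + 18232) / 37149, about 0.981
```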

metrics(testpredict, Outcome, .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.740
## 2 kap      binary         0.480

Now for the testing metrics. Remember, the model hasn’t seen this data, so we can’t expect it to be as accurate. Still, against games it had never seen before, it correctly predicted the outcome 74% of the time, which is pretty solid.
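For anyone wondering about the kap row: that’s Cohen’s kappa, which rescales accuracy against what blind guessing would get. Since wins and losses are essentially 50/50 in this data, chance agreement is about 0.5, and a quick back-of-the-envelope check lines up with the 0.480 reported above:

```r
# Cohen's kappa: how far above chance agreement the accuracy sits
accuracy <- 0.740
chance   <- 0.5                                  # W and L are roughly balanced
kappa    <- (accuracy - chance) / (1 - chance)   # 0.48, matching `kap` above
```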

It’s time to start applying our model to this year’s tournament. Now remember, I created this bracket prior to the tournament starting. Let’s start with the play-in games.

playinround %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Norfolk State W Appalachian State 0.4798516
Wichita State L Drake 0.8255373
Mount St. Mary’s W Texas Southern 0.2348040
Michigan State L UCLA 0.8754524

So looking back at the First Four, we predicted Norfolk State had a 52% chance of beating Appalachian State, Wichita State had an 82% chance of losing to Drake, Mount St. Mary’s had about a 77% chance of beating Texas Southern, and Michigan State had an 87% chance of losing to UCLA. Our model got 3/4 of those right (hey, check that out, 75% accuracy is pretty close to the 74% I mentioned before); it missed the Texas Southern win, though. So now that you know how these tables are read, let’s start working on the West region.
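One note on reading these tables: .pred_L is the model’s probability that the team in the Team column loses, so the win probabilities I quote are just its complement.

```r
# The play-in predictions above, with .pred_L flipped into a win probability
playin <- data.frame(
  Team   = c("Norfolk State", "Wichita State", "Mount St. Mary's", "Michigan State"),
  pred_L = c(0.4798516, 0.8255373, 0.2348040, 0.8754524))
playin$win_prob <- round(1 - playin$pred_L, 2)   # 0.52, 0.17, 0.77, 0.12
```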

Before we get too far into this, I wanted to mention that our class motto for our online tournament challenge was “Trust your model, coward”, and I stuck to this. If my model had a team winning by the slightest of margins, I went with it. No gut predictions, no overruling the absurd outcomes, but hey, it’s March, nothing is too absurd, right?

westregional %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Creighton W UC-Santa Barbara 0.3446714
Southern California W Drake 0.3355889
Kansas W Eastern Washington 0.4604135
Iowa W Grand Canyon 0.2815111
Oklahoma W Missouri 0.4916802
Gonzaga W Norfolk State 0.1672595
Virginia W Ohio 0.1916960
Oregon L Virginia Commonwealth 0.8134913

So looking back, I missed two games in the first round of the West. My model liked the favorites for the most part, so it missed the Virginia-Ohio upset. Also worth noting: the Oregon-VCU game didn’t actually get played, so I could have had that upset right, but hey, we’ll never know. Pretty solid predictions, but nothing too crazy. Just you wait.

westregionalsecond %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Virginia Commonwealth L Iowa 0.7075484
Southern California W Kansas 0.4111571
Gonzaga W Oklahoma 0.3946532
Creighton W Virginia 0.4878429

For the Round of 32 in the West, I had VCU losing to Iowa, which for all we know could have happened, but VCU had to forfeit due to COVID, so Oregon went on and ended up beating Iowa. But would you look at that, I was correct on three quarters of the games again, most notably the USC upset over Kansas.

westregionalthird %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Gonzaga W Creighton 0.3309778
Iowa W Southern California 0.4358532

And now we’re caught up to the Sweet Sixteen in the West. Iowa is no longer in the running, so having them going to the Elite Eight hurts, but I have Gonzaga beating Creighton and moving on.

westregionalfinal %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Gonzaga W Iowa 0.4246421

Well, who would have guessed: my model has Gonzaga winning the West. No surprise, I guess, but looking back, it would have been interesting to see how it would have done had that VCU-Oregon game happened. Then again, Oregon has been on a hot streak, so it probably wouldn’t have changed anything. Time for the East region.

eastregional %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Texas L Abilene Christian 0.5379310
Colorado W Georgetown 0.4285944
Alabama W Iona 0.1698706
Connecticut L Maryland 0.5218484
Michigan W Mount St. Mary’s 0.1454325
Florida State W North Carolina-Greensboro 0.1956706
Louisiana State W St. Bonaventure 0.4545198
Brigham Young W UCLA 0.2636048

Not sure why this table decided to be out of order, but right off the bat, there’s something big brewing. My model says Texas has a 53% chance of losing to Abilene Christian. I genuinely laughed when I first saw this, but hey, trust the model, right? Safe to say I’m glad I did. Other than that, my model was correct on Maryland’s upset of UConn, and it only missed one game in the first round, the UCLA upset of BYU, which ends up hurting quite a bit. But hey, 7 for 8 in the first round is pretty freaking good, especially with that massive upset of Texas.

eastregionalsecond %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Brigham Young W Abilene Christian 0.3923778
Maryland L Alabama 0.7643921
Colorado L Florida State 0.7289714
Michigan W Louisiana State 0.3804905

East Round of 32: we got three quarters of the games right again. UCLA beat Abilene Christian in place of BYU, which just means I had a team that isn’t even in the tournament anymore going farther in. Other than that, no surprises.

eastregionalthird %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Brigham Young L Alabama 0.5938540
Colorado L Michigan 0.7228444

Sweet Sixteen in the East: I’ve got BYU losing to Alabama and Colorado losing to Michigan. No surprises, but looking back, I realized there was a mistake in my original code. My professor provided an outline for us to work off of, and I forgot to change UConn to Alabama, so I had UConn playing BYU in this round and losing. Well, crap. I fixed it, and the model actually has Alabama beating BYU, which would have put Alabama in the Elite Eight instead of BYU. Now I’m hoping UCLA pulls off the upset against Alabama to prevent another miss and potential points for my opponents in the class; guess we’ll have to see what happens. Those are missed points for me, but it’s in the past. That’s why double-checking is a good idea.

eastregionalfinal %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Michigan W Brigham Young 0.4736921

Luckily my model has Michigan winning the East. Another 1 seed, but another prediction I’m pretty happy with. Let’s move on to the south region.

southregional %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Arkansas L Colgate 0.7574349
Baylor W Hartford 0.1316222
Purdue L North Texas 0.5319921
Ohio State W Oral Roberts 0.2043063
Texas Tech L Utah State 0.6377397
Florida L Virginia Tech 0.7230071
Villanova W Winthrop 0.3201913
North Carolina L Wisconsin 0.5957056

The South is where my model didn’t do so hot, and it’s probably the region it has done the worst in so far. It did have North Texas beating Purdue, which I was pretty proud of, but it also had Colgate, Utah State, and Virginia Tech upsetting their opponents, which were all misses. For a little background, Colgate was loved by many of my classmates’ models. Statistically speaking, they should have gone deep into the tournament. Unfortunately, Arkansas didn’t let that happen.

southregionalsecond %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Utah State L Colgate 0.6580722
Villanova W North Texas 0.3512397
Virginia Tech L Ohio State 0.5950071
Baylor W Wisconsin 0.4049024

At this point, I’m basically half wrong already, but at least I predicted the winners correctly for the Baylor-Wisconsin and Villanova-North Texas games. Not much else to say, except maybe some spiteful words for Oral Roberts and Colgate for ruining everything.

southregionalthird %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Colgate W Ohio State 0.3635714
Baylor W Villanova 0.4004643

All caught up. Like I said, my model, along with many others, liked Colgate a lot. It also has Baylor winning against Villanova; we’ll have to wait and see what happens.

southregionalfinal %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Baylor W Colgate 0.4273556

Ultimately, Baylor wins the South. Some people in my class had Colgate going even farther; luckily, I was able to prevent any further damage, but like I said, I didn’t do too well in the South overall. To this point, we’ve got the one seeds winning their respective regions across the board. Not for long.

midwestregional %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Houston W Cleveland State 0.2417548
Illinois W Drexel 0.2225770
Loyola (IL) W Georgia Tech 0.3619310
Oklahoma State W Liberty 0.4632810
West Virginia W Morehead State 0.2275302
Tennessee W Oregon State 0.2052960
Clemson L Rutgers 0.5888278
San Diego State W Syracuse 0.2197651

Time for the Midwest. My model called only one upset, Rutgers over Clemson, which was correct. Unfortunately, Oregon State and Syracuse had to go and pull off their own upsets, but again, I got three quarters of the games right. Not too bad, if I say so myself.

midwestregionalsecond %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Rutgers L Houston 0.7347921
Illinois L Loyola (IL) 0.5471960
Tennessee W Oklahoma State 0.3104111
San Diego State W West Virginia 0.1480587

Here’s where it gets interesting. My model liked Loyola, and rightly so. Down goes the first one seed, and what do you know, the model was right. Must have taken Sister Jean’s prayers into account as a predictor variable. Unfortunately, my model liked San Diego State quite a bit, enough to have them knock off the three seed, even though they had already lost to Syracuse. Oregon State also had to go and mess everything up again, because my model had Tennessee moving on to the Sweet Sixteen.

midwestregionalthird %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
San Diego State L Houston 0.5743786
Loyola (IL) W Tennessee 0.4593992

It’s at this point I realize I’ve messed up again. In my model, Houston moves on. For some reason on my online bracket, I have San Diego State in the Elite Eight. Well, that hurts considering Houston is still in this thing. Loyola moves on, let’s see how much my error screws me up.

midwestregionalfinal %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Loyola (IL) L Houston 0.5836317

Ouch. My model says Houston wins. Problem with that, I had Loyola playing SDSU in the Midwest final and winning the region. So, looking forward, this mistake could really hurt me, but it could also help me if Loyola pulls it off. Onto the Final Four.

finalfourresults %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Baylor W Loyola (IL) 0.3792516
Gonzaga W Michigan 0.4289167

Gonzaga-Michigan and Baylor-Loyola for the Final Four. Still attainable, and my model has Gonzaga and Baylor facing off in the championship. In case you were curious, if it hadn’t been for my error, Baylor loses to Houston and we have a Gonzaga-Houston final instead.

champs %>% select(Team, .pred_class, Opponent, .pred_L) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed"))
Team .pred_class Opponent .pred_L
Gonzaga W Baylor 0.3790897

My model says Gonzaga’s got a 62% chance of beating Baylor and winning the national championship. It also had them beating Houston, so the final result of my bracket stays the same regardless.

So looking back, I made some human errors that messed up what is already a pretty decent bracket. Prior to today’s games, I’m in the top 0.6% of ESPN Tournament Challenge brackets and tied for first in the class. I’m now wondering what could have been; a blog post that was supposed to be bragging about my pretty awesome bracket is now just pointing out that it could have been better. So, here’s where I’m at so far.

https://fantasy.espn.com/tournament-challenge-bracket/2021/en/entry?entryID=48742608

In order to stay in first in my class, I need UCLA and Syracuse to pull off upsets against Alabama and Houston. Not impossible, but not exactly probable either. Reminder: I would have had Houston winning the Midwest if not for my mistake. I still think my final outcome is very attainable, and I’m looking forward to seeing how it all shakes out. I’m planning on posting an update after this chaos is over.

Thanks for reading my first post. I appreciate those of you who took the time to make it this far. I wish you all the best of luck with your brackets (although I still want mine to be better). Have a good one!

-B