March Madness 2019

PUBLISHED ON MAR 21, 2019

Well folks, it’s that time of year: March Madness, when millions of sports fans around the globe are humbled by just how wrong they can be. Typically I print out a bracket and make fly-by-night predictions based on nothing more than personal bias, but I’m a responsible human being, and since my alma mater came nowhere near clinching that one coveted Ivy League spot, I have no reason not to be strictly objective.

Like most data scientists, I’m a huge fan of Kaggle, a platform where data lovers everywhere compete by submitting predictions on everything from home prices, to cancer detection, to stock market movements. Luckily Kaggle, with the help of data enthusiasts like Kenneth Massey, has compiled an extraordinary dataset on everything college basketball dating back to 1985.

So using this data, can we predict outcomes for this year’s competition? Let’s find out.

To get started, let’s decide on the frameworks we want to use.

from sklearn import model_selection, linear_model

Scikit-learn is a popular machine learning library for Python.

import pandas as pd

Pandas is a helpful data structure and analysis tool.

import numpy

NumPy will help us manipulate multi-dimensional arrays and matrices.

import math
import csv
import random

Lastly, a few standard-library modules to help with math, file handling, and randomness.

The Elo rating system was created by Arpad Elo, a Hungarian-American physics professor who wanted a method for calculating the relative skill levels of players in zero-sum games such as chess. In it, the difference in the ratings between two players serves as a predictor of the outcome of a match. A player’s Elo rating is represented by a number which increases or decreases depending on the outcome of games between rated players. After every game, the winning player takes points from the losing one. Seems like a good framework for sports. Let’s write a function for it.

def calc_elo(winning_team, losing_team, year):
  winner_rank = return_elo(year, winning_team)
  loser_rank = return_elo(year, losing_team)

  # Expected score (win probability) for the winner under the Elo formula
  rank_diff = winner_rank - loser_rank
  exp = (rank_diff * -1) / 400
  odds = 1 / (1 + math.pow(10, exp))

  # The K-factor shrinks as a rating grows, so established ratings move less
  if winner_rank < 2100:
    k = 32
  elif winner_rank >= 2100 and winner_rank < 2400:
    k = 24
  else:
    k = 16

  # The winner gains what the loser gives up
  new_winner_rank = round(winner_rank + (k * (1 - odds)))
  new_rank_diff = new_winner_rank - winner_rank
  new_loser_rank = loser_rank - new_rank_diff

  return new_winner_rank, new_loser_rank
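As a quick sanity check on the expected-score formula (a self-contained toy sketch, separate from the ratings tables used above):

```python
import math

def elo_expected(r_a, r_b):
    # Probability that the player rated r_a beats the player rated r_b
    return 1 / (1 + math.pow(10, (r_b - r_a) / 400))

# Equal ratings imply a coin flip; a 400-point edge implies roughly 10-to-1 odds.
print(elo_expected(1600, 1600))               # 0.5
print(round(elo_expected(2000, 1600), 3))     # 0.909
```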

And here’s a method for quickly accessing a team’s Elo rating for a particular year.

def return_elo(year, team):
  try:
    return team_elos[year][team]
  except KeyError:
    try:
      # Carry a team's rating forward from the previous season
      team_elos[year][team] = team_elos[year-1][team]
      return team_elos[year][team]
    except KeyError:
      # Teams we haven't seen before start at a base rating
      team_elos[year][team] = 1600
      return team_elos[year][team]
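One gap worth flagging: `team_elos` (and `team_stats`, used further down) are never initialized in the post. A minimal sketch that works with the lookup above, assuming nested dicts keyed by season:

```python
from collections import defaultdict

# Outer key: season; inner key: team ID. defaultdict spares us from
# checking whether a season has been seen before.
team_elos = defaultdict(dict)
team_stats = defaultdict(dict)
```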

We’ll need a way to predict the outcome of a matchup.

def predict_winner(t1, t2, model, year, stat_fields):
  features = []

  # Team 1's Elo rating followed by its averaged box-score stats
  features.append(return_elo(year, t1))
  for stat in stat_fields:
    features.append(get_stat(year, t1, stat))

  # Team 2, in the same order
  features.append(return_elo(year, t2))
  for stat in stat_fields:
    features.append(get_stat(year, t2, stat))

  return model.predict_proba([features])

We’ll need a function we can call to update stats. Since we don’t know how many recent games we should consider, I’ll leave that as a parameter to tune at the end.

def update_stats(season, team, fields, n_games):
  if team not in team_stats[season]:
    team_stats[season][team] = {}

  for key, value in fields.items():
    if key not in team_stats[season][team]:
      team_stats[season][team][key] = []

    # Keep a rolling window of the last n_games values
    if len(team_stats[season][team][key]) >= n_games:
      team_stats[season][team][key].pop(0)
    team_stats[season][team][key].append(value)
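To see the rolling window in isolation (a standalone toy example, not part of the model code):

```python
# Keep only the last n_games values, dropping the oldest as new games arrive.
n_games = 3
window = []
for score in [70, 80, 90, 100]:
    if len(window) >= n_games:
        window.pop(0)  # discard the oldest game
    window.append(score)

print(window)  # [80, 90, 100]
```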

To get individual stats, we’ll need a function that returns the average over the last n_games or 0 if it doesn’t exist.

def get_stat(season, team, field):
  try:
    l = team_stats[season][team][field]
    return sum(l) / float(len(l))
  except (KeyError, ZeroDivisionError):
    return 0

Next, load the teams data in order to initialize the team ID map.

def build_team_dict():
  team_ids = pd.read_csv(path + '/Teams.csv')
  team_id_map = {}
  for i, row in team_ids.iterrows():
    team_id_map[row.TeamID] = row.TeamName
  return team_id_map
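For what it’s worth, pandas can build the same map without an explicit loop (shown here on a tiny stand-in frame, since Teams.csv isn’t included in the post):

```python
import pandas as pd

# Stand-in for the real Teams.csv
team_ids = pd.DataFrame({'TeamID': [1113, 1385],
                         'TeamName': ['Arizona St', "St John's"]})
team_id_map = team_ids.set_index('TeamID')['TeamName'].to_dict()
print(team_id_map)  # {1113: 'Arizona St', 1385: "St John's"}
```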

And lastly, let’s decide on exactly which stats we want to incorporate and build our model using those features and the Elo ratings. From the detailed results data, we can retrieve winning and losing team scores, field goals made, field goals attempted, 3-pointers made, 3-pointers attempted, offensive rebounds, defensive rebounds, assists, turnovers, steals, and blocks. Sounds like a good box score to make predictions from. Putting it all together…
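One gap in the code: `stat_fields` is passed around everywhere but never defined. Judging from the dictionary keys used when updating stats, it is presumably something like:

```python
# Assumed from the box-score keys used in update_stats below
stat_fields = ['score', 'fga', 'fgp', 'fga3', '3pp', 'ftp',
               'or', 'dr', 'ast', 'to', 'stl', 'blk', 'pf']
```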

def build_season_data(all_data, n_games):
  X = []
  y = []
  for i, row in all_data.iterrows():
    skip = 0

    t1_elo = return_elo(row['Season'], row.WTeamID)
    t2_elo = return_elo(row['Season'], row.LTeamID)
    
    t1_features = [t1_elo]
    t2_features = [t2_elo]

    for field in stat_fields:
      t1_stat = get_stat(row['Season'], row['WTeamID'], field)
      t2_stat = get_stat(row['Season'], row['LTeamID'], field)
      if t1_stat != 0 and t2_stat != 0:
        t1_features.append(t1_stat)
        t2_features.append(t2_stat)
      else:
        skip = 1

    if skip == 0:  
      if random.random() > 0.5:
        X.append(t1_features + t2_features)
        y.append(0)
      else:
        X.append(t2_features + t1_features)
        y.append(1)

    if row['WFTA'] != 0 and row['LFTA'] != 0 and row['WFGA3'] != 0 and row['LFGA3'] != 0:
      stat_1_fields = {
        'score': row['WScore'],
        'fgp': row['WFGM'] / row['WFGA'] * 100,
        'fga': row['WFGA'],
        'fga3': row['WFGA3'],
        '3pp': row['WFGM3'] / row['WFGA3'] * 100,
        'ftp': row['WFTM'] / row['WFTA'] * 100,
        'or': row['WOR'],
        'dr': row['WDR'],
        'ast': row['WAst'],
        'to': row['WTO'],
        'stl': row['WStl'],
        'blk': row['WBlk'],
        'pf': row['WPF']
      }
      stat_2_fields = {
        'score': row['LScore'],
        'fgp': row['LFGM'] / row['LFGA'] * 100,
        'fga': row['LFGA'],
        'fga3': row['LFGA3'],
        '3pp': row['LFGM3'] / row['LFGA3'] * 100,
        'ftp': row['LFTM'] / row['LFTA'] * 100,
        'or': row['LOR'],
        'dr': row['LDR'],
        'ast': row['LAst'],
        'to': row['LTO'],
        'stl': row['LStl'],
        'blk': row['LBlk'],
        'pf': row['LPF']
      }
      update_stats(row['Season'], row['WTeamID'], stat_1_fields, n_games)
      update_stats(row['Season'], row['LTeamID'], stat_2_fields, n_games)

    new_winner_rank, new_loser_rank = calc_elo(
      row['WTeamID'], row['LTeamID'], row['Season'])
    team_elos[row['Season']][row['WTeamID']] = new_winner_rank
    team_elos[row['Season']][row['LTeamID']] = new_loser_rank

  return X, y

Now let’s load the detailed regular-season and tournament results and build the dataset we need to train our model.

all_data = pd.concat([pd.read_csv(path + '/RegularSeasonDetailedResults.csv'), pd.read_csv(path + '/NCAATourneyDetailedResults.csv')])

We need to choose how many past games to look back over when averaging each team’s stats. Let’s start with 10.

X, y = build_season_data(all_data, n_games=10)
model = linear_model.LogisticRegression()
score = model_selection.cross_val_score(model, numpy.array(X), numpy.array(y)).mean()
print(score)
0.7662501804389397

76.6% isn’t terrible, so let’s fit our model and use it to predict future outcomes. The first two games are play-ins: North Carolina Central vs. North Dakota State, and St. John’s vs. Arizona State. Let’s grab their IDs and try it out.

model.fit(X, y)
prediction = predict_winner(1113, 1385, model, 2019, stat_fields)
print(prediction[0][0])
0.570812
prediction = predict_winner(1295, 1300, model, 2019, stat_fields)
print(prediction[0][0])
0.796609

Our model predicts Arizona State beats St. John’s with a likelihood of 57% and North Dakota State beats North Carolina Central with a likelihood of 79.7%. Those games were played last night. The final scores were Arizona State over St. John’s 74-65 and North Dakota State over North Carolina Central 78-74. Alright! Two for two!

Let’s quickly check what effect increasing the number of games considered for each prediction has on accuracy. A season is no more than 50 games, so let’s sweep from 1 to 50.

scores = {}
for n_games in range(1, 51):
  X, y = build_season_data(all_data, n_games)
  model = linear_model.LogisticRegression()
  scores[n_games] = model_selection.cross_val_score(model, numpy.array(X), numpy.array(y)).mean()
print(max(scores, key=scores.get))
21

This means 21 is the optimal number of games to consider when predicting. Running this…

X, y = build_season_data(all_data, n_games=21)
model = linear_model.LogisticRegression()
model.fit(X, y)

Pulling the games we need to predict from the stage-two sample submission and writing our predictions to CSV…

submission_data = []
stage_two_sample = pd.read_csv(f'{path}/SampleSubmissionStage2.csv')
for index, row in stage_two_sample.iterrows():
  # Each ID is of the form 'year_team1_team2'
  prediction_year, t1, t2 = (int(x) for x in row['ID'].split('_'))
  prediction = predict_winner(t1, t2, model, prediction_year, stat_fields)
  label = str(prediction_year) + '_' + str(t1) + '_' + str(t2)
  submission_data.append([label, prediction[0][0]])

with open('submission.csv', 'w', newline='') as f:
  writer = csv.writer(f)
  writer.writerow(['ID', 'Pred'])
  writer.writerows(submission_data)

Finally, let’s fill in our bracket. Here it is :)