Using Data to Predict the SuperBowl 50 Winner

By Shiraz Asif posted 02-05-2016 01:14 PM

It’s that time of year again! SuperBowl 50 is upon us, and this year the Bay Area is alive with excitement. Unfortunately it’s not because of our beloved Niners, but this game has the makings of a classic: the Denver Broncos against the upstart Carolina Panthers, the legend Peyton Manning against the up-and-coming new kid on the block, Cam Newton.

Last year, we used BigQuery to make a prediction about who would win the game. Our prediction of the Seahawks winning was ultimately wrong, but the data we used didn’t take into account the impact of deflated footballs :).

The focus of last year’s post was to showcase the capabilities of BigQuery, so we used the “Rank” function in BigQuery to evaluate both teams against each other in various offensive and defensive statistics. The one that “ranked” higher in more statistics was predicted to be the winner.

This year, we’ll try something a little different and perhaps a bit more scientific. Not only will we predict an outcome, we’ll also predict a score – based on the data. We’ll even throw in an interesting wrinkle – the excitement of the fans! Read on to find out what that means.

Now don’t go off and start betting all your hard-earned money based on our prediction. Analytics is supposed to be directionally correct, not absolute. Besides, maybe some devious no-good cheater will lower the uprights this year and throw my thought process off completely!

What does the data suggest? Let’s start by comparing the two teams side by side:

The Panthers seem to have an edge, but I don’t know if I would be too quick to write off Peyton Manning. The man is a living legend and brings with him a ton of intangibles. Here’s his chance to ride off into the sunset as a SuperBowl champion. He will be motivated. The question is: Will that be enough to overcome the juggernaut that is Cam Newton and the Carolina Panthers? Let’s see what the data says.

The Prediction

In order to build a prediction, we started by downloading data for all games of the current season and playoffs-to-date from http://www.pro-football-reference.com as a CSV file.

For the purpose of this simple example, we used Excel. I thought of leveraging R, but this post would have turned into a how-to of R and lost focus, so I’ll leave that for a future post.

Once opened in Excel, the data looks like: (below is just a small sample of the full dataset)

To keep things simple, we’ll only use Columns B through E.

PtsW (Column D) is the number of points for the winning team.
PtsL (Column E) is the number of points for the losing team.

The first step in formulating a prediction was to determine a league-wide reference against which to compare these two teams. We calculated the Average Points Scored by the Winning Team by summing the points scored by the winning team across all games and dividing by the number of games played. We then repeated this step to calculate the Average Points Scored by the Losing Team. The calculated averages were:

Avg. Points Scored by Winning Team: 28.29
Avg. Points Scored by Losing Team: 17.22
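For readers who would rather script this step than do it in Excel, here is a minimal Python sketch of the same averaging. The PtsW and PtsL column names come from the dataset described above; the Winner and Loser column names and the three sample rows are made up for illustration and are not real game results.

```python
import csv
import io

# Made-up rows for illustration only; these are NOT real game results.
# "Winner" and "Loser" are assumed column names; PtsW/PtsL are from the post.
SAMPLE_CSV = """Winner,Loser,PtsW,PtsL
Team A,Team B,30,20
Team C,Team A,24,17
Team B,Team C,27,10
"""


def league_averages(csv_text):
    """Return (avg points by winning teams, avg points by losing teams)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    games = len(rows)
    avg_w = sum(int(r["PtsW"]) for r in rows) / games
    avg_l = sum(int(r["PtsL"]) for r in rows) / games
    return avg_w, avg_l


avg_w, avg_l = league_averages(SAMPLE_CSV)
print(f"Avg. Points Scored by Winning Team: {avg_w:.2f}")
print(f"Avg. Points Scored by Losing Team: {avg_l:.2f}")
```

Run against the full season CSV instead of the sample text, the same function should give the league-wide reference figures.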

This shows that on average, the winning team scored ~28 points, and allowed ~17 points. These averages serve as a reference point from which we can determine the relative performance of both the Panthers and the Broncos. The next step is to isolate all the games for the Panthers and Broncos only, and then calculate the Avg. Points Scored and Allowed for the Panthers/Broncos only. I ended up with the following:

This is a really great starting point because we can easily see that the Panthers are not only a far better offensive team than the Broncos, but also better than the league-wide winning-team average of 28.29. It also shows the Broncos are the better defensive team.
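The per-team step can be sketched the same way. The idea is that a team's points scored are PtsW when it won and PtsL when it lost, and the reverse for points allowed. The team names and scores below are placeholders, not real results, and the function name is just something I picked for this sketch.

```python
# Made-up games as (winner, loser, PtsW, PtsL); NOT real results.
GAMES = [
    ("Team A", "Team B", 30, 20),
    ("Team C", "Team A", 24, 17),
    ("Team A", "Team C", 28, 14),
]


def team_averages(games, team):
    """Return (avg points scored, avg points allowed) for one team."""
    scored, allowed = [], []
    for winner, loser, pts_w, pts_l in games:
        if team == winner:
            # The team won: it scored PtsW and allowed PtsL.
            scored.append(pts_w)
            allowed.append(pts_l)
        elif team == loser:
            # The team lost: it scored PtsL and allowed PtsW.
            scored.append(pts_l)
            allowed.append(pts_w)
    if not scored:
        raise ValueError(f"no games found for {team}")
    return sum(scored) / len(scored), sum(allowed) / len(allowed)


avg_scored, avg_allowed = team_averages(GAMES, "Team A")
print(f"Team A scored {avg_scored:.2f} and allowed {avg_allowed:.2f} per game")
```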

Now that we have both the league and team-specific averages, we can use them to build a relative performance index.

We decided to build out an Offensive and Defensive Strength Index by taking the team averages and dividing them by the league averages. This resulted in the following table:

*Note: A higher offensive strength index indicates more Offensive Strength. The opposite holds true for Defensive Strength. The lower the index, the greater the team’s Defensive Strength. In other words, it is expected that a team will score less playing against a team with a lower Defensive Strength index, as compared to playing against one with a higher index.

The above supports our assertion that the Broncos are a better defensive team, while the Panthers are a far better offensive team.
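As a rough check on the offensive side of the index, here is a small Python sketch that divides each team's average points scored by the league-wide winning-team average. The three input numbers are the ones quoted in this post; the function and variable names are my own. The defensive index would be built the same way from average points allowed, but I have left it out of the sketch because the points-allowed averages are not listed in the text.

```python
LEAGUE_AVG_WINNER_PTS = 28.29  # league-wide average quoted in the post
BRONCOS_AVG_SCORED = 22.11     # Broncos Average Points Scored, from the post
PANTHERS_AVG_SCORED = 32.22    # Panthers Average Points Scored, from the post


def strength_index(team_avg, league_avg):
    """Ratio of a team's average to the league average (1.0 = league level)."""
    return team_avg / league_avg


broncos_off = strength_index(BRONCOS_AVG_SCORED, LEAGUE_AVG_WINNER_PTS)
panthers_off = strength_index(PANTHERS_AVG_SCORED, LEAGUE_AVG_WINNER_PTS)
print(f"Broncos Offensive Strength: {broncos_off:.2f}")
print(f"Panthers Offensive Strength: {panthers_off:.2f}")
```

Rounded to two decimals, these come out to the 0.78 and 1.14 used in the prediction.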

Defense wins championships right? Or is the best defense a good offense? Which is it? Let’s see which metric proves to be stronger in our prediction.

The next step is to determine a way to use this performance index to calculate a predicted result. In order to do this, we will use the following formula:

Broncos predicted score =
Broncos Offensive Strength X Panthers Defensive Strength X Broncos Average Points Scored

Panthers predicted score =
Panthers Offensive Strength X Broncos Defensive Strength X Panthers Average Points Scored

This formula helps us weigh the two teams’ relative performance over the course of the year against each other and come up with a prediction.

While this may be an overly simple way of looking at this, the point of the post is to highlight a simple yet sound data-driven mechanism to formulate a prediction. It’s often best to just keep things simple, both in sports predictions and especially in business, when we are using data to make educated decisions and forecast future performance.

So let’s figure out who will win!

Broncos predicted score =
0.78 X 1.12 X 22.11

Panthers predicted score =
1.14 X 1.06 X 32.22

Based on the calculation above, our prediction is…
(drum roll please)

The Panthers will win SuperBowl 50!

Broncos: 19
Panthers: 39
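If you want to reproduce the arithmetic, the formula is a one-liner. This sketch plugs in the rounded index values and averages quoted above; the function name is my own choice.

```python
def predicted_score(off_strength, opp_def_strength, avg_points_scored):
    """Own offensive index x opponent's defensive index x own scoring average."""
    return off_strength * opp_def_strength * avg_points_scored


# Inputs are the rounded values quoted in the post.
broncos = predicted_score(0.78, 1.12, 22.11)
panthers = predicted_score(1.14, 1.06, 32.22)

print(f"Broncos:  {broncos:.2f} -> {round(broncos)}")
print(f"Panthers: {panthers:.2f} -> {round(panthers)}")
```

The unrounded values land at about 19.3 and 38.9, which round to the 19 and 39 above.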

Probability Matrix

Wait…there’s more. Our prediction can’t possibly stop there! The data shows the Panthers will win, and by quite a resounding margin, but what is the likelihood of this result? To answer that question, we turn our attention to Data Science and leverage the Poisson Distribution to check the probability of the above result. When laid out on a matrix of probable results, we end up with a rather large monstrosity of results.

In the diagram below, the X axis at the top represents the possible points scored by the Broncos, and the Y axis on the left represents the possible points scored by the Panthers.

I could have included the entire matrix of probabilities but nothing would be legible, so I zoomed into the matrix to highlight the most probable results as follows:

If you look at our predicted score of 39-19, you can see that it has a 0.579% probability of being the exact final score, which is the highest probability in the matrix. That may sound tiny, but with so many possible score combinations, any single exact score is a long shot. After the game is over, you can come back to this post and look up the score to determine what the actual probability was!
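Here is a minimal sketch of how a matrix like this can be built in Python. It assumes each team's score is an independent Poisson variable whose mean is that team's predicted score, which is my reading of the approach rather than a confirmed description of the spreadsheet. It also uses the rounded inputs from the formula above, so individual cells may differ slightly from the original matrix. The 100-point grid cutoff is an arbitrary choice. With these inputs the 39-19 cell comes out to roughly 0.0058, or about 0.58%.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Means taken from the predicted-score formula, using the post's rounded inputs.
BRONCOS_MEAN = 0.78 * 1.12 * 22.11
PANTHERS_MEAN = 1.14 * 1.06 * 32.22
MAX_POINTS = 100  # grid cutoff (assumption); probability beyond it is negligible


def score_matrix(panthers_mean, broncos_mean, max_points=MAX_POINTS):
    """matrix[p][b] = P(Panthers score p and Broncos score b), assuming independence."""
    p_probs = [poisson_pmf(p, panthers_mean) for p in range(max_points + 1)]
    b_probs = [poisson_pmf(b, broncos_mean) for b in range(max_points + 1)]
    # Rows are Panthers points (Y axis), columns are Broncos points (X axis).
    return [[pp * bp for bp in b_probs] for pp in p_probs]


matrix = score_matrix(PANTHERS_MEAN, BRONCOS_MEAN)

exact_39_19 = matrix[39][19]
# Sum every cell where the Panthers outscore the Broncos.
panthers_win = sum(matrix[p][b] for p in range(MAX_POINTS + 1) for b in range(p))

print(f"P(Panthers 39, Broncos 19) = {exact_39_19:.5f}")
print(f"P(Panthers win)            = {panthers_win:.3f}")
```

Summing cells is also a handy way to get more useful numbers out of the matrix than a single exact score, such as the overall chance of a Panthers win under the model.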

We derived this result by comparing the relative performance of the two teams against each other. It will be interesting to see how well this model holds up for previous SuperBowls, but I won’t go into that today.

Fan Sentiment

In sports, another factor that is often overlooked is the excitement of the fans, which I’ll call the “fan hype factor.” The hype of the fans often adds an intangible element to the game, either adding tremendous pressure to the athletes by setting unrealistic expectations, or in the case of a team that is the underdog, a feeling of unwavering support. Ask any athlete – you can’t underestimate the power of the hype factor.

Instead of taking a standard approach to seeking out fan opinion, we tapped into the one place where avid fans massively express their unsolicited opinions online: social media!

We wanted to take a closer look at the fan-generated conversation surrounding the two teams by harnessing the power of the social media analytics platform, Crimson Hexagon. The patented technology developed at Harvard University’s Institute for Quantitative Social Science enabled us to make our analysis very scalable. We selected a small sample of posts expressing fan excitement and fan predictions and classified them into relevant categories. The platform was then able to efficiently and accurately do all the heavy lifting for us. A total of 36,103 relevant tweets were analyzed in a matter of minutes. The results were then visualized, and we began to see a story unfold which ultimately leads us to the same prediction.

The fans have the most intimate knowledge of their teams, and of course this knowledge and excitement / expectation level is reflected in their sentiment online. Based on the categories we set up to group predictions and excitement, we saw the following:

Those Panthers fans sure are excited!!!

The above data is shown in aggregate over the course of the last week. How about if we do a daily trend of the number of social posts for those same groupings?

Social sentiment weighs in favor of the Panthers, but you can see a strong increase for the Broncos on the 2nd of Feb. This is going to make for a very interesting game.

Conclusion

So there you have it. Let’s go Panthers!

All credit for this post goes to my colleagues Mike Anderson and Abdullah Alkeilani for their insight and input.

Comments

David Campanella

02-11-2016 05:59 PM

Actually I can't go back to your table and see the probability of the actual score since we were zoomed past the panthers scoring only 10 points. :)
Great post though, very interesting in trying to predict something that is very difficult to quantify.

Shiraz Asif

02-08-2016 01:17 PM

Funny :) it's just sports. Predicting results adds to the excitement. :) Would have been great to be right!

Erik Danley

02-08-2016 07:55 AM

This is why data can't be used to predict sporting games.
