It’s that time of year again! SuperBowl 50 is upon us, and this year
the Bay Area is alive with excitement. Unfortunately it’s not because
of our beloved Niners, but this game has the makings of a classic. The
Denver Broncos against the upstart Carolina Panthers. The legend Peyton
Manning against the up and coming new kid on the block, Cam Newton.
Last year, we used BigQuery to make a prediction about who would win
the game. Our prediction of the Seahawks winning was ultimately wrong,
but the data we used didn’t take into account the impact of deflated
footballs :).
The focus of last year’s post was to showcase the capabilities of
BigQuery, so we used the “Rank” function in BigQuery to evaluate both
teams against each other in various offensive and defensive statistics.
The one that “ranked” higher in more statistics was predicted to be the
winner.
This year, we’ll try something a little different and perhaps a bit
more scientific. Not only will we predict an outcome, we’ll also
predict a score – based on the data. We’ll even throw in an interesting
wrinkle – the excitement of the fans! Read on to find out what that
means.
Now don’t go off and start betting all your hard-earned money based
on our prediction. Analytics is supposed to be directionally correct,
not absolute. Besides, maybe some devious no-good cheater will
lower the uprights this year and throw my thought process off
completely!
What does the data suggest? Let’s start by comparing the two teams side by side:
The Panthers seem to have an edge, but I don’t know if I would be too
quick to write off Peyton Manning. The man is a living legend and
brings with him a ton of intangibles. Here’s his chance to ride off into
the sunset as a SuperBowl champion. He will be motivated. The
question is: Will that be enough to overcome the juggernaut that is Cam
Newton and the Carolina Panthers? Let’s see what the data says.
The Prediction
In order to build a prediction, we started by downloading data for all games of the current season and playoffs-to-date from http://www.pro-football-reference.com as a CSV file.
For the purpose of this simple example, we used Excel. I thought of
leveraging R, but this post would have turned into a how-to of R and
lost focus, so I’ll leave that for a future post.
Once opened in Excel, the data looks like this (a small sample of the full dataset):
To keep things simple, we’ll only use Columns B through E.
- PtsW (Column D) is the number of points for the winning team.
- PtsL (Column E) is the number of points for the losing team.
The first step in formulating a prediction was to determine a league
wide reference from which to compare these two teams. We calculated the
Average Points Scored by the Winning Team by summing the total number
of points scored by the winning team and dividing by the number of games
played. We then repeated this step to also calculate the Average
Points Scored by the Losing Team. The calculated averages were:
- Avg. Points Scored by Winning Team: 28.29
- Avg. Points Scored by Losing Team: 17.22
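For readers who would rather script this step than do it in Excel, here is a minimal Python sketch of the same averaging. Only the PtsW and PtsL column names come from the export described above; the Winner and Loser column names, the team names, and every number in the sample are made up for illustration.

```python
import csv
import io

# Made-up rows standing in for the downloaded CSV. Only the PtsW / PtsL
# column names come from the post; everything else here is illustrative.
SAMPLE_CSV = """Winner,Loser,PtsW,PtsL
Team A,Team B,28,21
Team C,Team D,31,10
Team B,Team C,24,17
"""

def league_averages(csv_text):
    """Return (avg points of winning teams, avg points of losing teams)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    games = len(rows)
    avg_w = sum(int(r["PtsW"]) for r in rows) / games
    avg_l = sum(int(r["PtsL"]) for r in rows) / games
    return avg_w, avg_l

avg_w, avg_l = league_averages(SAMPLE_CSV)
print(f"winners: {avg_w:.2f}, losers: {avg_l:.2f}")
```

Run against the full season file instead of the sample, this should produce the two league-wide reference numbers.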
This shows that on average, the winning team scored ~28 points, and
allowed ~17 points. These averages serve as a reference point from
which we can determine the relative performance of both the Panthers and
the Broncos. The next step is to isolate all the games for the
Panthers and Broncos only, and then calculate the Avg. Points Scored and
Allowed for the Panthers/Broncos only. I ended up with the following:
This is a really great starting point because we can easily see that
the Panthers are a far better offensive team than the Broncos, and that
they also score more than the league average of 28.29. It also shows the
Broncos are the better defensive team.
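The per-team step can be sketched the same way. This is only an illustration of the idea of isolating one team's games and averaging points scored and allowed; the game tuples, team names, and scores below are hypothetical, not the real season data.

```python
# Illustrative games as (winner, loser, pts_w, pts_l); all values are made up.
GAMES = [
    ("Team A", "Team B", 28, 21),
    ("Team C", "Team A", 31, 10),
    ("Team A", "Team C", 24, 17),
]

def team_averages(games, team):
    """Average points scored and allowed by `team` across the games it played."""
    scored, allowed = [], []
    for winner, loser, pts_w, pts_l in games:
        if team == winner:
            scored.append(pts_w)
            allowed.append(pts_l)
        elif team == loser:
            scored.append(pts_l)
            allowed.append(pts_w)
    if not scored:
        raise ValueError(f"no games found for {team}")
    return sum(scored) / len(scored), sum(allowed) / len(allowed)

avg_scored, avg_allowed = team_averages(GAMES, "Team A")
print(f"scored: {avg_scored:.2f}, allowed: {avg_allowed:.2f}")
```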
Now that we have both the league and team-specific averages, we can
use them to build a relative performance index.
We decided to build out an Offensive and Defensive Strength Index by
taking the team averages and dividing them by the league averages. This
resulted in the following table:
*Note: A higher offensive strength index indicates more Offensive
Strength. The opposite holds true for Defensive Strength. The lower
the index, the greater the team’s Defensive Strength. In other words,
it is expected that a team will score less playing against a team with a
lower Defensive Strength index, as compared to playing against one with
a higher index.
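As a rough sketch, the index is just a ratio. The offensive numbers below are the averages quoted in this post. The defensive index would be computed the same way from points allowed; which league average serves as its baseline is not spelled out here, so treat that part as an assumption if you reproduce it.

```python
def strength_index(team_average, league_average):
    """Ratio of a team's average to the league reference average."""
    return team_average / league_average

LEAGUE_AVG_WINNER_POINTS = 28.29  # league-wide average quoted in the post

# Offensive indices from the team scoring averages quoted in the post.
broncos_offense = strength_index(22.11, LEAGUE_AVG_WINNER_POINTS)
panthers_offense = strength_index(32.22, LEAGUE_AVG_WINNER_POINTS)

print(f"Broncos offense:  {broncos_offense:.2f}")
print(f"Panthers offense: {panthers_offense:.2f}")
```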
The above supports our assertion that the Broncos are a better
defensive team, while the Panthers are a far better offensive team.
Defense wins championships right? Or is the best defense a good
offense? Which is it? Let’s see which metric proves to be stronger in
our prediction.
The next step is to determine a way to use this performance index to
calculate a predicted result. In order to do this, we will use the
following formula:
Broncos predicted score =
Broncos Offensive Strength X Panthers Defensive Strength X Broncos Average Points Scored
Panthers predicted score =
Panthers Offensive Strength X Broncos Defensive Strength X Panthers Average Points Scored
This formula helps us weigh the two teams’ relative performance over
the course of the year against each other and come up with a prediction.
While this may be an overly simple way of looking at it, the
point of the post is to highlight a simple yet sound data-driven
mechanism to formulate a prediction. It’s often best to just keep
things simple, both in sports predictions and especially in business
when we are using data to make educated decisions and forecast future
performance.
So let’s figure out who will win!
Broncos predicted score =
0.78 X 1.12 X 22.11
Panthers predicted score =
1.14 X 1.06 X 32.22
Based on the calculation above, our prediction is…
(drum roll please)
The Panthers will win SuperBowl 50!
Broncos: 19
Panthers: 39
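If you want to check the arithmetic, the formula fits in a few lines of Python. The function name is just a label chosen for this sketch; the inputs are the index values and averages used above.

```python
def predicted_score(offense_index, opponent_defense_index, avg_points_scored):
    """Offensive strength x opponent's defensive strength x own scoring average."""
    return offense_index * opponent_defense_index * avg_points_scored

broncos = predicted_score(0.78, 1.12, 22.11)
panthers = predicted_score(1.14, 1.06, 32.22)

# Round to whole points for the final prediction.
print(f"Broncos: {round(broncos)}")
print(f"Panthers: {round(panthers)}")
```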
Probability Matrix
Wait…there’s more. Our prediction can’t possibly stop there! The
data shows the Panthers will win, and by quite a resounding margin, but
what is the likelihood of this result? To answer that question, we turn
our attention to Data Science and leverage the Poisson Distribution to
check the probability of the above result. When laid out on a matrix of
probable results, we end up with a rather large monstrosity of results.
In the diagram below, the X axis at the top represents the possible
points scored by the Broncos, and the Y axis on the left represents the
possible points scored by the Panthers.
I could have included the entire matrix of probabilities but nothing
would be legible, so I zoomed into the matrix to highlight the most
probable results as follows:
If you look at our predicted score of 39-19, you can see that it has a
0.579% probability of being the exact final score, which is the highest
probability in the matrix. Any single exact score is a long shot, so
even the most likely cell sits well under 1%. After the game is over,
you can come back to this post and look up the score to determine what
the actual probability was!
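Here is a minimal sketch of how such a matrix could be built in Python. It rests on two assumptions that the post does not spell out: each team's points follow an independent Poisson distribution, and the mean of each distribution is that team's unrounded predicted score. Real football scores come in chunks of 2, 3, 6, 7 and 8, so read this as a rough illustration rather than a proper scoring model.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def score_matrix(lam_panthers, lam_broncos, max_points=80):
    """matrix[p][b] = P(Panthers score p and Broncos score b), assuming independence."""
    pan = [poisson_pmf(p, lam_panthers) for p in range(max_points + 1)]
    bro = [poisson_pmf(b, lam_broncos) for b in range(max_points + 1)]
    return [[pp * pb for pb in bro] for pp in pan]

# Means taken from the unrounded predicted scores (an assumption of this sketch).
lam_panthers = 1.14 * 1.06 * 32.22
lam_broncos = 0.78 * 1.12 * 22.11

matrix = score_matrix(lam_panthers, lam_broncos)
p_exact = matrix[39][19]

# Summing every cell where the Panthers outscore the Broncos gives a rough
# win probability under the same assumptions.
p_panthers_win = sum(
    matrix[p][b] for p in range(len(matrix)) for b in range(p)
)

print(f"P(Panthers 39, Broncos 19) = {p_exact:.3%}")
print(f"P(Panthers outscore Broncos) = {p_panthers_win:.1%}")
```

Each cell is the product of two single-team probabilities, which is why any one exact score comes out small even when it sits at the peak of the matrix.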
We derived this result by comparing the relative performance of the
two teams against each other. It would be interesting to see how this
model holds up against previous SuperBowls, but I won’t go into that
today.
Fan Sentiment
In sports, another factor that is often overlooked is the excitement
of the fans, which I’ll call the “fan hype factor.” The hype of the
fans often adds an intangible element to the game, either putting
tremendous pressure on the athletes by setting unrealistic expectations
or, in the case of an underdog team, giving them a feeling of unwavering
support. Ask any athlete – you can’t underestimate the power of the
hype factor.
Instead of taking a standard approach to seeking out fan opinion, we
tapped into the one place where avid fans massively express their
unsolicited opinions online: social media!
We wanted to take a closer look at the fan-generated conversation
surrounding the two teams by harnessing the power of the social media
analytics platform, Crimson Hexagon.
The platform’s patented technology, developed at Harvard University’s
Institute for Quantitative Social Science, made our analysis very
scalable. We selected a small sample of posts expressing fan excitement
and fan predictions and classified them into relevant categories. The
platform then did all the heavy lifting for us, efficiently and
accurately: a total of 36,103 relevant tweets were analyzed in a matter
of minutes. Once the results are visualized, a story begins to unfold,
one that ultimately leads us to the same prediction.
The fans have the most intimate knowledge of their teams, and of
course this knowledge and excitement / expectation level is reflected in
their sentiment online. Based on the categories we set up to group
predictions and excitement, we saw the following:
Those Panthers fans sure are excited!!!
The above data is shown in aggregate over the course of the last
week. How about if we do a daily trend of the number of social posts
for those same groupings?
Social sentiment weighs in favor of the Panthers, but you can see a
strong increase for the Broncos on the 2nd of Feb. This is
going to make for a very interesting game.
Conclusion
So there you have it. Let’s go Panthers!
All credit for this post goes to my colleagues Mike Anderson and Abdullah Alkeilani for their insight and input.