Data Engineering

Public NHL data modeling

A data engineering journey towards modernizing NHL data

Date: 05 Jun 2022
Tags: all-projects , data-engineering , python , singer , meltano , bigquery , dbt

This is the first in a series of posts that delve into exploring, organizing, and modeling public hockey (NHL) data. I am one of the two developers of this work, alongside my colleague Gavin He. I will occasionally cross-post the cool work we do within our “organization” (called the-data-base) here on fullstaxx.

A modern application of data-engineering to enable data science on public Hockey (NHL) data for the purposes of learning & development

Introduction
Architecture
Setup
Resources
Developer contact

Introduction

The motivation behind this project was simple: make public hockey data available using modern technologies for the purposes of data-science & data-visualization. We wanted to be able to answer questions like…

Which players are most likely to have a breakout season next year?
Which draft prospects are most likely to succeed in the NHL?
How many goals should we expect from elite players like Connor McDavid or Auston Matthews next season?
Where on the ice are individual players most efficient with their shooting?

Architecture

Of course, getting to the point where we can answer these questions required a lot of data engineering. Below is a visual representation of the project architecture.

Miro project architecture

Data extraction

Currently, we have a single source of data: the NHL Stats API. The GitHub repo that we built to extract the data is called tap-nhl. It is a Singer tap for the NHL Stats API.

It is built with the Meltano Tap SDK for Singer Taps.

Below is a flow diagram explaining how it works:

Mermaid Plot
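
For readers new to Singer, here is a minimal sketch of what a tap produces: a stream of JSON messages, one per line, of type SCHEMA, RECORD, and STATE. This is not the tap-nhl code and it does not use the Meltano SDK; the stream name, schema, sample rows, and the shape of the state value are all hypothetical and only there to illustrate the message format.

```python
import io
import json
from typing import Any, Dict, Iterable, List, TextIO


def emit(message: Dict[str, Any], out: TextIO) -> None:
    """Write one Singer message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")


def sync_stream(
    stream: str,
    schema: Dict[str, Any],
    key_properties: List[str],
    records: Iterable[Dict[str, Any]],
    out: TextIO,
) -> None:
    """Emit a SCHEMA message, one RECORD per row, then a STATE message."""
    emit(
        {
            "type": "SCHEMA",
            "stream": stream,
            "schema": schema,
            "key_properties": key_properties,
        },
        out,
    )
    count = 0
    for record in records:
        emit({"type": "RECORD", "stream": stream, "record": record}, out)
        count += 1
    # The STATE value is free-form JSON; this layout is an assumption.
    emit({"type": "STATE", "value": {"bookmarks": {stream: {"records_synced": count}}}}, out)


# Hypothetical schema and rows, standing in for an API response.
TEAMS_SCHEMA = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
}
SAMPLE_TEAMS = [
    {"id": 1, "name": "Example Team A"},
    {"id": 2, "name": "Example Team B"},
]

if __name__ == "__main__":
    buffer = io.StringIO()
    sync_stream("teams", TEAMS_SCHEMA, ["id"], SAMPLE_TEAMS, buffer)
    print(buffer.getvalue(), end="")
```

A Singer target reads these lines from standard input and loads the records into the destination, which is what lets the same tap feed different warehouses.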

Resources

Repo: tap-nhl

Data transformation & loading

All of this work is contained within a GitHub repo called dbt-nhl-breakouts and uses dbt to model our raw data. It contains the source code used to transform raw NHL data from the NHL Stats API into analysis-ready models.

In other words, this is where the SQL magic happens using dbt. Ultimately, this work converts confusing raw data into:

Data analyst/scientist friendly datasets all within one data warehouse (BigQuery)
Well-documented tables, field definitions, and queries
Reliable data that is tested and validated before ever making it into production
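
To give a feel for what "tested and validated" can mean, the sketch below mirrors the logic of two of dbt's built-in generic tests, `not_null` and `unique`. dbt itself runs these as SQL queries against the warehouse; the Python here is only an illustration of the checks, and the column names and rows are hypothetical, not taken from our models.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def not_null_failures(rows: List[Row], column: str) -> List[Row]:
    """Return the rows where the column is missing or null."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows: List[Row], column: str) -> List[Any]:
    """Return the non-null values that appear more than once in the column."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical rows: one duplicated key and one null key.
SAMPLE_ROWS: List[Row] = [
    {"player_id": 1, "full_name": "Player A"},
    {"player_id": 2, "full_name": "Player B"},
    {"player_id": 2, "full_name": "Player B (duplicate)"},
    {"player_id": None, "full_name": "Player C"},
]

if __name__ == "__main__":
    print("null keys:", not_null_failures(SAMPLE_ROWS, "player_id"))
    print("duplicate keys:", unique_failures(SAMPLE_ROWS, "player_id"))
```

A model passes a test when the corresponding list of failures is empty, so a build can stop before bad rows reach production tables.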

Resources

Repo: dbt-nhl-breakouts
Documentation: dbt generated documentation

Data science

Consider this section separate from the rest. Each question that we decide to answer with our newly modeled data will live in this bucket. For example, one of the projects that spawned from this was the nhl-breakouts project.