Skip to main content

Command Palette

Search for a command to run...

25: Netflix Data Pipeline with dbt and Snowflake 🎬

Published
•7 min read
25: Netflix Data Pipeline with dbt and Snowflake 🎬

Hey Data Engineers! 👋

In this project, we build an end-to-end data pipeline using dbt (Data Build Tool) and Snowflake, simulating a Netflix-style dataset. Starting from raw data in Amazon S3, we load it into Snowflake, apply transformations using dbt models, and implement best practices such as incremental and ephemeral materializations, snapshots, testing, and documentation. The project demonstrates how modern analytics workflows enable scalable, modular, and maintainable data engineering pipelines.

Project Setup

  • We create a new bucket ‘netflix-dbt-mehul-dataset‘ inside Amazon S3, which will serve as the layer where raw data gets extracted into.

  • Inside the bucket, we upload the following six files that simulate the Netflix’s movies dataset.

  • Now, we need to load the raw data into Snowflake warehouse. For achieving this, we create a new SQL Worksheet inside Snowflake.

  • We create a new role ‘TRANSFORM‘ and assign it to ‘ACCOUNTADMIN‘ role which has full privileges. Also, we create a new warehouse ‘COMPUTE_WH‘, which will provide the compute resources for the queries. Further, we create a new user ‘dbt‘ and configure its default role and warehouse as the ones we just created. Also, we create the database ‘MOVIELENS‘, which we will use throughout our project.

  • Finally, we grant all privileges on the resources we just created.

Connecting to AWS

  • Firstly, we create a new user ‘snowflakenetflixuser‘ with full access to Amazon S3.

  • Now, we create a stage inside Snowflake by providing the AWS user credentials that we configured in the previous step.

Loading data into Snowflake

  • We create a new table ‘raw_movies’ and load the data inside it from ‘movies.csv’ file inside S3.

  • Similarly, we create and populate the ‘raw_ratings‘, ‘raw_tags‘ and other tables as well.

DBT Setup

  • Firstly, we create a virtual environment inside VS Code.

  • For installing and initializing dbt-snowflake, we execute the following commands.
install dbt-snowflake using 

pip install dbt-snowflake==1.9.0

mkdir %userprofile%\.dbt

dbt init netflix
  • After successful initialization of DBT, the following ‘dbt_project.yml’ file gets created.

Models

  • DBT provides us with the ability to create modular SQL models.

  • Inside ‘models’ folder, we create a folder ‘staging‘, inside which we create a file ‘src_movies.sql‘, with the following code. After creating the file, we run ‘dbt run’ command inside the terminal.

  • Inside Snowflake, a view ‘SRC_MOVIES‘ gets created corresponding to our model.

  • Similarly, we create 5 more corresponding SQL models for our 5 other data files.

  • We get the corresponding views in Snowflake.

  • Inside the ‘dbt_project.yml’ file, we have the capability to choose the right materialization strategy. For our project, any models that would be created under ‘dim‘ and ‘fct‘ folders will create tables corresponding to them inside Snowflake.

Referencing another model

  • Inside ‘models’, we create a folder ‘dim’, which will store all the dimensional tables. Inside ‘dim’, we create a new file ‘dim_movies.sql’ and run the ‘dbt run’ command again.

  • This results in the creation of a corresponding table inside Snowflake, as expected.

  • Similarly, we create the dimension SQL model ‘dim_users.sql‘.

  • Similarly, we create ‘dim_genome_tags.sql’ model and execute the ‘dbt run’ comand.

  • As expected, the corresponding tables get created inside Snowflake as well.

  • Inside ‘models’, we create one more folder ‘fct’ for fact tables and create a SQL script ‘fct_genome_scores.sql’ and execute ‘dbt run’.

  • The table gets created as expected inside Snowflake.

Incremental materialization

  • In dbt, incremental materialization builds a table by processing only new or updated records instead of reloading the entire dataset each time.

  • Inside ‘fct’ folder, we create one more SQL model ‘fct_ratings.sql‘ and configure it for incremental materialization and execute the ‘dbt run’ command for this particular model.

  • We can confirm that the data gets populated inside a Snowflake table.

  • Inside ‘src_ratings’ table, we insert a new row with a greater timestamp than the highest timestamp available inside the table.

  • Now, if we run the ‘dbt run —select’ command on the ‘fct_ratings’ SQL model that depends upon the ‘src_rating’ table we just modified, the incremental model gets executed.

  • As expected, the ‘fct_ratings’ table gets modified and the incremental data gets added into it.

Ephemeral materialization

  • In dbt, ephemeral materialization creates temporary models that exist only during the run and are never written as tables or views. Instead, their SQL is inlined into downstream queries.

  • Inside ‘dim’ folder, we create ‘dim_movies_with_tags.sql’ model and create the configuration for ephemeral materialization and run the ‘dbt run’ for this model.

  • As stated in the definition, it doesn’t create any table or view.

  • But, if we create a new model ‘ep_movie_with_tags.sql‘ which references the ephemeral model and run the ‘dbt run’ command for it, the ephemeral model gets executed in runtime.

  • The model which depends upon the ephemeral table gives the exact same output that was expected from the ephemeral model.

Seed file

  • Seeds are useful for static reference data like country codes, mappings, or lookup values.

  • Inside ‘seed folder’, we create a file ‘seed_movie_release_dates.csv‘ and populate it with the following data. Afterwards, we run the ‘dbt seed’ command.

  • The corresponding table gets created inside Snowflake.

  • For the sake of completion, we create a new folder ‘mart‘ inside ‘models’ and create the SQL model ‘mart_movie_releases.sql‘ inside it. Then, we run this model.

  • The corresponding table gets created inside Snowflake as well.

  • Now, we create a ‘sources.yml’ file inside ‘models’ folder. It lets us define our raw data sources and reference these in models, in test and in documenting lineage.

  • Now, we can replace the source names accordingly.

Snapshots

  • In dbt, a snapshot is used to track and record historical changes in a table over time. It captures slowly changing dimensions (SCD Type 2) by storing previous values whenever a row is updated, enabling time-travel analysis and auditing of data.

  • Inside ‘snapshots’ folder, we create a file ‘snap_tags.sql’ and configure it as follows. Note that we use a surrogate key from a dbt package ‘db_utils‘, for which we need to create a ‘packages.yml’ file.

  • The packages.yml file looks as follows. Also, in order to install this package, we run the command ‘dbt deps’ command.

  • In order to create the snapshot, we run the ‘dbt snapshot‘ command.

  • Inside Snowflake, the ‘SNAPSHOTS’ folder gets created along with the corresponding ‘snap_tags‘ table.

  • This table looks as follows.

  • For understanding SCD Type 2, we update the ‘src_tags’ table as follows.

  • In order to take the snapshot of our modified table, we run the ‘‘dbt snapshot’ command again.

  • Now, we can see two versions of the same ‘user_id’, with different timestamps and their validity.

Testing

  • In dbt, general tests are predefined tests that check common data quality rules, such as unique, not_null, or accepted_values.

  • Inside ‘models’ folder, we create a file ‘schema.yml’ that looks as follows.

  • In order to test this, we execute ‘dbt test’ command. Note that all tests get passed.

  • Singular tests are custom SQL queries written by the user to validate specific conditions.

  • In order to create a singular test, we create a file ‘relevance_score_test.sql‘ inside ‘tests‘ folder. After creating the file, we run ‘dbt test’ to see if it passes. Note that for the condition ‘relevance_score <= 0’, the test case passes, meaning that all the relevance scores are positive.

Documentation

  • We run the command ‘dbt docs generate‘ and get the catalog built.

  • Now, we run ‘dbt docs serve’ which starts a local server for our documentation.

  • We also get a lineage graph for our models.

Macros

  • Macros are reusable pieces of SQL or logic written using Jinja templating.

  • Inside ‘macros’ folder, we create ‘no_nulls_in_columns.sql’ file, which looks as follows.

  • Now, inside ‘tests’ folder and inside the ‘relevance_score_test.sql’ file, we call the macro and execute the ‘dbt test’ command. Note that all tests pass.

Analysis

  • Inside the analyses folder, we create a file ‘movie_analysis.sql‘, which looks as follows. Also, we run the ‘dbt compile’ command.

  • We can get the compiled ‘movie_analysis.sql‘ file inside the ‘target’ folder.

  • We can execute this command inside Snowflake and perform further analysis upon the data.

Conclusion

Through this project, we explored the full lifecycle of a data engineering workflow using dbt and Snowflake. From loading raw data to building dimensional and fact tables, configuring materializations, implementing snapshots, and running data quality tests, we showcased how dbt facilitates efficient, reliable, and well-documented transformations.

Stay tuned!