Snowflake is one of the leading Cloud Data Warehouses. With the ability to provide amazing speed and execution, it has revolutionised data and analytics. The first step in unlocking the power of Snowflake is to load and combine your data from your existing source systems.
Each of our clients have their own unique operating environment, and while there is no single approach that will work for all, there are common proven patterns that we utilise when they are appropriate. In this article we will talk through one of our recommended data pipeline patterns and why each part of the stack is utilised.
S3: Your Data's Gateway to Snowflake
It is likely that you have data in a variety of source systems that you will want to load into Snowflake. Your data pipeline is responsible for moving the data from the source and eventually into a transformed data model sitting inside Snowflake. There are a million different ways to achieve that objective. The thing is that if you build each pipeline from source to destination in isolation, you are quickly going to end up with a sprawling technology landscape that presents a headache to maintain. What we try to do is to get the sources utilsing the same data pipeline pattern as quickly as possible. In our experience the AWS S3 bucket, generally provides a path to consistency and ease of loading into Snowflake.
The utility of S3
Many source systems now include an out-of-the-box data pushing capability directly to S3. This can be a real time streaming approach or a batch/scheduled approach. Either way, if your source system has this capability, you are winning and should definitely consider utilising it in your data pipeline.
The SaaS Saviors
If your source system doesn’t natively support an S3 push, it’s OK there are a wealth of SaaS solutions that in most cases have the capability to move data from source to S3. There are a wide range of low code visual SaaS tools including the likes of FiveTran, Hevo, Skyvia, Zapier, AWS AirFlow and AWS Glue to name just a few.
If you are looking for a more obscure connector then you can also check out Portable.io who have over 500 connectors and are willing to write custom connectors if there is enough market interest.
Once data is staged in S3, you then can have that important single unified data loading and transformation pattern for all your source systems into Snowflake.
Setting the Stage
Snowflake has a tight integration with S3 through the concept of the external stage. It is a simple four step process to setup your stage in S3 and move the data into the landing zone inside Snowflake.
Step 1: Login and Select Database and Schema
Log in to your Snowflake account.
Select the database and schema where you want to create the stage.
Step 2: Create the External Stage
Use a simple SQL command like this: CREATE OR REPLACE STAGE my_stage URL = 's3://your-bucket-name';
Step 3: Grant Permissions
Grant permissions to users or roles as needed.
Step 4: Copy your data from S3 to a SnowFlake table
Then use this simple SQL command to copy the data into your SnowFlake table COPY INTO my_table FROM @my_stage FILES = ('file1.json') FILE_FORMAT = (TYPE = 'JSON');.
Data transformation with dbt
Transformation is where the real magic happens, and our tool of choice is dbt. dbt is short for "data build tool”. It's a cost-effective, user-friendly, and downright brilliant tool that makes data transformation a breeze. Here's why we like dbt:
Ease of Use - Its intuitive interface and simple syntax, makes it easy for us and easy for our clients to manage themselves.
Cost-Effective - dbt is budget-friendly, even the free version can provide what smaller clients need.
Extensive Transformation Capabilities - dbt's use of SQL to create custom business logic within a tested environment, makes dbt much more accessible for a wider range of users.
Versioned Control - centralised and versioned controlled scripts mean that team collaboration is much each.
In-Built Data Lineage - this is particularly useful when migrating legacy spaghetti pipelines.
Snowflake is a leading Cloud Data Warehouse, due to its remarkable speed and execution. To unlock its full potential, the journey begins with consolidating data from your diverse source systems.
While there's no universal solution, we've explored a recommended data pipeline pattern and common practices. The key is to funnel the data as quickly as possible into a consistent pipeline, for us that often begins with AWS S3, whether natively or through SaaS solutions. Snowflake's integration with S3 ensures a seamless transition, while dbt empowers data transformation with its user-friendly tools and extensive capabilities.