Tips and Tricks #154: Implement Idempotent ETL with Merge Statements

Use MERGE (upsert) for safe, rerunnable data pipelines that handle duplicates gracefully.

Code Snippets

-- PostgreSQL: INSERT ON CONFLICT (upsert)
-- Note: ON CONFLICT requires a unique constraint or unique index on customer_id.
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW()
-- The WHERE clause skips no-op updates, so updated_at only moves when the
-- data actually changed (use IS DISTINCT FROM if name or email can be NULL).
WHERE dim_customer.name != EXCLUDED.name
   OR dim_customer.email != EXCLUDED.email;

-- SQL Server / Snowflake: MERGE
-- Note: MERGE fails if the source has more than one row per customer_id,
-- so de-duplicate staging_customers first (see the sketch below).
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());
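
If staging_customers can hold several rows for the same customer_id (common with CDC feeds), de-duplicate the source inside the USING clause so only the latest row per key reaches the merge. A minimal sketch, assuming a load_ts ingestion-timestamp column on the staging table (that column name is an assumption, not part of the example above):

-- De-duplicated MERGE: keep only the most recent staging row per customer_id.
-- load_ts is an assumed ingestion timestamp; substitute your own ordering column.
MERGE INTO dim_customer AS target
USING (
    SELECT customer_id, name, email
    FROM (
        SELECT customer_id, name, email,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY load_ts DESC
               ) AS rn
        FROM staging_customers
    ) ranked
    WHERE rn = 1
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());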

Why This Helps

  • Safe to rerun without duplicates
  • Single statement instead of delete+insert
  • Tracks only actual changes

How to Test

  • Run pipeline twice, verify no duplicates
  • Check row counts and updated_at timestamps (sample check queries below)
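
A couple of quick checks along those lines; the timestamp literal is a placeholder for your second run's start time:

-- 1. After two runs, the business key should still be unique.
SELECT customer_id, COUNT(*) AS row_count
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- 2. A rerun with unchanged source data should touch no rows:
--    nothing should carry an updated_at inside the second run's window.
SELECT COUNT(*) AS rows_touched
FROM dim_customer
WHERE updated_at >= '2024-06-01 02:00:00';  -- placeholder: second run's start time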

When to Use

Dimension table loads, CDC processing, any ETL that may be rerun.

Performance/Security Notes

Back the merge key (customer_id here) with a primary key or unique index: PostgreSQL's ON CONFLICT requires one, and MERGE matching is far cheaper with it. The change-detection predicates keep no-op updates from rewriting unchanged rows on every run. Always include updated_at for auditing, and consider SCD Type 2 if you need historical tracking (a rough sketch follows).
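
For reference, a minimal SCD Type 2 sketch. It assumes dim_customer carries valid_from, valid_to, and is_current columns (those column names are assumptions for illustration): expire the current row when attributes change, then insert a fresh current row.

-- Minimal SCD Type 2 sketch: assumes valid_from, valid_to, is_current
-- columns on dim_customer (column names are illustrative).

-- Step 1: close out current rows whose attributes changed.
UPDATE dim_customer
SET valid_to = CURRENT_TIMESTAMP,
    is_current = 0
WHERE is_current = 1
  AND EXISTS (
      SELECT 1
      FROM staging_customers s
      WHERE s.customer_id = dim_customer.customer_id
        AND (s.name != dim_customer.name OR s.email != dim_customer.email)
  );

-- Step 2: insert a new current row for brand-new customers and for
-- customers whose current row was just expired in Step 1.
INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
SELECT s.customer_id, s.name, s.email, CURRENT_TIMESTAMP, NULL, 1
FROM staging_customers s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id
 AND d.is_current = 1
WHERE d.customer_id IS NULL;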

Try this tip in your next project and share your results in the comments!

