Tips and Tricks #154: Implement Idempotent ETL with Merge Statements

Use MERGE (upsert) for safe, rerunnable data pipelines that handle duplicates gracefully.

Code Snippets

-- PostgreSQL: INSERT ON CONFLICT (upsert)
-- Note: ON CONFLICT requires a unique constraint or unique index on customer_id.
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW()
-- The WHERE clause skips no-op updates, so updated_at only moves when the
-- data actually changed (use IS DISTINCT FROM if name or email can be NULL).
WHERE dim_customer.name != EXCLUDED.name
   OR dim_customer.email != EXCLUDED.email;

-- SQL Server / Snowflake: MERGE
-- Note: MERGE fails if the source has more than one row per customer_id,
-- so de-duplicate staging_customers first (see the sketch below).
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());
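
If staging_customers can hold several rows for the same customer_id (common with CDC feeds), de-duplicate the source inside the USING clause so only the latest row per key reaches the merge. A minimal sketch, assuming a load_ts ingestion-timestamp column on the staging table (that column name is an assumption, not part of the example above):

-- De-duplicated MERGE: keep only the most recent staging row per customer_id.
-- load_ts is an assumed ingestion timestamp; substitute your own ordering column.
MERGE INTO dim_customer AS target
USING (
    SELECT customer_id, name, email
    FROM (
        SELECT customer_id, name, email,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY load_ts DESC
               ) AS rn
        FROM staging_customers
    ) ranked
    WHERE rn = 1
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());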

Why This Helps

  • Safe to rerun without duplicates
  • Single statement instead of delete+insert
  • Tracks only actual changes

How to Test

  • Run pipeline twice, verify no duplicates
  • Check row counts and updated_at timestamps (sample check queries below)
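
A couple of quick checks along those lines; the timestamp literal is a placeholder for your second run's start time:

-- 1. After two runs, the business key should still be unique.
SELECT customer_id, COUNT(*) AS row_count
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- 2. A rerun with unchanged source data should touch no rows:
--    nothing should carry an updated_at inside the second run's window.
SELECT COUNT(*) AS rows_touched
FROM dim_customer
WHERE updated_at >= '2024-06-01 02:00:00';  -- placeholder: second run's start time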

When to Use

Dimension table loads, CDC processing, any ETL that may be rerun.

Performance/Security Notes

Back the merge key (customer_id here) with a primary key or unique index: PostgreSQL's ON CONFLICT requires one, and MERGE matching is far cheaper with it. The change-detection predicates keep no-op updates from rewriting unchanged rows on every run. Always include updated_at for auditing, and consider SCD Type 2 if you need historical tracking (a rough sketch follows).
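
For reference, a minimal SCD Type 2 sketch. It assumes dim_customer carries valid_from, valid_to, and is_current columns (those column names are assumptions for illustration): expire the current row when attributes change, then insert a fresh current row.

-- Minimal SCD Type 2 sketch: assumes valid_from, valid_to, is_current
-- columns on dim_customer (column names are illustrative).

-- Step 1: close out current rows whose attributes changed.
UPDATE dim_customer
SET valid_to = CURRENT_TIMESTAMP,
    is_current = 0
WHERE is_current = 1
  AND EXISTS (
      SELECT 1
      FROM staging_customers s
      WHERE s.customer_id = dim_customer.customer_id
        AND (s.name != dim_customer.name OR s.email != dim_customer.email)
  );

-- Step 2: insert a new current row for brand-new customers and for
-- customers whose current row was just expired in Step 1.
INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
SELECT s.customer_id, s.name, s.email, CURRENT_TIMESTAMP, NULL, 1
FROM staging_customers s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id
 AND d.is_current = 1
WHERE d.customer_id IS NULL;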

Try this tip in your next project and share your results in the comments!

