Tips and Tricks #218: Implement Idempotent ETL with Merge Statements

Use MERGE or INSERT ... ON CONFLICT (upsert) to build rerunnable pipelines: applying the same load twice produces the same result instead of duplicate rows.

Code Snippet

-- PostgreSQL: INSERT ON CONFLICT (upsert)
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id) 
DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW()
WHERE dim_customer.name IS DISTINCT FROM EXCLUDED.name   -- NULL-safe change check
   OR dim_customer.email IS DISTINCT FROM EXCLUDED.email;

-- SQL Server / Snowflake: MERGE
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
-- Note: != misses changes to or from NULL; on nullable columns consider a
-- NULL-safe comparison (IS DISTINCT FROM on Snowflake and SQL Server 2022+).
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());
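
If staging_customers can hold more than one row per customer_id (common when the staging table accumulates a CDC feed), deduplicate the source first: PostgreSQL raises an error if ON CONFLICT would update the same target row twice, and MERGE raises a similar error when the source matches the same target row more than once. A minimal sketch, assuming a hypothetical load_ts ingestion-timestamp column for picking the latest row:

-- PostgreSQL: keep only the newest staging row per customer_id, then upsert.
-- load_ts is a hypothetical ingestion timestamp; use whatever ordering column
-- your staging table actually provides.
WITH deduped AS (
    SELECT customer_id, name, email,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY load_ts DESC) AS rn
    FROM staging_customers
)
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM deduped
WHERE rn = 1
ON CONFLICT (customer_id)
DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW();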

Why This Helps

  • Safe to rerun: the same load applied twice produces no duplicate rows
  • A single atomic statement instead of a delete-and-reload
  • Updates rows (and updated_at) only when column values actually change

How to Test

  • Run the pipeline twice against the same source data and verify no duplicate rows appear
  • Check row counts and updated_at timestamps after the second run (see the checks sketched below)
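
A couple of quick verification queries, sketched for PostgreSQL against the tables from the snippet above:

-- 1. The business key should still be unique after a rerun.
SELECT customer_id, COUNT(*) AS copies
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;                                  -- expect zero rows

-- 2. A rerun with unchanged source data should not touch updated_at
--    (the 5-minute window is an arbitrary example; adjust to your schedule).
SELECT COUNT(*) AS recently_touched
FROM dim_customer
WHERE updated_at >= NOW() - INTERVAL '5 minutes';     -- expect 0 on a no-change rerun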

When to Use

Dimension table loads, CDC processing, and any ETL step that may be rerun.
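
For CDC feeds that also carry deletes, the MERGE form can fold the delete into the same statement. A sketch for SQL Server / Snowflake, assuming a hypothetical is_deleted flag on the staging table:

-- is_deleted is a hypothetical soft-delete flag carried by the CDC feed.
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.is_deleted = 1 THEN
    DELETE
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED AND source.is_deleted = 0 THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());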

Performance/Security Notes

Always include updated_at so you can audit when a row last changed. ON CONFLICT requires a unique constraint or index on customer_id, and the MERGE join benefits from the same index. If you need history rather than overwrite-in-place, consider SCD Type 2 (sketched below).
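
If you do need that historical trail, a common SCD Type 2 pattern is to close the current row and insert a new version instead of updating in place. A rough two-statement sketch for PostgreSQL, assuming hypothetical valid_from, valid_to, and is_current columns on dim_customer:

-- 1. Expire the current version of any customer whose attributes changed.
--    valid_from / valid_to / is_current are hypothetical SCD2 columns.
UPDATE dim_customer d
SET valid_to = NOW(),
    is_current = FALSE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND (d.name IS DISTINCT FROM s.name OR d.email IS DISTINCT FROM s.email);

-- 2. Insert a new current version for changed and brand-new customers.
--    Rerunning is safe: customers that already have a current row are skipped.
INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
SELECT s.customer_id, s.name, s.email, NOW(), NULL, TRUE
FROM staging_customers s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id
      AND d.is_current
WHERE d.customer_id IS NULL;

Run both statements in one transaction so a failure between them cannot leave a customer without a current row.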

Try this tip in your next project and share your results in the comments!

