Use MERGE (upsert) for safe, rerunnable data pipelines that handle duplicates gracefully.
Code Snippet
-- PostgreSQL: INSERT ... ON CONFLICT (upsert)
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW()
-- IS DISTINCT FROM treats NULLs as comparable, so a NULL-to-value
-- change still triggers the update (plain != would silently skip it)
WHERE dim_customer.name IS DISTINCT FROM EXCLUDED.name
   OR dim_customer.email IS DISTINCT FROM EXCLUDED.email;
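Note that ON CONFLICT raises a "cannot affect row a second time" error if staging_customers contains the same customer_id more than once in a single statement. A minimal deduplication sketch; the load_ts column used to pick the newest row is an assumption, not part of the schema above:
-- Keep only the newest staging row per customer_id before upserting
-- (load_ts is a hypothetical ingestion-timestamp column)
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT DISTINCT ON (customer_id) customer_id, name, email, NOW()
FROM staging_customers
ORDER BY customer_id, load_ts DESC
ON CONFLICT (customer_id)
DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW();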
-- SQL Server / Snowflake: MERGE
-- (GETDATE() works in both; CURRENT_TIMESTAMP is the portable spelling)
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
-- Note: != skips rows where either side is NULL; use a NULL-safe
-- comparison (IS DISTINCT FROM on Snowflake and SQL Server 2022+)
-- if these columns are nullable
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());
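One SQL Server-specific caveat: MERGE does not serialize concurrent sessions by default, so two pipelines upserting the same keys can both take the NOT MATCHED branch and collide on the primary key. A sketch of the usual mitigation, a HOLDLOCK hint on the target (SQL Server only; not valid Snowflake syntax):
-- SQL Server: serialize access to the target key range so concurrent
-- MERGE statements cannot both insert the same customer_id
MERGE INTO dim_customer WITH (HOLDLOCK) AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, GETDATE());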
Why This Helps
- Idempotent: rerunning the same load produces no duplicate rows
- One atomic statement instead of a fragile delete-then-insert window
- The change-detection predicate touches only rows whose values actually changed, so updated_at stays meaningful
How to Test
- Run the pipeline twice against the same staging data and verify the target row count is unchanged
- Check row counts and updated_at timestamps; updated_at should advance only for rows whose source values changed (see the verification queries below)
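A minimal set of verification queries for the schema above; run them after the second pipeline execution:
-- Should return zero rows: no customer_id appears more than once
SELECT customer_id, COUNT(*) AS n
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Total should match the count recorded after the first run
SELECT COUNT(*) AS total_rows FROM dim_customer;

-- On a rerun with unchanged staging data, MAX(updated_at) should not
-- advance, because the change-detection predicate skipped every row
SELECT MAX(updated_at) AS last_change FROM dim_customer;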
When to Use
Dimension table loads, change data capture (CDC) processing, and any ETL step that may be rerun.
Performance/Security Notes
MERGE performs best with an index on the join key (customer_id here) for both tables; the ON CONFLICT form already requires a unique index on the conflict target. Always include updated_at for auditing. This pattern overwrites in place (SCD Type 1); consider SCD Type 2 if you need to preserve history, as sketched below.
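For reference, a minimal SCD Type 2 sketch in the MERGE dialect above. The version columns (effective_from, effective_to, is_current) are assumptions, not part of the original schema, and dim_customer would need a surrogate key so multiple versions per customer_id can coexist. Because a single MERGE cannot both expire and re-insert the same key, the usual pattern is two statements:
-- Step 1: expire the current version of any changed customer
-- (effective_from/effective_to/is_current are assumed columns)
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND target.is_current = 1
    AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET effective_to = GETDATE(), is_current = 0;

-- Step 2: insert a new current version for new or just-expired customers
INSERT INTO dim_customer (customer_id, name, email, effective_from, effective_to, is_current)
SELECT s.customer_id, s.name, s.email, GETDATE(), NULL, 1
FROM staging_customers AS s
LEFT JOIN dim_customer AS d
    ON d.customer_id = s.customer_id AND d.is_current = 1
WHERE d.customer_id IS NULL;  -- no current row exists: new, or expired in step 1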
Try this tip in your next project and share your results in the comments!