Use MERGE (upsert) for safe, rerunnable data pipelines: a rerun updates existing rows instead of inserting duplicates.
Code Snippet
-- PostgreSQL: INSERT ... ON CONFLICT (upsert)
-- Assumes staging_customers holds at most one row per customer_id;
-- ON CONFLICT DO UPDATE errors if the same key arrives twice in one statement.
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW()
-- IS DISTINCT FROM compares NULLs sanely; plain != would silently skip
-- updates whenever either side is NULL.
WHERE dim_customer.name IS DISTINCT FROM EXCLUDED.name
   OR dim_customer.email IS DISTINCT FROM EXCLUDED.email;
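If staging can deliver several rows per key in one batch, deduplicate the source first; otherwise PostgreSQL aborts with "ON CONFLICT DO UPDATE command cannot affect row a second time". A minimal sketch using DISTINCT ON, where loaded_at is an assumed column (not in the example above) marking the newest staging row:

-- Keep only the newest staging row per customer_id before upserting.
INSERT INTO dim_customer (customer_id, name, email, updated_at)
SELECT DISTINCT ON (customer_id) customer_id, name, email, NOW()
FROM staging_customers
ORDER BY customer_id, loaded_at DESC
ON CONFLICT (customer_id) DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = NOW();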
-- SQL Server / Snowflake: MERGE
-- Both engines raise an error if two source rows match the same target row,
-- so the source must be unique on customer_id (see the deduped sketch below).
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
-- Note: != skips comparisons involving NULL; use IS DISTINCT FROM
-- (Snowflake, SQL Server 2022+) if these columns are nullable.
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, CURRENT_TIMESTAMP);
-- CURRENT_TIMESTAMP works in both dialects; GETDATE() is SQL Server's spelling.
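To satisfy the unique-source requirement when staging may repeat keys, a common pattern is to deduplicate inline with ROW_NUMBER(). A sketch with load_ts as an assumed ordering column (on Snowflake, QUALIFY would shorten the subquery):

MERGE INTO dim_customer AS target
USING (
    -- Keep the newest staging row per customer_id.
    SELECT customer_id, name, email
    FROM (
        SELECT customer_id, name, email,
               ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY load_ts DESC) AS rn
        FROM staging_customers
    ) ranked
    WHERE rn = 1
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.name != source.name OR target.email != source.email) THEN
    UPDATE SET name = source.name, email = source.email, updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (source.customer_id, source.name, source.email, CURRENT_TIMESTAMP);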
Why This Helps
- Idempotent: safe to rerun without creating duplicate rows
- A single atomic statement instead of a fragile delete-then-insert window
- The change-detection predicate writes updated_at only when values actually differ
How to Test
- Run the pipeline twice: row counts should not change on the second pass, and no key should appear more than once (see the check below)
- Check updated_at timestamps: only rows whose values actually changed should carry a new one
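A minimal verification sketch against the tables above. After a rerun with unchanged staging data, the first query should return no rows and the second should return 0 (the interval syntax shown is PostgreSQL):

-- Any output here means the upsert produced duplicate keys.
SELECT customer_id, COUNT(*) AS copies
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Rows touched recently; on a no-op rerun the change-detection predicate
-- should have skipped every row.
SELECT COUNT(*)
FROM dim_customer
WHERE updated_at > NOW() - INTERVAL '5 minutes';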
When to Use
Dimension-table loads, CDC (change data capture) processing, and any ETL job that may be retried or replayed.
Performance/Security Notes
Always include updated_at for auditing. PostgreSQL's ON CONFLICT requires a unique index or constraint on customer_id, which also keeps the match fast; index the MERGE join key too. When you need history rather than overwrite-in-place, consider SCD Type 2, sketched below.
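A minimal SCD Type 2 sketch in PostgreSQL, assuming dim_customer carries three extra columns not shown above (valid_from, valid_to, is_current), and that the single-column unique constraint on customer_id has been replaced by a surrogate key or a unique (customer_id, valid_from) pair, since Type 2 stores several rows per customer:

-- Run both statements in one transaction so a rerun stays consistent.
-- Step 1: close out current rows whose tracked attributes changed.
UPDATE dim_customer AS d
SET valid_to = NOW(), is_current = FALSE
FROM staging_customers AS s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND (d.name IS DISTINCT FROM s.name OR d.email IS DISTINCT FROM s.email);

-- Step 2: insert new current versions for changed and brand-new customers
-- (after step 1, both kinds lack a current row).
INSERT INTO dim_customer (customer_id, name, email, updated_at, valid_from, valid_to, is_current)
SELECT s.customer_id, s.name, s.email, NOW(), NOW(), NULL, TRUE
FROM staging_customers AS s
LEFT JOIN dim_customer AS d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;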