Data Engineering – Page 2 – C4: Container, Code, Cloud & Context

ETL for Vector Embeddings: Preparing Data for RAG

Posted on November 20, 2025 by Nithin Mohan TK · 14 min read

Preparing data for RAG requires specialized ETL pipelines. After building pipelines for 50+ RAG systems, I’ve learned what works. Here’s the complete guide to ETL for vector embeddings.

Data Pipelines for LLM Training: Building Production ETL Systems

Posted on November 8, 2025 by Nithin Mohan TK · 13 min read

Building production ETL pipelines for LLM training is complex. After building pipelines processing 100TB+ of data, I’ve learned what works. Here’s the complete guide to building production data pipelines for LLM training.

Figure 1: LLM Training Data Pipeline Architecture

Why Production ETL Matters for LLM Training

LLM training requires massive amounts of clean, processed data: […]

Modern Python Patterns for Data Engineering: From Async Pipelines to Structural Pattern Matching

Posted on November 5, 2025 by Nithin Mohan TK · 11 min read

Introduction: Modern Python has evolved dramatically with features that transform how we build data engineering systems. This comprehensive guide explores advanced Python patterns including structural pattern matching, async/await for concurrent data processing, dataclasses and Pydantic for robust data validation, and context managers for resource management. After building production data pipelines across multiple organizations, I’ve found […]

Production Data Pipelines with Apache Airflow: From DAG Design to Dynamic Task Generation

Posted on October 28, 2025 by Nithin Mohan TK · 7 min read

After 20 years in enterprise data engineering, I’ve implemented Apache Airflow across healthcare, financial services, and cloud-native architectures. This article shares production-tested patterns for building resilient, scalable data pipelines, from DAG design principles to dynamic task generation strategies that handle thousands of workflows.

1. The Fundamentals: Why Airflow Remains the Standard

Apache Airflow has become the […]

Orchestrating Enterprise Data Pipelines with Google Cloud Composer and Apache Airflow

Posted on May 28, 2025 by Nithin Mohan TK · 9 min read

Production-tested patterns for orchestrating enterprise data pipelines with Google Cloud Composer and Apache Airflow. Includes architecture, code examples, security, and cost optimization strategies.
