Tips and Tricks #133: Use Generators for Memory-Efficient Data Processing

Process large datasets without loading everything into memory using Python generators.

Code Snippet

# Before: Loads the entire file into memory
def process_file_eager(filename):
    with open(filename) as f:
        lines = f.readlines()  # Every line held in memory at once
    return [parse_line(line) for line in lines]  # parse_line() is your own parser

# After: Streams data lazily, one line at a time
def process_file_lazy(filename):
    with open(filename) as f:
        for line in f:
            yield parse_line(line)  # Each record is produced only when requested

# Usage: Memory stays constant regardless of file size
for record in process_file_lazy("huge_file.csv"):
    process_record(record)  # process_record() is your own handler

Why This Helps

  • Constant memory usage regardless of data size
  • Enables processing of files larger than RAM
  • Integrates seamlessly with for loops and itertools (see the sketch after this list)
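
As a quick sketch of the itertools point, the snippet below combines process_file_lazy() from above with islice() and chain(); the extra file names (jan.csv, feb.csv) are made up for illustration.

from itertools import chain, islice

# islice() pulls only the first 10 records; the rest of the file is never read.
first_ten = list(islice(process_file_lazy("huge_file.csv"), 10))

# chain() stitches several lazy streams into one, still one record at a time.
for record in chain(process_file_lazy("jan.csv"), process_file_lazy("feb.csv")):
    process_record(record)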

How to Test

  • Monitor memory with memory_profiler or the standard-library tracemalloc (see the sketch after this list)
  • Compare peak memory of the eager and lazy versions on the same large file
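
One way to run that comparison with nothing but the standard library is tracemalloc; this is a minimal sketch that assumes the two functions from the code snippet above are in scope and that huge_file.csv exists.

import tracemalloc

def peak_mib(run):
    # run is a zero-argument callable; returns the peak memory (MiB) it allocated.
    tracemalloc.start()
    run()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

# Consume each variant fully so the comparison is fair.
eager_peak = peak_mib(lambda: process_file_eager("huge_file.csv"))
lazy_peak = peak_mib(lambda: sum(1 for _ in process_file_lazy("huge_file.csv")))
print(f"eager: {eager_peak:.1f} MiB, lazy: {lazy_peak:.1f} MiB")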

When to Use

ETL pipelines, log processing, any scenario with large sequential data.
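
As an illustration, a small log-processing pipeline can be written as a chain of generators; the stage names, the ERROR filter, and app.log below are all hypothetical.

def read_lines(path):
    # Stream raw lines from a (possibly huge) log file.
    with open(path) as f:
        yield from f

def errors_only(lines):
    # Keep only lines that look like error entries.
    return (line for line in lines if "ERROR" in line)

def timestamps(lines):
    # Take the first whitespace-separated field, assumed to be a timestamp.
    return (line.split(maxsplit=1)[0] for line in lines)

# Each stage pulls one line at a time; no stage buffers the whole file.
for ts in timestamps(errors_only(read_lines("app.log"))):
    print(ts)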

Performance/Security Notes

Generators can only be iterated once; a second pass over an exhausted generator yields nothing. itertools.tee() provides multiple independent iterators, but it buffers every item that one consumer has seen and another has not, so heavily diverging consumers erode the memory savings. Also note that process_file_lazy() keeps the file handle open until the generator is exhausted or explicitly closed.
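
A short demonstration of both points, using a tiny in-memory generator rather than a file:

from itertools import tee

records = (n * n for n in range(5))
print(list(records))  # [0, 1, 4, 9, 16]
print(list(records))  # [] -- already exhausted

# tee() yields independent iterators, but buffers whatever one consumer
# has seen and the other has not.
first, second = tee(n * n for n in range(5))
print(list(first))    # [0, 1, 4, 9, 16]
print(list(second))   # [0, 1, 4, 9, 16]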

Try this tip in your next project and share your results in the comments!

