Tips and Tricks #133: Use Generators for Memory-Efficient Data Processing

Process large datasets without loading everything into memory using Python generators.

Code Snippet

# Before: Loads the entire file into memory
def process_file_eager(filename):
    with open(filename) as f:
        lines = f.readlines()  # Every line held in memory at once
    return [parse_line(line) for line in lines]  # parse_line() is your own parser

# After: Streams data lazily, one line at a time
def process_file_lazy(filename):
    with open(filename) as f:
        for line in f:
            yield parse_line(line)  # Each record is produced only when requested

# Usage: Memory stays constant regardless of file size
for record in process_file_lazy("huge_file.csv"):
    process_record(record)  # process_record() is your own handler

Why This Helps

  • Constant memory usage regardless of data size
  • Enables processing of files larger than RAM
  • Integrates seamlessly with for loops and itertools (see the sketch after this list)
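
As a quick sketch of the itertools point, the snippet below combines process_file_lazy() from above with islice() and chain(); the extra file names (jan.csv, feb.csv) are made up for illustration.

from itertools import chain, islice

# islice() pulls only the first 10 records; the rest of the file is never read.
first_ten = list(islice(process_file_lazy("huge_file.csv"), 10))

# chain() stitches several lazy streams into one, still one record at a time.
for record in chain(process_file_lazy("jan.csv"), process_file_lazy("feb.csv")):
    process_record(record)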

How to Test

  • Monitor memory with memory_profiler or the standard-library tracemalloc (see the sketch after this list)
  • Compare peak memory of the eager and lazy versions on the same large file
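
One way to run that comparison with nothing but the standard library is tracemalloc; this is a minimal sketch that assumes the two functions from the code snippet above are in scope and that huge_file.csv exists.

import tracemalloc

def peak_mib(run):
    # run is a zero-argument callable; returns the peak memory (MiB) it allocated.
    tracemalloc.start()
    run()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

# Consume each variant fully so the comparison is fair.
eager_peak = peak_mib(lambda: process_file_eager("huge_file.csv"))
lazy_peak = peak_mib(lambda: sum(1 for _ in process_file_lazy("huge_file.csv")))
print(f"eager: {eager_peak:.1f} MiB, lazy: {lazy_peak:.1f} MiB")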

When to Use

ETL pipelines, log processing, any scenario with large sequential data.
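
As an illustration, a small log-processing pipeline can be written as a chain of generators; the stage names, the ERROR filter, and app.log below are all hypothetical.

def read_lines(path):
    # Stream raw lines from a (possibly huge) log file.
    with open(path) as f:
        yield from f

def errors_only(lines):
    # Keep only lines that look like error entries.
    return (line for line in lines if "ERROR" in line)

def timestamps(lines):
    # Take the first whitespace-separated field, assumed to be a timestamp.
    return (line.split(maxsplit=1)[0] for line in lines)

# Each stage pulls one line at a time; no stage buffers the whole file.
for ts in timestamps(errors_only(read_lines("app.log"))):
    print(ts)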

Performance/Security Notes

Generators can only be iterated once; a second pass over an exhausted generator yields nothing. itertools.tee() provides multiple independent iterators, but it buffers every item that one consumer has seen and another has not, so heavily diverging consumers erode the memory savings. Also note that process_file_lazy() keeps the file handle open until the generator is exhausted or explicitly closed.
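
A short demonstration of both points, using a tiny in-memory generator rather than a file:

from itertools import tee

records = (n * n for n in range(5))
print(list(records))  # [0, 1, 4, 9, 16]
print(list(records))  # [] -- already exhausted

# tee() yields independent iterators, but buffers whatever one consumer
# has seen and the other has not.
first, second = tee(n * n for n in range(5))
print(list(first))    # [0, 1, 4, 9, 16]
print(list(second))   # [0, 1, 4, 9, 16]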

Try this tip in your next project and share your results in the comments!

