Memory Management During Data Processing

Created: 17 May 2024, Modified: 20 May 2024


Thoughts on Memory Management During Data Processing

Important lessons from managing memory and resources during data processing and programming.

Watch this video for more detail.

Key message:

Why we need to clean up TempDb in a data warehouse
How to optimize data pipeline performance
Remove redundancy when building the code package
Optimization method: Plan-Do-Check-Act


CHANGELOG

[2024-05-18] Initial version
[2024-05-21] Update Video for lightning talk

Key Notes

TempDb Error

During a data migration, pay attention to how temporary data is managed and where it is stored, so that it neither becomes a bottleneck nor runs out of space.
Long-running queries can consume a lot of memory and TempDb space. Optimizing those queries and re-shaping the data (if it is skewed) helps mitigate such issues.

Data Pipeline Error

When processing large volumes of data, work in chunks to avoid running out of memory, prevent memory overflow, and improve overall performance.
Long-running data processing can leave the system in a “System Not Responding” state.
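
As a minimal sketch of chunk processing, the snippet below reads rows lazily and aggregates them one fixed-size chunk at a time, so only a single chunk is held in memory. The helper names `iter_chunks` and `sum_column`, the chunk size, and the `amount` column are hypothetical and chosen only for illustration.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def sum_column(csv_file, column, chunk_size=1000):
    """Sum a numeric column while holding only one chunk in memory."""
    reader = csv.DictReader(csv_file)  # reads rows lazily, one at a time
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
    return total


# Usage with an in-memory file; a real pipeline would pass an open file handle.
sample = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,4.5\n")
print(sum_column(sample, "amount", chunk_size=2))
```

The chunk size is the tuning knob here: smaller chunks lower peak memory, larger chunks reduce per-chunk overhead.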

Compute Error

While practicing on HackerRank, I ran into a test case that failed with a MemoryError (see the example below).
Such failures tend to show up in test cases involving large inputs or complex operations that lead to excessive memory usage.
Example:

```python title="Python Code Example" hl_lines="21 22"
def findNumberSequence(direction):
    '''
    Description: Finding the sequence of numbers placed on the segment in the order of their placement points.
    Param:
        - direction (str): A string of length n where 'L' and 'R' indicate the direction of the turn. Ex: LRLLR
    Return:
        - result (list): An integer list of the sequence of numbers placed on the segment, ordered by position
    Time Complexity: O(n log n) due to sorting the positions
    '''
    n = len(direction)
    segment_start = 0
    segment_end = 2 ** n
    positions = []
    values = []

    # Reject input sizes outside the allowed range; large inputs can lead to excessive memory usage
    if n < 1 or n > 10**5:
        raise ValueError("Input size exceeds the allowed limit")

    for i in range(n):
        center = (segment_start + segment_end) // 2  # Note: Change due to MemoryError
        # center = segment_start + (segment_end - segment_start) // 2
        positions.append(center)
        values.append(i + 1)

        # Keep the half of the segment that the turn points to
        if direction[i] == 'L':
            segment_end = center
        else:
            segment_start = center

    # Sort the numbers by where they were placed along the segment
    combined = list(zip(positions, values))
    combined.sort()

    result = [value for _, value in combined]

    return result
```
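
The memory pressure in this example comes largely from the positions themselves: with `segment_end = 2 ** n`, each stored center can be an integer of up to about n bits, so for n close to 10**5 the `positions` list can grow to roughly a gigabyte. One possible way around this, sketched below under the assumption that only the final order is needed, is to skip the positions entirely. After an 'L' turn every later point lies to the left of the current one, and after an 'R' turn every later point lies to its right. The function name `find_number_sequence_linear` is hypothetical, and this is an alternative sketch rather than the original solution.

```python
def find_number_sequence_linear(direction):
    """Return the left-to-right order of placed numbers without big integers.

    Hypothetical alternative to findNumberSequence: O(n) time, O(n) memory.
    """
    left = []   # numbers that end up left of everything placed after them
    right = []  # numbers that end up right of everything placed after them
    for i, turn in enumerate(direction):
        if turn == 'L':
            # The segment shrinks to its left half, so later points are smaller.
            right.append(i + 1)
        else:
            # The segment shrinks to its right half, so later points are larger.
            left.append(i + 1)
    # Numbers in `right` were appended from rightmost to leftmost; reverse them.
    return left + right[::-1]


print(find_number_sequence_linear("LRLLR"))
```

For the docstring's sample input `LRLLR`, this yields the same order as sorting by position, while storing only small integers.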

Data pipeline breaks when out of memory

Solution Approaches

Use try-catch blocks for robust error handling
Use transactions
Import only the libraries that are actually used
Monitor resource utilization
Define proper test cases that cover runtime performance
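
Several of these approaches can be combined in a few lines. The sketch below is illustrative only and assumes a SQLite database from Python's standard library as a stand-in for a real warehouse. The `events` table and the `load_rows` helper are hypothetical. Note that `tracemalloc` only tracks memory allocated by Python, not by the database engine itself.

```python
import sqlite3
import tracemalloc


def load_rows(conn, rows):
    """Insert rows atomically: either all rows are stored or none are."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.executemany(
                "INSERT INTO events (id, payload) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.Error as exc:
        print(f"Load failed and was rolled back: {exc}")
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

tracemalloc.start()  # start monitoring Python memory allocations
ok = load_rows(conn, [(1, "a"), (2, "b")])
# The duplicate id 2 violates the primary key, so this whole batch is rolled back.
failed = load_rows(conn, [(3, "c"), (2, "dup")])
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(f"rows stored: {row_count}, peak traced memory: {peak / 1024:.1f} KiB")
```

Because the second batch fails as a unit, the table keeps only the rows from the first batch, which is the point of wrapping each load step in a transaction.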