Long Bui Discovering new things. Data x Platform Ops

Memory Management During Data Processing

Thoughts on Memory Management During Data Processing

The importance of managing memory and resources during data processing and programming.

For more context, watch the accompanying video.

Key messages:

  • Why we need to clean up TempDb in a data warehouse
  • How to optimize data pipeline performance
  • Removing redundancy when building a code package
  • Optimization method: Plan-Do-Check-Act


CHANGELOG

  • [2024-05-18] Initial version
  • [2024-05-21] Update Video for lightning talk

Key Notes

TempDb Error

  • During data migration, pay attention to where temporary data is staged and stored, so that it neither becomes a bottleneck nor runs out of space.
  • Long-running queries can consume a lot of memory and TempDb space. Optimizing those queries and re-shaping the data (if it is skewed) helps mitigate such issues.
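
One way to keep temporary storage under control during a migration is to copy rows in small, key-ordered batches, each in its own short transaction. The sketch below illustrates the idea only: SQLite stands in for the warehouse so the code is runnable (it has no TempDb), and `source_table`, `target_table`, `payload`, and `migrate_in_batches` are hypothetical names, not part of the original migration.

```python
import sqlite3


def migrate_in_batches(conn, batch_size=1000):
    """Copy rows from source_table to target_table in primary-key order.

    Each batch is read and written separately, so neither the client nor
    the engine has to hold the whole data set at once.
    """
    last_id = 0
    total = 0
    while True:
        # Keyset pagination: resume after the last id already copied.
        rows = conn.execute(
            "SELECT id, payload FROM source_table "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        with conn:  # commits the batch on success, rolls back on error
            conn.executemany(
                "INSERT INTO target_table (id, payload) VALUES (?, ?)", rows
            )
        last_id = rows[-1][0]
        total += len(rows)
    return total


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE target_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 2501)],
)
conn.commit()

copied = migrate_in_batches(conn, batch_size=1000)
```

The batch size is a tuning knob: smaller batches mean less temporary space per transaction, at the cost of more round trips.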

Data Pipeline Error

  • When processing large data sets, work in chunks to avoid running out of memory, prevent memory overflow, and improve overall performance.
  • Long-running data processing can lead to “System Not Respond” errors.
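
Chunk processing can be sketched with nothing but the standard library. The example below assumes a CSV input with an `amount` column; the column name and the helpers `iter_chunks` and `total_amount` are made up for illustration. Only one chunk of rows is held in memory at a time.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def total_amount(csv_file, chunk_size=10_000):
    """Sum the 'amount' column while holding only one chunk in memory."""
    reader = csv.DictReader(csv_file)
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row["amount"]) for row in chunk)
    return total


# Hypothetical sample input; a real pipeline would pass an open file.
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
result = total_amount(sample, chunk_size=2)  # → 20.0
```

Because `csv.DictReader` reads lazily, the file is never loaded in full; the chunk size bounds the peak number of rows in memory.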

Compute Error

  • While practicing coding on HackerRank, I ran into a test case that raised a MemoryError (see the example below).
  • This tends to happen with test cases involving large inputs or complex operations that lead to excessive memory usage.
  • Example:

```python title="Python Code Example" hl_lines="21 22"
def findNumberSequence(direction):
    '''
    Description: Find the sequence of numbers placed on the segment, in the order of their placement points.
    Param:
        - direction (str): A string of length n where 'L' and 'R' indicate the direction of the turn. Ex: LRLLR
    Return:
        - result (list): An integer list of the numbers placed on the segment, ordered by position
    Time Complexity: O(n log n) due to sorting the positions
    '''
    n = len(direction)
    segment_start = 0
    segment_end = 2 ** n  # an n-bit integer, so its size grows with the input
    positions = []
    values = []

    # Guard against large inputs that could lead to excessive memory usage
    if n < 1 or n > 10**5:
        raise ValueError("Input size exceeds the allowed limit")

    for i in range(n):
        center = (segment_start + segment_end) // 2 # Note: Change due to MemoryError
        # center = segment_start + (segment_end - segment_start) // 2
        positions.append(center)
        values.append(i + 1)

        if direction[i] == 'L':
            segment_end = center
        else:
            segment_start = center

    # Pair each position with its number, then sort by position
    combined = list(zip(positions, values))
    combined.sort()

    result = [value for _, value in combined]

    return result
```
  • The data pipeline breaks when it runs out of memory.
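
For comparison, here is a sketch of an alternative that avoids storing large position values altogether. It is not the approach used in the example above; it rests on the observation that after an 'R' turn every later number lands to the right of the current one, and after an 'L' turn every later number lands to its left. The function name `find_number_sequence_linear` is hypothetical.

```python
def find_number_sequence_linear(direction):
    """Return the numbers 1..n in left-to-right order on the segment.

    Numbers followed by 'R' keep their placement order on the left side;
    numbers followed by 'L' appear in reverse placement order on the right.
    Runs in O(n) time and stores only small integers.
    """
    n = len(direction)
    if n < 1 or n > 10**5:
        raise ValueError("Input size exceeds the allowed limit")

    right_turns = []  # every later number lands to the right of these
    left_turns = []   # every later number lands to the left of these
    for number, turn in enumerate(direction, start=1):
        if turn == 'R':
            right_turns.append(number)
        elif turn == 'L':
            left_turns.append(number)
        else:
            raise ValueError("direction may only contain 'L' and 'R'")

    return right_turns + left_turns[::-1]


print(find_number_sequence_linear("LRLLR"))  # → [2, 5, 4, 3, 1]
```

For "LRLLR" the bisection approach places the numbers at positions 16, 8, 12, 10, 9, which sorts to the same order, so the two versions agree on this input while the sketch never builds integers larger than n.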

Solution Approaches

  • Use try/catch blocks and robust error handling
  • Use transactions
  • Import only the libraries you actually use
  • Monitor resource utilization
  • Define proper test cases that cover runtime performance
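
Several of these approaches can be combined in a small helper. The sketch below uses the standard `tracemalloc` module to record peak Python memory while a function runs, and catches `MemoryError` so the caller can react instead of crashing. The helper `run_with_memory_report`, the workload `build_squares`, and the 50 MiB budget are assumptions chosen for illustration.

```python
import tracemalloc


def run_with_memory_report(func, *args, **kwargs):
    """Run func and return (result, peak_bytes, error).

    error is the MemoryError instance if one was raised, otherwise None.
    """
    tracemalloc.start()
    result, error = None, None
    try:
        result = func(*args, **kwargs)
    except MemoryError as exc:
        error = exc  # let the caller decide how to recover
    finally:
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, peak_bytes, error


def build_squares(count):
    """Hypothetical workload: allocate a list of squares."""
    return [i * i for i in range(count)]


result, peak_bytes, error = run_with_memory_report(build_squares, 100_000)

# A runtime-performance test can then assert against a memory budget.
MEMORY_BUDGET_BYTES = 50 * 1024 * 1024  # assumed budget, tune per workload
assert error is None
assert peak_bytes < MEMORY_BUDGET_BYTES
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so it complements rather than replaces system-level monitoring.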
