Memory Management During Data Processing
Thoughts on Memory Management During Data Processing
Takeaways from managing memory and resources while processing data and programming.
Watch this video for a fuller explanation: Click here
Key messages:
- Why we need to clean up TempDb in a data warehouse
- How to optimize data pipeline performance
- Remove redundancy when building the code package
- Optimization method: Plan-Do-Check-Act
CHANGELOG
- [2024-05-18] Initial version
- [2024-05-21] Update Video for lightning talk
Key Notes
TempDb Error
- During a data migration, pay attention to how temporary data is managed and where it is stored, so that it does not become a bottleneck or run out of space.
- A long-running query can consume a lot of memory and TempDb space. Optimizing running queries and re-shaping the data (if it is skewed) helps mitigate such issues.
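Before re-shaping data, it can help to check whether it is skewed at all. The following is a minimal sketch, assuming the join or partition keys can be sampled into Python; the function name `skew_ratio` and the sample keys are hypothetical and only illustrate the idea.

```python
from collections import Counter


def skew_ratio(keys) -> float:
    """Return the largest key group size divided by the mean group size."""
    counts = Counter(keys)
    if not counts:
        raise ValueError("keys must not be empty")
    mean_size = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean_size


# Hypothetical sample: key "A" dominates the data set
sample_keys = ["A"] * 8 + ["B", "C"]
print(skew_ratio(sample_keys))
```

A ratio close to 1 suggests evenly distributed keys, while a large ratio suggests that a few keys carry most of the rows and may be worth re-shaping.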
Data Pipeline Error
- When processing large data sets, process the data in chunks to avoid running out of memory, prevent memory overflow, and improve overall performance.
- Long-running data processing can leave the system in a "System Not Responding" state.
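Chunk processing can be sketched as follows. This is a minimal illustration using only the standard library, assuming a CSV input with a hypothetical `amount` column; the helper names `iter_chunks` and `total_amount` are made up for this example.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield lists of at most chunk_size rows without loading the whole input."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def total_amount(source, chunk_size: int = 1000) -> float:
    """Sum the hypothetical 'amount' column one chunk at a time."""
    reader = csv.DictReader(source)
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row["amount"]) for row in chunk)
    return total


# In-memory sample standing in for a large file opened with open(path, newline="")
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
print(total_amount(sample, chunk_size=2))
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the size of the input.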
Compute Error
- During coding practice on HackerRank, I found a test case that failed with a MemoryError (see the example below).
- This happens particularly with test cases involving large inputs or complex operations that could lead to excessive memory usage.
- Examples:
```python title="Python Code Example" hl_lines="21 22"
def findNumberSequence(direction):
    '''
    Description: Find the sequence of numbers placed on the segment, in the order of their placement points.
    Param:
        - direction (str): A string of length n where 'L' and 'R' indicate the direction of the turn. Ex: LRLLR
    Return:
        - result (list): An integer list of the numbers placed on the segment, ordered by position
    Time Complexity: O(n log n) due to sorting the positions
    '''
    n = len(direction)
    segment_start = 0
    segment_end = 2 ** n
    positions = []
    values = []

    # Reject inputs outside the allowed size, particularly large inputs that could lead to excessive memory usage
    if n < 1 or n > 10**5:
        raise ValueError("Input size exceeds the allowed limit")

    for i in range(n):
        center = (segment_start + segment_end) // 2  # Note: Change due to MemoryError
        # center = segment_start + (segment_end - segment_start) // 2
        positions.append(center)
        values.append(i + 1)
        if direction[i] == 'L':
            segment_end = center
        else:
            segment_start = center

    # Sort the numbers by their position on the segment
    combined = list(zip(positions, values))
    combined.sort()
    result = [value for _, value in combined]
    return result
```
- A data pipeline breaks when it runs out of memory.
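One plausible source of the MemoryError in `findNumberSequence` is the size of the stored positions: each one is an integer of up to about n bits (the segment ends at `2 ** n`), so n of them can take on the order of n² bits, which for n = 10**5 may exceed a gigabyte. The sketch below is a swapped-in alternative rather than the original approach. It relies on the observation that after an 'L' turn every later number lands to the left of the current one, and after an 'R' turn to the right, so the final order can be filled in from both ends without computing any positions. The name `find_number_sequence_linear` is hypothetical.

```python
def find_number_sequence_linear(direction: str) -> list:
    """Return the left-to-right order of the placed numbers without computing positions."""
    n = len(direction)
    if n < 1 or n > 10**5:
        raise ValueError("Input size exceeds the allowed limit")

    result = [0] * n
    left, right = 0, n - 1
    for i, turn in enumerate(direction):
        if turn == 'L':
            # Every later number lands to the left of number i + 1,
            # so it takes the rightmost free slot.
            result[right] = i + 1
            right -= 1
        else:
            # Every later number lands to the right of number i + 1,
            # so it takes the leftmost free slot.
            result[left] = i + 1
            left += 1
    return result


print(find_number_sequence_linear("LRLLR"))
```

This variant runs in O(n) time and stores only small integers, so its memory use grows linearly with the input length.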
Solution Approaches
- Use try-catch blocks for robust error handling
- Use transactions so that a failed step can be rolled back
- Import only the libraries that are actually used
- Monitor resource utilization
- Define proper test cases that cover runtime performance
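Several of these approaches can be combined in a small sketch. The example below assumes an in-memory SQLite database and a hypothetical `staging` table; it wraps a load in a transaction, handles the database error, and uses the standard `tracemalloc` module to record peak Python memory allocation. The function name `load_rows` is made up for illustration.

```python
import sqlite3
import tracemalloc


def load_rows(conn: sqlite3.Connection, rows) -> bool:
    """Insert all rows in one transaction; roll back everything on failure."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if the block raises.
        with conn:
            conn.executemany("INSERT INTO staging (id, amount) VALUES (?, ?)", rows)
        return True
    except sqlite3.Error as exc:
        print(f"Load failed and was rolled back: {exc}")
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

tracemalloc.start()
ok = load_rows(conn, [(1, 10.5), (2, 4.5)])
failed = load_rows(conn, [(3, 5.0), (3, 7.0)])  # duplicate key triggers a rollback
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

row_count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(f"rows={row_count}, peak traced memory={peak} bytes")
```

Because the second batch fails as a whole, none of its rows remain in the table, which keeps the data consistent after an error.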