Skip to content

Latest commit

 

History

History
712 lines (539 loc) · 21.4 KB

File metadata and controls

712 lines (539 loc) · 21.4 KB

Architecture Guide: Building Memory-Efficient Features

Author: Manus AI
Version: 3.35.1
Last Updated: November 16, 2025

This guide provides architectural patterns and best practices for building features that handle large datasets efficiently, based on lessons learned from fixing the CRM Sync Mapper memory leak.


Table of Contents

  1. Core Principles
  2. The Memory Leak Pattern (Anti-Pattern)
  3. The Streaming Pattern (Recommended)
  4. File Upload Architecture
  5. Data Processing Architecture
  6. API Design Patterns
  7. Testing for Memory Leaks
  8. Checklist for New Features

Core Principles

When building features that handle large datasets, follow these fundamental principles to ensure memory efficiency and browser responsiveness.

Principle 1: Never Load Entire Datasets into Browser Memory

The browser is a constrained environment with limited memory. Loading large datasets (100k+ rows) into JavaScript memory causes performance degradation and crashes. Instead, process data in chunks or stream it directly to the server.

Bad Example:

// ❌ Loads 219k rows × 26 columns = 5.7M cells into memory
const data = Papa.parse(file).data;
const csvString = Papa.unparse(data); // Creates 500MB+ string
await uploadMutation({ csvContent: csvString }); // Sends 1.3GB through tRPC

Good Example:

// ✅ Processes in 10k row chunks, max 200MB in memory
for (let i = 0; i < totalRows; i += 10000) {
  const chunk = data.slice(i, i + 10000);
  await processChunk(chunk);
  await new Promise(resolve => setTimeout(resolve, 0)); // Let browser breathe
}

Principle 2: Use HTTP FormData for Large File Uploads

tRPC and other RPC frameworks serialize data as JSON, which requires loading the entire payload into memory. For files larger than 10MB, use HTTP FormData uploads instead.

Comparison:

Method Max Size Memory Usage Use Case
tRPC JSON ~10MB 3-5x file size Small metadata, API calls
HTTP FormData 2GB+ 1x file size Large files, CSV uploads
Streaming API Unlimited Constant (chunk size) Real-time data, logs

Principle 3: Process Server-Side When Possible

The server has more memory and CPU resources than the browser. When dealing with large datasets, upload raw files to the server and perform transformations server-side.

Architecture Pattern:

Browser                    Server                     Storage
--------                   ------                     -------
Upload raw file    →    Parse & validate    →    Store to S3
Show progress      ←    Stream progress     ←    Background job
Display results    ←    Return metadata     ←    Job completion

The Memory Leak Pattern (Anti-Pattern)

Understanding what not to do is crucial. The CRM Sync Mapper memory leak followed this anti-pattern.

How the Memory Leak Happened

The original implementation loaded all data into browser memory and sent it through tRPC:

// Step 1: Parse entire CSV into memory
Papa.parse(file, {
  complete: (results) => {
    const uploadedFile = {
      data: results.data, // ❌ 219k rows in memory
      columns: results.meta.fields,
      rowCount: results.data.length,
    };
    setOriginalFile(uploadedFile);
  }
});

// Step 2: Convert to CSV string (500MB+)
function convertToCSV(file: UploadedFile): string {
  return Papa.unparse(file.data, { // ❌ Creates massive string
    header: true,
    columns: file.columns,
  });
}

// Step 3: Send through tRPC (1.3GB JSON payload)
const csvContent = convertToCSV(file);
await uploadMutation.mutateAsync({
  fileName: file.name,
  csvContent, // ❌ Entire file as string
});

Why This Fails

The browser allocates memory in this sequence:

  1. File read: 50MB (raw CSV file)
  2. Parse result: 300MB (JavaScript objects for 219k rows)
  3. CSV string: 500MB (Papa.unparse output)
  4. tRPC payload: 650MB (JSON serialization overhead)
  5. Total peak usage: 1.5GB+ → Browser crashes

Warning Signs

Your feature has this anti-pattern if you see:

  • Array/object with length > 10,000 stored in React state
  • Papa.parse() without chunk or preview options
  • tRPC mutations accepting string parameters > 1MB
  • Browser DevTools showing memory usage > 500MB
  • "Page Unresponsive" dialogs during data processing

The Streaming Pattern (Recommended)

The streaming pattern processes data in small chunks, keeping memory usage constant regardless of dataset size.

Chunked Processing

Process large datasets in fixed-size chunks to avoid memory spikes:

async function processLargeDataset(data: any[], chunkSize = 10000) {
  const results: any[] = [];
  
  for (let i = 0; i < data.length; i += chunkSize) {
    // Process chunk
    const chunk = data.slice(i, Math.min(i + chunkSize, data.length));
    const processed = await processChunk(chunk);
    results.push(...processed);
    
    // Report progress
    const progress = ((i + chunk.length) / data.length) * 100;
    onProgress?.(progress);
    
    // Yield to browser (critical!)
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  
  return results;
}

Key Points:

  • Chunk size: 10,000 rows is a good default for tabular data
  • Yield to browser: setTimeout(resolve, 0) allows UI updates between chunks
  • Progress tracking: Report progress to show users the operation is active

Blob-Based File Generation

When generating large files, use Blob instead of concatenating strings:

async function generateLargeCSV(data: any[], columns: string[]) {
  const chunks: string[] = [];
  
  // Add header
  chunks.push(columns.join(',') + '\n');
  
  // Process in chunks
  for (let i = 0; i < data.length; i += 10000) {
    const chunkData = data.slice(i, i + 10000);
    const csvChunk = Papa.unparse(chunkData, { header: false });
    chunks.push(csvChunk + '\n');
    
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  
  // Create Blob (browser optimized)
  return new Blob(chunks, { type: 'text/csv' });
}

Why Blob is Better:

  • Browser handles memory management internally
  • Can be sent directly via FormData
  • Supports streaming uploads
  • No intermediate string concatenation

File Upload Architecture

For features that upload large files, follow this architecture to avoid memory issues.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│ Browser                                                     │
│                                                             │
│  1. User selects file                                       │
│  2. Parse metadata only (first 100 rows)                    │
│  3. Generate CSV as Blob (chunked)                          │
│  4. Upload via FormData (HTTP endpoint)                     │
│  5. Track progress with XMLHttpRequest                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                          │
                          │ FormData (streaming)
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ Server (HTTP Endpoint)                                      │
│                                                             │
│  1. Receive multipart/form-data                             │
│  2. Save to temp file (formidable)                          │
│  3. Upload to S3 (storagePut)                               │
│  4. Return S3 key + metadata                                │
│  5. Clean up temp file                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                          │
                          │ S3 key only
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ Backend Processing (tRPC)                                   │
│                                                             │
│  1. Receive S3 key via tRPC                                 │
│  2. Download file from S3 (streaming)                       │
│  3. Process in chunks                                       │
│  4. Upload results to S3                                    │
│  5. Return result S3 key                                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Implementation Steps

Step 1: Create HTTP Upload Endpoint

Create a dedicated HTTP endpoint that accepts FormData uploads:

// server/_core/fileUploadEndpoint.ts
import formidable from 'formidable';
import { storagePut } from '../storage';

export async function handleFileUpload(req: Request, res: Response) {
  const form = formidable({
    maxFileSize: 2 * 1024 * 1024 * 1024, // 2GB
    keepExtensions: true,
  });

  const [fields, files] = await form.parse(req);
  const file = files.file[0];
  
  // Upload to S3
  const s3Key = `uploads/${Date.now()}-${file.originalFilename}`;
  const buffer = await fs.readFile(file.filepath);
  const result = await storagePut(s3Key, buffer, file.mimetype);
  
  // Clean up temp file
  await fs.unlink(file.filepath);
  
  res.json({
    success: true,
    key: result.key,
    url: result.url,
  });
}

Step 2: Register Endpoint

// server/_core/index.ts
import { handleFileUpload } from './fileUploadEndpoint.js';

app.post('/api/upload/file', handleFileUpload);

Step 3: Client-Side Upload

// client/src/lib/fileUpload.ts
export async function uploadFile(
  file: File,
  onProgress?: (percent: number) => void
): Promise<{ key: string; url: string }> {
  return new Promise((resolve, reject) => {
    const formData = new FormData();
    formData.append('file', file);
    
    const xhr = new XMLHttpRequest();
    
    xhr.upload.addEventListener('progress', (e) => {
      if (e.lengthComputable) {
        const percent = (e.loaded / e.total) * 100;
        onProgress?.(percent);
      }
    });
    
    xhr.addEventListener('load', () => {
      if (xhr.status >= 200 && xhr.status < 300) {
        const response = JSON.parse(xhr.responseText);
        resolve({ key: response.key, url: response.url });
      } else {
        reject(new Error(`Upload failed: ${xhr.statusText}`));
      }
    });
    
    xhr.open('POST', '/api/upload/file');
    xhr.send(formData);
  });
}

Step 4: tRPC Backend Processing

// server/router.ts
export const router = {
  processFile: publicProcedure
    .input(z.object({ s3Key: z.string() }))
    .mutation(async ({ input }) => {
      // Download from S3
      const fileContent = await storageGet(input.s3Key);
      
      // Process in chunks
      const results = await processInChunks(fileContent);
      
      // Upload results to S3
      const resultKey = await uploadResults(results);
      
      return { resultKey };
    }),
};

Data Processing Architecture

When processing large datasets, follow these patterns to ensure efficiency.

Pattern 1: Metadata-First Loading

Only load metadata initially, fetch full data on-demand:

interface FileMetadata {
  id: string;
  name: string;
  rowCount: number;
  columns: string[];
  s3Key: string; // Reference to full data
}

// Load metadata only
async function loadFileMetadata(file: File): Promise<FileMetadata> {
  return new Promise((resolve) => {
    Papa.parse(file, {
      preview: 100, // Only first 100 rows
      header: true,
      complete: (results) => {
        resolve({
          id: generateId(),
          name: file.name,
          rowCount: estimateRowCount(file.size, results.data.length),
          columns: results.meta.fields || [],
          s3Key: '', // Will be set after upload
        });
      },
    });
  });
}

Pattern 2: Lazy Loading Sample Data

Load sample data only when needed for preview/matching:

async function loadSampleData(
  s3Key: string,
  maxRows: number = 100
): Promise<any[]> {
  const response = await fetch(`/api/storage/download?key=${s3Key}`);
  const csvText = await response.text();
  
  return new Promise((resolve) => {
    Papa.parse(csvText, {
      header: true,
      preview: maxRows, // Only parse first N rows
      complete: (results) => resolve(results.data),
    });
  });
}

Pattern 3: Server-Side Batch Processing

For heavy computations, process server-side with job queue:

// Client: Submit job
const jobId = await trpc.jobs.submit.mutate({
  s3Key: fileMetadata.s3Key,
  operation: 'normalize',
});

// Server: Process in background
async function processJob(jobId: number) {
  const job = await getJob(jobId);
  const fileContent = await storageGet(job.s3Key);
  
  // Process in chunks
  const CHUNK_SIZE = 10000;
  for (let i = 0; i < totalRows; i += CHUNK_SIZE) {
    const chunk = await readChunk(fileContent, i, CHUNK_SIZE);
    const processed = await processChunk(chunk);
    await saveChunk(processed);
    
    // Update progress
    await updateJobProgress(jobId, (i / totalRows) * 100);
  }
  
  await markJobComplete(jobId);
}

API Design Patterns

Design APIs that encourage memory-efficient usage.

Pattern 1: Separate Upload and Processing

Don't combine file upload and processing in one endpoint:

// ❌ Bad: Upload and process together
processFile: publicProcedure
  .input(z.object({ csvContent: z.string() })) // Entire file as string!
  .mutation(async ({ input }) => {
    const parsed = Papa.parse(input.csvContent);
    return processData(parsed.data);
  });

// ✅ Good: Separate upload and processing
uploadFile: publicProcedure
  .input(z.object({ s3Key: z.string() })) // Just reference
  .mutation(async ({ input }) => {
    return { s3Key: input.s3Key };
  });

processFile: publicProcedure
  .input(z.object({ s3Key: z.string() }))
  .mutation(async ({ input }) => {
    const content = await storageGet(input.s3Key);
    return processData(content);
  });

Pattern 2: Pagination for Large Result Sets

Return paginated results instead of entire datasets:

listResults: publicProcedure
  .input(z.object({
    jobId: z.number(),
    page: z.number().default(1),
    pageSize: z.number().default(100),
  }))
  .query(async ({ input }) => {
    const offset = (input.page - 1) * input.pageSize;
    const results = await db.results
      .where({ jobId: input.jobId })
      .limit(input.pageSize)
      .offset(offset);
    
    return {
      data: results,
      page: input.page,
      totalPages: Math.ceil(totalCount / input.pageSize),
    };
  });

Pattern 3: Streaming Responses

For real-time updates, use WebSocket or Server-Sent Events:

// Server: Emit progress updates
io.on('connection', (socket) => {
  socket.on('subscribe:job', (jobId) => {
    const interval = setInterval(async () => {
      const progress = await getJobProgress(jobId);
      socket.emit('job:progress', progress);
      
      if (progress.status === 'completed') {
        clearInterval(interval);
      }
    }, 1000);
  });
});

// Client: Listen for updates
useEffect(() => {
  socket.on('job:progress', (progress) => {
    setProgress(progress.percent);
  });
}, []);

Testing for Memory Leaks

Use these techniques to detect memory issues before they reach production.

Chrome DevTools Memory Profiling

Steps:

  1. Open Chrome DevTools → Performance → Memory
  2. Click "Record allocation timeline"
  3. Perform the operation (e.g., upload file)
  4. Stop recording
  5. Analyze memory graph

Red Flags:

  • Memory usage increases linearly with data size
  • Memory doesn't return to baseline after operation
  • Sawtooth pattern (rapid allocation/deallocation)

Heap Snapshot Comparison

Steps:

  1. Take heap snapshot before operation
  2. Perform operation
  3. Take heap snapshot after operation
  4. Compare snapshots to find retained objects

Look for:

  • Large arrays/objects not being garbage collected
  • Closures retaining references to large datasets
  • Event listeners not being cleaned up

Automated Memory Tests

Add memory tests to your test suite:

describe('Memory usage', () => {
  it('should not exceed 500MB when processing 100k rows', async () => {
    const initialMemory = performance.memory.usedJSHeapSize;
    
    await processLargeDataset(generate100kRows());
    
    const finalMemory = performance.memory.usedJSHeapSize;
    const memoryIncrease = (finalMemory - initialMemory) / 1024 / 1024;
    
    expect(memoryIncrease).toBeLessThan(500); // 500MB limit
  });
});

Checklist for New Features

Use this checklist when building features that handle large datasets.

Before Implementation

  • Estimate maximum dataset size (rows × columns)
  • Calculate memory requirements (avg cell size × total cells)
  • Decide: client-side or server-side processing?
  • Choose upload method: tRPC (< 10MB) or HTTP FormData (> 10MB)
  • Design chunking strategy (chunk size, progress tracking)

During Implementation

  • Implement chunked processing (10k rows default)
  • Add setTimeout(0) between chunks to yield to browser
  • Use Blob for large file generation
  • Implement progress tracking (0-100%)
  • Add error handling and retry logic
  • Test with realistic dataset sizes

After Implementation

  • Profile memory usage with Chrome DevTools
  • Verify memory returns to baseline after operation
  • Test with 2x expected dataset size
  • Add automated memory tests
  • Document memory requirements in code comments

Code Review Checklist

  • No arrays/objects with length > 10,000 in React state
  • All file uploads use HTTP FormData (not tRPC for > 10MB)
  • All loops processing > 1,000 items include yield points
  • All tRPC mutations have reasonable input size limits
  • All large result sets are paginated
  • Progress tracking implemented for operations > 5 seconds

Real-World Example: CRM Sync Mapper

Let's walk through how the CRM Sync Mapper was refactored to follow these patterns.

Before: Memory Leak

// ❌ Loaded 219k rows into memory
const uploadedFile = {
  data: results.data, // 5.7M cells
  columns: results.meta.fields,
  rowCount: results.data.length,
};

// ❌ Sent entire dataset through tRPC
const csvContent = Papa.unparse(uploadedFile.data);
await uploadMutation({ csvContent }); // 1.3GB payload

Memory Usage: 1.5GB peak → Browser crash

After: Streaming Architecture

// ✅ Step 1: Generate CSV as Blob (chunked)
async function generateCSVBlob(data: any[], columns: string[]) {
  const chunks: string[] = [columns.join(',') + '\n'];
  
  for (let i = 0; i < data.length; i += 10000) {
    const chunkData = data.slice(i, i + 10000);
    chunks.push(Papa.unparse(chunkData, { header: false }) + '\n');
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  
  return new Blob(chunks, { type: 'text/csv' });
}

// ✅ Step 2: Upload via HTTP FormData
const blob = await generateCSVBlob(data, columns);
const formData = new FormData();
formData.append('file', blob, 'data.csv');

const xhr = new XMLHttpRequest();
xhr.open('POST', '/api/upload/file');
xhr.send(formData);

// ✅ Step 3: Backend receives S3 key only
await trpc.processFile.mutate({ s3Key: uploadedKey });

Memory Usage: 200MB peak → Smooth operation

Key Improvements

Aspect Before After Improvement
Peak memory 1.5GB 200MB 87% reduction
Upload size 1.3GB (JSON) 50MB (FormData) 96% reduction
Browser responsiveness Frozen Smooth 100% improvement
Processing time N/A (crashed) 30 seconds Completion

Summary

Building memory-efficient features requires careful architectural planning. The key principles are:

  1. Never load entire datasets into browser memory - Use chunking and streaming
  2. Use HTTP FormData for large files - Avoid tRPC JSON serialization
  3. Process server-side when possible - Leverage server resources
  4. Test with realistic data sizes - Profile memory usage early

By following these patterns, you can build features that handle millions of rows without performance degradation or browser crashes.

For questions or improvements to this guide, please refer to the MEMORY_LEAK_FIX.md document or consult the development team.


Document Version: 1.0
Last Updated: November 16, 2025
Maintained by: Manus AI