
CRM Merge Backend Architecture Comparison

Date: 2025-11-18
Context: Current CSV-based processing crashes with 200k+ rows due to memory exhaustion
Question: What's the best long-term architecture for scalable CRM data merging?


Executive Summary

After analyzing multiple approaches, DuckDB with incremental processing emerges as the optimal solution for this use case. It combines the best of both worlds: SQL-based joins for correctness and streaming capabilities for memory efficiency.

Recommendation: Hybrid DuckDB approach (Option 3 below)


Option 1: CSV Streaming (Current + Improvements)

Architecture

  • Stream CSV files in 10k row chunks using PapaParse
  • Build hash indexes incrementally in memory (see the sketch below)
  • Match rows chunk-by-chunk against indexes
  • Write output incrementally to S3
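
A minimal sketch of the chunked indexing step, assuming PapaParse's Node streaming mode; the email key and field names are illustrative, and note that PapaParse chunks by byte size, so "10k rows" is approximated by tuning chunkSize:

// Build an in-memory index from one enrichment CSV, chunk by chunk
const fs = require('fs');
const Papa = require('papaparse');

function buildEmailIndex(csvPath) {
  return new Promise((resolve, reject) => {
    const index = new Map(); // email -> enrichment row
    Papa.parse(fs.createReadStream(csvPath), {
      header: true,
      chunk: (results) => {
        for (const row of results.data) {
          if (row.email) index.set(row.email.toLowerCase(), row);
        }
      },
      complete: () => resolve(index),
      error: reject,
    });
  });
}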

Pros

No new dependencies - uses existing PapaParse
Simple deployment - no database setup required
Low operational overhead - stateless processing
Works with existing S3 workflow - direct file processing
Predictable memory usage - controlled by chunk size

Cons

Complex custom logic - manual hash index management, deduplication, conflict resolution
Limited query capabilities - can't do complex joins or aggregations easily
Error recovery is hard - a crash mid-stream forces a restart from the beginning
Testing complexity - harder to verify correctness of streaming logic
Performance ceiling - JavaScript hash lookups slower than native database indexes
Code maintenance burden - custom streaming logic is brittle and hard to debug

Blind Spots

⚠️ Hash collisions - custom composite keys can collide (e.g. naively concatenated fields), producing false matches
⚠️ Memory leaks - JavaScript GC may not release chunks fast enough under load
⚠️ Partial writes - if S3 upload fails mid-stream, corrupted output
⚠️ No transactional guarantees - can't rollback on error
⚠️ Difficult to add features - any new matching logic requires rewriting streaming code

Scalability

  • Max dataset size: ~1M rows (with careful tuning)
  • Memory usage: 200-500MB (depends on chunk size and hash index size)
  • Processing speed: 1,000-5,000 rows/sec
  • Bottleneck: JavaScript hash lookup performance, GC pauses

Long-term Viability

⚠️ Medium - works but requires ongoing maintenance and has performance ceiling


Option 2: DuckDB with Full In-Memory Processing

Architecture

  • Upload all files to S3
  • Download and import into DuckDB in-memory database
  • Use SQL JOINs for matching and merging
  • Export results to CSV and upload to S3

Pros

SQL-based logic - simple, declarative, easy to understand
Proven correctness - database JOIN semantics are well-tested
Fast joins - columnar storage + vectorized execution
Easy to add features - just write SQL queries
Better testing - can validate with SQL queries
Built-in deduplication - GROUP BY, DISTINCT, window functions (example below)
Excellent analytics - can generate match statistics with SQL
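
For example, deduplicating enrichment rows by email while keeping the most recently updated one is a single window-function query (column names are illustrative; EXCLUDE is DuckDB syntax):

SELECT * EXCLUDE (rn)
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM enriched
)
WHERE rn = 1;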

Cons

Memory still limited - in-memory mode still loads full dataset
New dependency - adds DuckDB to stack
Deployment complexity - need DuckDB binary in container

Blind Spots

⚠️ Still crashes with huge datasets - in-memory mode has the same ceiling as the CSV approach
⚠️ Wasted work on failure - a crash mid-query forces a full re-import of all data

Scalability

  • Max dataset size: ~500k-1M rows (in-memory limit)
  • Memory usage: 500MB-2GB (depends on column count)
  • Processing speed: 10,000-50,000 rows/sec (much faster than CSV)
  • Bottleneck: Memory capacity

Long-term Viability

⚠️ Medium - better than CSV but still has memory ceiling


Option 3: DuckDB with Persistent Database + Streaming (RECOMMENDED)

Architecture

  1. Upload Phase: Client uploads all files at once (CRM + enrichments)
  2. Import Phase: Server streams CSVs into DuckDB persistent database file
  3. Processing Phase: Use SQL JOINs on disk-backed tables (spills to disk if needed)
  4. Export Phase: Stream results back to CSV and upload to S3
  5. Cleanup: Delete DuckDB file after job completes

Pros

Near-unlimited dataset size - DuckDB automatically spills to disk when memory fills
Memory efficient - only keeps working set in memory
SQL-based logic - simple, declarative, maintainable
Incremental processing - each phase materializes a table in the database file
Crash recovery - a restarted job can skip phases whose tables already exist
Fast - columnar storage + vectorized execution
Easy to test - can inspect intermediate tables
Built-in optimizations - query planner, indexes, statistics
Future-proof - can add complex analytics without rewriting core logic

Cons

Disk I/O required - needs temp storage (but ephemeral disk is cheap)
New dependency - DuckDB binary (~50MB)
Slightly more complex - need to manage temp database files

Implementation Details

A sketch using the duckdb Node.js bindings; the uploadToS3() helper and the jobData shape are carried over from the existing worker, and three enrichment files are shown for illustration:

const duckdb = require('duckdb');
const fs = require('fs');
const { promisify } = require('util');

async function processCRMMerge(jobData) {
  const dbPath = `/tmp/crm-merge-${jobData.jobId}.duckdb`;
  const db = new duckdb.Database(dbPath);
  const run = promisify(db.run.bind(db));

  try {
    // Reading s3:// URLs requires the httpfs extension; credentials come
    // from the environment or SET s3_* commands
    await run(`INSTALL httpfs`);
    await run(`LOAD httpfs`);

    // Phase 1: Import all files (DuckDB streams the CSVs itself)
    await run(`
      CREATE TABLE original AS
      SELECT * FROM read_csv_auto('${jobData.originalFile.s3Url}')
    `);

    for (const enrichedFile of jobData.enrichedFiles) {
      await run(`
        CREATE TABLE enriched_${enrichedFile.id} AS
        SELECT * FROM read_csv_auto('${enrichedFile.s3Url}')
      `);
    }

    // Phase 2: Consolidate enriched files; aggregating per email keeps the
    // GROUP BY valid and makes conflict resolution explicit
    await run(`
      CREATE TABLE master_enriched AS
      SELECT
        email,
        any_value(phone) AS phone
        -- ... further conflict-resolution logic in SQL
      FROM (
        SELECT
          COALESCE(e1.email, e2.email, e3.email) AS email,
          COALESCE(e1.phone, e2.phone, e3.phone) AS phone
        FROM enriched_1 e1
        FULL OUTER JOIN enriched_2 e2 ON e1.email = e2.email
        FULL OUTER JOIN enriched_3 e3 ON COALESCE(e1.email, e2.email) = e3.email
      )
      WHERE email IS NOT NULL
      GROUP BY email  -- deduplication
    `);

    // Phase 3: Match and merge (single SQL query!)
    await run(`
      CREATE TABLE merged AS
      SELECT
        o.*,
        e.phone AS enriched_phone
      FROM original o
      LEFT JOIN master_enriched e
        ON o.email = e.email
        OR o.phone = e.phone
    `);

    // Phase 4: Export; COPY writes straight to disk without materializing
    // the result set in Node's memory, then the file is uploaded
    const outPath = `/tmp/crm-merge-${jobData.jobId}.csv`;
    await run(`COPY merged TO '${outPath}' (HEADER, DELIMITER ',')`);
    await uploadToS3(fs.createReadStream(outPath), jobData.outputKey);

  } finally {
    await promisify(db.close.bind(db))();
    fs.unlinkSync(dbPath); // Cleanup temp database file
    fs.rmSync(`/tmp/crm-merge-${jobData.jobId}.csv`, { force: true });
  }
}

Blind Spots

⚠️ Disk space - needs temp storage (but can clean up after job)
⚠️ DuckDB version compatibility - need to pin version to avoid breaking changes
⚠️ Binary distribution - need DuckDB CLI or Node.js bindings in container

Scalability

  • Max dataset size: Effectively unlimited (DuckDB handles 100M+ rows on disk)
  • Memory usage: 100-500MB (only working set in memory)
  • Processing speed: 50,000-200,000 rows/sec (columnar + vectorized)
  • Bottleneck: Disk I/O (but much faster than network I/O for S3)

Long-term Viability

Excellent - production-ready, battle-tested, actively maintained


Option 4: PostgreSQL with Temp Tables

Architecture

  • Use existing PostgreSQL database
  • CREATE TEMP TABLE for each file
  • Use SQL JOINs for matching (sketch below)
  • Stream results back to CSV
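
A sketch of the temp-table flow; table and column names are illustrative, and bulk loading would go through COPY ... FROM STDIN via the app's driver:

CREATE TEMP TABLE original (email text, phone text, name text);
CREATE TEMP TABLE enriched (email text, phone text, company text);

-- bulk load, e.g. COPY original FROM STDIN WITH (FORMAT csv, HEADER)

CREATE TEMP TABLE merged AS
SELECT o.*, e.company
FROM original o
LEFT JOIN enriched e
  ON o.email = e.email OR o.phone = e.phone;

COPY merged TO STDOUT WITH (FORMAT csv, HEADER);  -- stream back to the worker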

Pros

No new dependencies - already have PostgreSQL
SQL-based logic - same benefits as DuckDB
Transactional - ACID guarantees
Familiar - team already knows PostgreSQL

Cons

Slower than DuckDB - row-based storage, not optimized for analytics
Connection overhead - network latency to database
Resource contention - shares resources with production app
Temp table cleanup - need to manage temp table lifecycle
Not designed for bulk analytics - PostgreSQL is OLTP, not OLAP

Blind Spots

⚠️ Production impact - heavy analytics queries could slow down app
⚠️ Connection pool exhaustion - long-running jobs tie up connections
⚠️ Vacuum overhead - temp tables create bloat

Scalability

  • Max dataset size: ~1M rows (before impacting production)
  • Memory usage: Depends on shared_buffers config
  • Processing speed: 5,000-20,000 rows/sec (slower than DuckDB)
  • Bottleneck: Network I/O, row-based storage

Long-term Viability

⚠️ Low - not designed for this use case, will cause operational issues


Option 5: Apache Arrow + Parquet

Architecture

  • Convert CSVs to Parquet format
  • Use Arrow for in-memory columnar processing
  • Stream processing with Arrow Flight

Pros

Extremely fast - zero-copy, columnar, vectorized
Efficient serialization - Parquet is highly compressed
Language-agnostic - Arrow has bindings for many languages

Cons

Complex - steep learning curve, lots of boilerplate
No SQL - must write imperative code for joins
Immature ecosystem - fewer examples, harder to debug
Overkill - designed for distributed systems, not single-node jobs

Blind Spots

⚠️ Development time - would take weeks to implement correctly
⚠️ Maintenance burden - complex code, hard to onboard new developers

Scalability

  • Max dataset size: Very large (10M+ rows)
  • Memory usage: Low (columnar + compression)
  • Processing speed: Very fast (100k+ rows/sec)
  • Bottleneck: Development complexity

Long-term Viability

⚠️ Low for this use case - too complex, not worth the effort


Option 6: Serverless Queue (BullMQ + Redis)

Architecture

  • Break job into small tasks (1000 rows each)
  • Queue tasks in Redis (BullMQ)
  • Process tasks in parallel workers (see the sketch below)
  • Aggregate results
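
A rough sketch of the fan-out with BullMQ; the queue name, payload shape, and matching logic are illustrative:

const { Queue, Worker } = require('bullmq');
const connection = { host: 'localhost', port: 6379 };

// Producer: one task per 1000-row slice
const queue = new Queue('crm-merge', { connection });
async function enqueueChunks(chunks) {
  await queue.addBulk(chunks.map((rows, i) => ({
    name: 'merge-chunk',
    data: { chunkIndex: i, rows },
  })));
}

// Consumer: workers match their slice and persist a partial result
new Worker('crm-merge', async (job) => {
  const { chunkIndex, rows } = job.data;
  // ... match rows against shared indexes, write partial output keyed by chunkIndex
}, { connection, concurrency: 4 });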

Pros

Horizontal scaling - can add more workers
Fault tolerance - tasks can retry on failure
Progress tracking - built-in job progress
Rate limiting - can throttle to avoid overload

Cons

Complexity - distributed system complexity (race conditions, ordering, aggregation)
Redis dependency - need Redis for queue
Result aggregation - hard to merge partial results correctly
Ordering issues - output may not preserve row order
Overhead - serialization/deserialization for each task

Blind Spots

⚠️ Partial results - if some tasks fail, how to handle?
⚠️ Duplicate processing - task retries may process same rows twice
⚠️ Memory still an issue - each worker still needs to load data

Scalability

  • Max dataset size: Large (distributed processing)
  • Memory usage: Per-worker (still need to solve memory issue)
  • Processing speed: Depends on worker count
  • Bottleneck: Coordination overhead, result aggregation

Long-term Viability

⚠️ Medium - good for some use cases, but adds operational complexity


Comparison Matrix

| Criteria | CSV Stream | DuckDB In-Mem | DuckDB Persistent | PostgreSQL | Arrow/Parquet | BullMQ Queue |
|---|---|---|---|---|---|---|
| Max Dataset | 1M rows | 1M rows | Effectively unlimited | 1M rows | 10M+ rows | Large |
| Memory Usage | 200-500MB | 500MB-2GB | 100-500MB | Varies | Low | Per-worker |
| Speed | 1-5k rows/sec | 10-50k rows/sec | 50-200k rows/sec | 5-20k rows/sec | 100k+ rows/sec | Varies |
| Complexity | High | Low | Low | Medium | Very High | High |
| Maintenance | High | Low | Low | Medium | Very High | Medium |
| Crash Recovery | ❌ None | ❌ None | ✅ Checkpoint | ⚠️ Partial | ⚠️ Partial | ✅ Retry |
| SQL Support | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
| New Dependencies | None | DuckDB | DuckDB | None | Arrow | Redis |
| Operational Risk | Medium | Low | Low | High | High | Medium |
| Future-Proof | ⚠️ Medium | ⚠️ Medium | ✅ Excellent | ❌ Low | ⚠️ Medium | ⚠️ Medium |

Recommendation: DuckDB Persistent (Option 3)

Why DuckDB?

DuckDB is specifically designed for this exact use case:

  • Embedded OLAP database - no server, no network overhead
  • Columnar storage - optimized for analytics queries (JOINs, aggregations)
  • Automatic spilling - seamlessly uses disk when memory is full (configurable; see below)
  • SQL interface - simple, declarative, easy to maintain
  • Production-ready - widely used across the data ecosystem; MotherDuck builds its commercial cloud service on it
  • Active development - backed by DuckDB Labs, frequent releases
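
The spill behavior can also be capped explicitly so the worker's memory footprint stays predictable (the limit and path below are illustrative):

SET memory_limit = '500MB';           -- cap DuckDB's in-memory working set
SET temp_directory = '/tmp/duckdb';   -- where operators spill when the cap is hit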

Implementation Plan

  1. Phase 1: Single Upload UX (1-2 days)

    • Update UI to accept all files at once (CRM + enrichments)
    • Upload all files to S3 in parallel
    • Create job with all file metadata
  2. Phase 2: DuckDB Integration (2-3 days)

    • Install duckdb npm package
    • Create DuckDB service wrapper
    • Implement CSV import using read_csv_auto()
    • Implement SQL-based matching logic
    • Implement streaming export
  3. Phase 3: Testing & Optimization (1-2 days)

    • Test with 200k row dataset
    • Monitor memory usage
    • Optimize SQL queries
    • Add progress tracking
  4. Phase 4: Production Deployment (1 day)

    • Update Docker container with DuckDB binary
    • Add temp disk cleanup cron job
    • Monitor production metrics
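
For the cleanup job, a single cron entry is enough, assuming job databases and exports live under /tmp with the naming scheme from the sketch above:

# Remove merge databases and exports older than two hours
0 * * * * find /tmp -name 'crm-merge-*' -mmin +120 -delete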

Migration Path

Backward compatibility:

  • Keep existing CSV upload flow for now
  • Add new "Batch Upload" option
  • Gradually migrate users to new flow
  • Deprecate old flow after 3 months

Cost Analysis

Development time: 5-8 days
Ongoing maintenance: Low (SQL is stable, DuckDB is mature)
Infrastructure cost: $0 (uses ephemeral disk, no new services)
Risk: Low (can rollback to old flow if issues)


Addressing User's Specific Questions

Q: "Is DuckDB better for scalability?"

A: Yes, significantly. DuckDB can handle 100M+ rows on disk, while CSV streaming tops out around 1M rows.

Q: "Is it more efficient for long-term scale?"

A: Yes. SQL-based logic is easier to maintain and optimize than custom streaming code. DuckDB's query planner will automatically optimize queries as the codebase evolves.

Q: "What are the blind spots?"

A: Main blind spots:

  1. Disk space management - need to clean up temp files (mitigated with cron job)
  2. DuckDB version compatibility - need to pin version (mitigated with package.json lock)
  3. Binary distribution - need DuckDB in container (mitigated with Dockerfile)

Q: "What big negatives could there be?"

A: Potential negatives:

  1. Learning curve - team needs to learn DuckDB-specific SQL (but it's 95% standard SQL)
  2. Debugging - harder to debug SQL queries than imperative code (mitigated with EXPLAIN ANALYZE)
  3. Vendor lock-in - if DuckDB project dies, need to migrate (unlikely - project is well-funded and growing)
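
For the debugging point, DuckDB's profiler is one keyword away; for example, using the table names from the sketch above:

EXPLAIN ANALYZE
SELECT o.email, e.phone
FROM original o
LEFT JOIN master_enriched e ON o.email = e.email;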

Q: "Is there an even better alternative?"

A: For this specific use case (CRM data merging with 200k+ rows), DuckDB is the best option. The only "better" alternative would be a fully managed service like Snowflake or BigQuery, but that adds significant cost and operational complexity.


Conclusion

Implement DuckDB Persistent (Option 3) for the following reasons:

  1. Solves the immediate problem - no more memory crashes
  2. Future-proof - handles dataset sizes bounded only by available disk
  3. Low complexity - SQL is simpler than streaming logic
  4. Fast - 10-40x faster than CSV streaming
  5. Maintainable - easy to add features, easy to debug
  6. Low risk - can rollback to old flow if needed
  7. No infrastructure changes - just add one npm package

Next Steps:

  1. Get user approval on this approach
  2. Implement Phase 1 (single upload UX)
  3. Implement Phase 2 (DuckDB integration)
  4. Test with 200k row dataset
  5. Deploy to production
  6. Monitor and optimize
