Date: 2025-11-18
Context: Current CSV-based processing crashes with 200k+ rows due to memory exhaustion
Question: What's the best long-term architecture for scalable CRM data merging?
After analyzing multiple approaches, DuckDB with incremental processing emerges as the optimal solution for this use case. It combines the best of both worlds: SQL-based joins for correctness and streaming capabilities for memory efficiency.
Recommendation: Hybrid DuckDB approach (Option 3 below)
Option 1: CSV Streaming (PapaParse)
- Stream CSV files in 10k row chunks using PapaParse
- Build hash indexes incrementally in memory
- Match rows chunk-by-chunk against indexes
- Write output incrementally to S3
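A minimal sketch of this chunked flow, assuming PapaParse's streaming `chunk` callback over a Node readable stream (chunk boundaries follow the stream's buffers rather than an exact 10k row count) and a hypothetical `appendChunkToOutput` sink for the incremental S3 write:

```js
const fs = require('fs');
const Papa = require('papaparse');

// Build a lookup key -> enriched-row index from one parsed enrichment file.
function buildHashIndex(rows, key = 'email') {
  const index = new Map();
  for (const row of rows) {
    if (row[key]) index.set(row[key].trim().toLowerCase(), row);
  }
  return index;
}

// Stream the CRM file and merge it chunk-by-chunk against the in-memory index.
// appendChunkToOutput is a hypothetical sink, e.g. one part of an S3 multipart upload.
function mergeInChunks(crmCsvPath, enrichedIndex, appendChunkToOutput) {
  return new Promise((resolve, reject) => {
    Papa.parse(fs.createReadStream(crmCsvPath), {
      header: true,
      skipEmptyLines: true,
      chunk: ({ data }) => {
        const merged = data.map((row) => {
          const hit = enrichedIndex.get((row.email || '').trim().toLowerCase());
          return hit ? { ...row, ...hit } : row;
        });
        // Header row would be written once up-front by the caller.
        appendChunkToOutput(Papa.unparse(merged, { header: false }));
      },
      complete: () => resolve(),
      error: reject,
    });
  });
}
```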
✅ No new dependencies - uses existing PapaParse
✅ Simple deployment - no database setup required
✅ Low operational overhead - stateless processing
✅ Works with existing S3 workflow - direct file processing
✅ Predictable memory usage - controlled by chunk size
❌ Complex custom logic - manual hash index management, deduplication, conflict resolution
❌ Limited query capabilities - can't do complex joins or aggregations easily
❌ Error recovery is hard - if crash mid-stream, must restart from beginning
❌ Testing complexity - harder to verify correctness of streaming logic
❌ Performance ceiling - JavaScript hash lookups slower than native database indexes
❌ Code maintenance burden - custom streaming logic is brittle and hard to debug
- Max dataset size: ~1M rows (with careful tuning)
- Memory usage: 200-500MB (depends on chunk size and hash index size)
- Processing speed: 1,000-5,000 rows/sec
- Bottleneck: JavaScript hash lookup performance, GC pauses
Option 2: DuckDB In-Memory
- Upload all files to S3
- Download and import into an in-memory DuckDB database
- Use SQL JOINs for matching and merging
- Export results to CSV and upload to S3
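A sketch of the in-memory variant, assuming the callback-style API of the `duckdb` npm package (promisified here), files already downloaded from S3 to local paths, and a placeholder `company` column in the enrichment file; the upload of the exported CSV back to S3 is left out:

```js
const duckdb = require('duckdb');
const { promisify } = require('util');

async function mergeInMemory(originalPath, enrichedPath, outputPath) {
  const db = new duckdb.Database(':memory:'); // whole dataset lives in RAM
  const run = promisify(db.run.bind(db));

  // Import both CSVs with schema inference.
  await run(`CREATE TABLE original AS SELECT * FROM read_csv_auto('${originalPath}')`);
  await run(`CREATE TABLE enriched AS SELECT * FROM read_csv_auto('${enrichedPath}')`);

  // Match, merge, and export entirely in SQL.
  await run(`
    COPY (
      SELECT o.*, e.company
      FROM original o
      LEFT JOIN enriched e ON lower(o.email) = lower(e.email)
    ) TO '${outputPath}' (HEADER, DELIMITER ',')
  `);

  db.close();
}
```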
✅ SQL-based logic - simple, declarative, easy to understand
✅ Proven correctness - database JOIN semantics are well-tested
✅ Fast joins - columnar storage + vectorized execution
✅ Easy to add features - just write SQL queries
✅ Better testing - can validate with SQL queries
✅ Built-in deduplication - GROUP BY, DISTINCT, window functions
✅ Excellent analytics - can generate match statistics with SQL
❌ Memory still limited - in-memory mode still loads full dataset
❌ New dependency - adds DuckDB to stack
❌ Deployment complexity - need DuckDB binary in container
- Max dataset size: ~500k-1M rows (in-memory limit)
- Memory usage: 500MB-2GB (depends on column count)
- Processing speed: 10,000-50,000 rows/sec (much faster than CSV)
- Bottleneck: Memory capacity
Option 3: DuckDB Persistent (Hybrid, Recommended)
- Upload Phase: Client uploads all files at once (CRM + enrichments)
- Import Phase: Server streams CSVs into DuckDB persistent database file
- Processing Phase: Use SQL JOINs on disk-backed tables (spills to disk if needed)
- Export Phase: Stream results back to CSV and upload to S3
- Cleanup: Delete DuckDB file after job completes
✅ Unlimited dataset size - DuckDB automatically spills to disk
✅ Memory efficient - only keeps working set in memory
✅ SQL-based logic - simple, declarative, maintainable
✅ Incremental processing - can checkpoint progress
✅ Crash recovery - can resume from last checkpoint
✅ Fast - columnar storage + vectorized execution
✅ Easy to test - can inspect intermediate tables
✅ Built-in optimizations - query planner, indexes, statistics
✅ Future-proof - can add complex analytics without rewriting core logic
❌ Disk I/O required - needs temp storage (but ephemeral disk is cheap)
❌ New dependency - DuckDB binary (~50MB)
❌ Slightly more complex - need to manage temp database files
```js
// Pseudo-code for the DuckDB approach. The actual `duckdb` npm package exposes
// a callback-style Database/Connection API; a thin promise wrapper (execute,
// stream, close) is assumed here for readability.
async function processCRMMerge(jobData) {
  const dbPath = `/tmp/crm-merge-${jobData.jobId}.duckdb`;
  const db = new DuckDB(dbPath);
  try {
    // Phase 1: Import all files (streaming). Reading straight from S3 requires
    // DuckDB's httpfs extension (INSTALL httpfs; LOAD httpfs) plus credentials.
    await db.execute(`
      CREATE TABLE original AS
      SELECT * FROM read_csv_auto('${jobData.originalFile.s3Url}')
    `);
    for (const enrichedFile of jobData.enrichedFiles) {
      await db.execute(`
        CREATE TABLE enriched_${enrichedFile.id} AS
        SELECT * FROM read_csv_auto('${enrichedFile.s3Url}')
      `);
    }

    // Phase 2: Consolidate enriched files (SQL handles deduplication).
    // Join the third table on the coalesced key so rows present only in
    // enriched_1 still match, and aggregate to one row per email.
    await db.execute(`
      CREATE TABLE master_enriched AS
      SELECT
        COALESCE(e1.email, e2.email, e3.email) AS email,
        ANY_VALUE(COALESCE(e1.phone, e2.phone, e3.phone)) AS phone
        -- ... conflict resolution logic in SQL
      FROM enriched_1 e1
      FULL OUTER JOIN enriched_2 e2 ON e1.email = e2.email
      FULL OUTER JOIN enriched_3 e3 ON COALESCE(e1.email, e2.email) = e3.email
      GROUP BY COALESCE(e1.email, e2.email, e3.email) -- deduplication
    `);

    // Phase 3: Match and merge (single SQL query!)
    await db.execute(`
      CREATE TABLE merged AS
      SELECT
        o.*,
        e.enriched_data
      FROM original o
      LEFT JOIN master_enriched e
        ON o.email = e.email
        OR o.phone = e.phone
    `);

    // Phase 4: Export results (streaming). Alternatively, DuckDB can write the
    // CSV itself: COPY merged TO '/tmp/out.csv' (HEADER, DELIMITER ',').
    const outputStream = db.stream(`SELECT * FROM merged`);
    const csvContent = await streamToCSV(outputStream);
    await uploadToS3(csvContent, jobData.outputKey);
  } finally {
    db.close();
    fs.unlinkSync(dbPath); // Cleanup temp database file
  }
}
```
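The crash-recovery claim relies on the persistent file: each phase's output table survives a process restart, so a rerun can skip phases that already finished. A sketch, assuming the same wrapper also exposes an `all()` method and that the phase bodies above are factored into helpers (`consolidateEnrichedFiles` and `matchAndMerge` are hypothetical names):

```js
// True if a phase's output table is already present in the job's DuckDB file,
// i.e. that phase completed before a previous crash.
async function tableExists(db, name) {
  const rows = await db.all(
    `SELECT 1 FROM information_schema.tables WHERE table_name = '${name}'`
  );
  return rows.length > 0;
}

// Inside processCRMMerge: only re-run phases that never finished.
if (!(await tableExists(db, 'master_enriched'))) {
  await consolidateEnrichedFiles(db); // Phase 2
}
if (!(await tableExists(db, 'merged'))) {
  await matchAndMerge(db); // Phase 3
}
```

For resume to actually help, the temp-file deletion would need to run only on success (or be left to the periodic cleanup job) rather than unconditionally in the finally block.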
- Max dataset size: Effectively unlimited (DuckDB handles 100M+ rows on disk)
- Memory usage: 100-500MB (only working set in memory)
- Processing speed: 50,000-200,000 rows/sec (columnar + vectorized)
- Bottleneck: Disk I/O (but much faster than network I/O for S3)
Maturity: ✅ Excellent - production-ready, battle-tested, actively maintained
Option 4: PostgreSQL Temp Tables
- Use existing PostgreSQL database
- CREATE TEMP TABLE for each file
- Use SQL JOINs for matching
- Stream results back to CSV
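A sketch of the temp-table flow with node-postgres; the column lists are placeholders, and bulk-loading the CSVs (e.g. COPY ... FROM STDIN via pg-copy-streams) is left out:

```js
const { Client } = require('pg');

async function mergeViaPostgres(connectionString) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    // Temp tables are private to this session and dropped when it ends.
    await client.query(`CREATE TEMP TABLE original (email text, phone text, name text)`);
    await client.query(`CREATE TEMP TABLE enriched (email text, phone text, company text)`);
    // ... bulk-load both temp tables here ...

    const { rows } = await client.query(`
      SELECT o.*, e.company
      FROM original o
      LEFT JOIN enriched e ON lower(o.email) = lower(e.email)
    `);
    return rows; // real code would stream this back out as CSV instead of buffering
  } finally {
    await client.end();
  }
}
```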
✅ No new dependencies - already have PostgreSQL
✅ SQL-based logic - same benefits as DuckDB
✅ Transactional - ACID guarantees
✅ Familiar - team already knows PostgreSQL
❌ Slower than DuckDB - row-based storage, not optimized for analytics
❌ Connection overhead - network latency to database
❌ Resource contention - shares resources with production app
❌ Temp table cleanup - need to manage temp table lifecycle
❌ Not designed for bulk analytics - PostgreSQL is OLTP, not OLAP
- Max dataset size: ~1M rows (before impacting production)
- Memory usage: Depends on shared_buffers config
- Processing speed: 5,000-20,000 rows/sec (slower than DuckDB)
- Bottleneck: Network I/O, row-based storage
Option 5: Arrow/Parquet
- Convert CSVs to Parquet format
- Use Arrow for in-memory columnar processing
- Stream processing with Arrow Flight
✅ Extremely fast - zero-copy, columnar, vectorized
✅ Efficient serialization - Parquet is highly compressed
✅ Language-agnostic - Arrow has bindings for many languages
❌ Complex - steep learning curve, lots of boilerplate
❌ No SQL - must write imperative code for joins
❌ Immature ecosystem - fewer examples, harder to debug
❌ Overkill - Arrow Flight targets distributed data services, not single-node batch jobs
- Max dataset size: Very large (10M+ rows)
- Memory usage: Low (columnar + compression)
- Processing speed: Very fast (100k+ rows/sec)
- Bottleneck: Development complexity
Option 6: BullMQ Distributed Queue
- Break job into small tasks (1000 rows each)
- Queue tasks in Redis (BullMQ)
- Process tasks in parallel workers
- Aggregate results
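A sketch of the queue variant with BullMQ, assuming a local Redis connection and a hypothetical `mergeChunk()` that merges one 1,000-row slice:

```js
const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379 };
const queue = new Queue('crm-merge', { connection });

// Producer: split the job into 1,000-row tasks.
async function enqueueMergeTasks(jobId, totalRows) {
  const tasks = [];
  for (let offset = 0; offset < totalRows; offset += 1000) {
    tasks.push({ name: 'chunk', data: { jobId, offset, limit: 1000 } });
  }
  await queue.addBulk(tasks);
}

// Consumers: workers process tasks in parallel; failed tasks are retried by BullMQ.
new Worker(
  'crm-merge',
  async (job) => mergeChunk(job.data), // hypothetical per-chunk merge
  { connection, concurrency: 4 }
);
```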
✅ Horizontal scaling - can add more workers
✅ Fault tolerance - tasks can retry on failure
✅ Progress tracking - built-in job progress
✅ Rate limiting - can throttle to avoid overload
❌ Complexity - distributed system complexity (race conditions, ordering, aggregation)
❌ Redis dependency - need Redis for queue
❌ Result aggregation - hard to merge partial results correctly
❌ Ordering issues - output may not preserve row order
❌ Overhead - serialization/deserialization for each task
- Max dataset size: Large (distributed processing)
- Memory usage: Per-worker (still need to solve memory issue)
- Processing speed: Depends on worker count
- Bottleneck: Coordination overhead, result aggregation
| Criteria | CSV Stream | DuckDB In-Mem | DuckDB Persistent | PostgreSQL | Arrow/Parquet | BullMQ Queue |
|---|---|---|---|---|---|---|
| Max Dataset | 1M rows | 1M rows | Unlimited | 1M rows | 10M+ rows | Large |
| Memory Usage | 200-500MB | 500MB-2GB | 100-500MB | Varies | Low | Per-worker |
| Speed | 1-5k/sec | 10-50k/sec | 50-200k/sec | 5-20k/sec | 100k+/sec | Varies |
| Complexity | High | Low | Low | Medium | Very High | High |
| Maintenance | High | Low | Low | Medium | Very High | Medium |
| Crash Recovery | ❌ None | ❌ None | ✅ Checkpoint | ✅ Retry | | |
| SQL Support | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
| New Dependencies | None | DuckDB | DuckDB | None | Arrow | Redis |
| Operational Risk | Medium | Low | Low | High | High | Medium |
| Future-Proof | ❌ Low | | ✅ Excellent | | | |
DuckDB is specifically designed for this exact use case:
- Embedded OLAP database - no server, no network overhead
- Columnar storage - optimized for analytics queries (JOINs, aggregations)
- Automatic spilling - seamlessly uses disk when memory is full (bounded by the settings sketched after this list)
- SQL interface - simple, declarative, easy to maintain
- Production-ready - battle-tested in managed offerings such as MotherDuck
- Active development - backed by DuckDB Labs, frequent releases
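The spilling behavior can also be bounded explicitly. A sketch using the same assumed promise wrapper as the pseudo-code above; both settings are standard DuckDB options, and the values shown are illustrative:

```js
// Cap DuckDB's working memory and point its spill files at ephemeral disk.
await db.execute(`SET memory_limit = '500MB'`);
await db.execute(`SET temp_directory = '/tmp/duckdb-spill'`);
```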
Implementation plan:
- Phase 1: Single Upload UX (1-2 days)
  - Update UI to accept all files at once (CRM + enrichments)
  - Upload all files to S3 in parallel
  - Create job with all file metadata
- Phase 2: DuckDB Integration (2-3 days)
  - Install the `duckdb` npm package
  - Create DuckDB service wrapper
  - Implement CSV import using `read_csv_auto()`
  - Implement SQL-based matching logic
  - Implement streaming export
- Phase 3: Testing & Optimization (1-2 days)
  - Test with 200k row dataset
  - Monitor memory usage
  - Optimize SQL queries
  - Add progress tracking
- Phase 4: Production Deployment (1 day)
  - Update Docker container with DuckDB binary
  - Add temp disk cleanup cron job (see the sketch after this list)
  - Monitor production metrics
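For the cleanup task, a sketch of a periodic sweep for orphaned job databases; the path pattern matches the pseudo-code above, and the 2-hour threshold is an assumption:

```js
const fs = require('fs');
const path = require('path');

// Delete DuckDB job files (and their .wal sidecars) older than maxAgeMs.
// Run from cron or on an interval inside the worker process.
function cleanupStaleJobFiles(dir = '/tmp', maxAgeMs = 2 * 60 * 60 * 1000) {
  const now = Date.now();
  for (const name of fs.readdirSync(dir)) {
    if (!/^crm-merge-.*\.duckdb(\.wal)?$/.test(name)) continue;
    const fullPath = path.join(dir, name);
    if (now - fs.statSync(fullPath).mtimeMs > maxAgeMs) {
      fs.unlinkSync(fullPath);
    }
  }
}
```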
Backward compatibility:
- Keep existing CSV upload flow for now
- Add new "Batch Upload" option
- Gradually migrate users to new flow
- Deprecate old flow after 3 months
Development time: 5-8 days
Ongoing maintenance: Low (SQL is stable, DuckDB is mature)
Infrastructure cost: $0 (uses ephemeral disk, no new services)
Risk: Low (can roll back to the old flow if issues arise)
Q: Will this scale beyond the immediate 200k-row problem?
A: Yes, significantly. DuckDB can handle 100M+ rows on disk, while CSV streaming tops out around 1M rows.
Q: Is this easier to maintain long-term than custom streaming code?
A: Yes. SQL-based logic is easier to maintain and optimize than custom streaming code, and DuckDB's query planner optimizes the queries automatically, so the merge logic can evolve without hand-tuned performance work.
Q: What are the blind spots of this recommendation?
A: Main blind spots:
- Disk space management - need to clean up temp files (mitigated with cron job)
- DuckDB version compatibility - need to pin the version (mitigated by pinning in package.json and the lockfile)
- Binary distribution - need DuckDB in container (mitigated with Dockerfile)
Q: What are the potential downsides?
A: Potential negatives:
- Learning curve - team needs to learn DuckDB-specific SQL (but it's 95% standard SQL)
- Debugging - harder to debug SQL queries than imperative code (mitigated with EXPLAIN ANALYZE)
- Vendor lock-in - if DuckDB project dies, need to migrate (unlikely - project is well-funded and growing)
Q: Is there a better option we haven't considered?
A: For this specific use case (CRM data merging with 200k+ rows), DuckDB is the best option. The only "better" alternative would be a fully managed service like Snowflake or BigQuery, but that adds significant cost and operational complexity.
Implement DuckDB Persistent (Option 3) for the following reasons:
- ✅ Solves the immediate problem - no more memory crashes
- ✅ Future-proof - handles unlimited dataset sizes
- ✅ Low complexity - SQL is simpler than streaming logic
- ✅ Fast - 10-40x faster than CSV streaming
- ✅ Maintainable - easy to add features, easy to debug
- ✅ Low risk - can rollback to old flow if needed
- ✅ No infrastructure changes - just add one npm package
Next Steps:
- Get user approval on this approach
- Implement Phase 1 (single upload UX)
- Implement Phase 2 (DuckDB integration)
- Test with 200k row dataset
- Deploy to production
- Monitor and optimize