Date: 2025-11-18
Context: Current CSV-based processing crashes with 200k+ rows due to memory exhaustion
Question: What's the best long-term architecture for scalable CRM data merging?
After analyzing multiple approaches, DuckDB with incremental processing emerges as the optimal solution for this use case. It combines the best of both worlds: SQL-based joins for correctness and streaming capabilities for memory efficiency.
Recommendation: Hybrid DuckDB approach (Option 3 below)
Option 1: CSV Streaming (PapaParse)
- Stream CSV files in 10k row chunks using PapaParse
- Build hash indexes incrementally in memory
- Match rows chunk-by-chunk against indexes
- Write output incrementally to S3
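A minimal sketch of this chunked flow, assuming PapaParse's streaming `chunk` callback over a Node readable stream (chunk boundaries follow the stream's buffers rather than an exact 10k row count) and a hypothetical `appendChunkToOutput` sink for the incremental S3 write:

```js
const fs = require('fs');
const Papa = require('papaparse');

// Build a lookup key -> enriched-row index from one parsed enrichment file.
function buildHashIndex(rows, key = 'email') {
  const index = new Map();
  for (const row of rows) {
    if (row[key]) index.set(row[key].trim().toLowerCase(), row);
  }
  return index;
}

// Stream the CRM file and merge it chunk-by-chunk against the in-memory index.
// appendChunkToOutput is a hypothetical sink, e.g. one part of an S3 multipart upload.
function mergeInChunks(crmCsvPath, enrichedIndex, appendChunkToOutput) {
  return new Promise((resolve, reject) => {
    Papa.parse(fs.createReadStream(crmCsvPath), {
      header: true,
      skipEmptyLines: true,
      chunk: ({ data }) => {
        const merged = data.map((row) => {
          const hit = enrichedIndex.get((row.email || '').trim().toLowerCase());
          return hit ? { ...row, ...hit } : row;
        });
        // Header row would be written once up-front by the caller.
        appendChunkToOutput(Papa.unparse(merged, { header: false }));
      },
      complete: () => resolve(),
      error: reject,
    });
  });
}
```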
✅ No new dependencies - uses existing PapaParse
✅ Simple deployment - no database setup required
✅ Low operational overhead - stateless processing
✅ Works with existing S3 workflow - direct file processing
✅ Predictable memory usage - controlled by chunk size
❌ Complex custom logic - manual hash index management, deduplication, conflict resolution
❌ Limited query capabilities - can't do complex joins or aggregations easily
❌ Error recovery is hard - if crash mid-stream, must restart from beginning
❌ Testing complexity - harder to verify correctness of streaming logic
❌ Performance ceiling - JavaScript hash lookups slower than native database indexes
❌ Code maintenance burden - custom streaming logic is brittle and hard to debug
- Max dataset size: ~1M rows (with careful tuning)
- Memory usage: 200-500MB (depends on chunk size and hash index size)
- Processing speed: 1,000-5,000 rows/sec
- Bottleneck: JavaScript hash lookup performance, GC pauses
Option 2: DuckDB In-Memory
- Upload all files to S3
- Download and import into an in-memory DuckDB database
- Use SQL JOINs for matching and merging
- Export results to CSV and upload to S3
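A sketch of the in-memory variant, assuming the callback-style API of the `duckdb` npm package (promisified here), files already downloaded from S3 to local paths, and a placeholder `company` column in the enrichment file; the upload of the exported CSV back to S3 is left out:

```js
const duckdb = require('duckdb');
const { promisify } = require('util');

async function mergeInMemory(originalPath, enrichedPath, outputPath) {
  const db = new duckdb.Database(':memory:'); // whole dataset lives in RAM
  const run = promisify(db.run.bind(db));

  // Import both CSVs with schema inference.
  await run(`CREATE TABLE original AS SELECT * FROM read_csv_auto('${originalPath}')`);
  await run(`CREATE TABLE enriched AS SELECT * FROM read_csv_auto('${enrichedPath}')`);

  // Match, merge, and export entirely in SQL.
  await run(`
    COPY (
      SELECT o.*, e.company
      FROM original o
      LEFT JOIN enriched e ON lower(o.email) = lower(e.email)
    ) TO '${outputPath}' (HEADER, DELIMITER ',')
  `);

  db.close();
}
```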
✅ SQL-based logic - simple, declarative, easy to understand
✅ Proven correctness - database JOIN semantics are well-tested
✅ Fast joins - columnar storage + vectorized execution
✅ Easy to add features - just write SQL queries
✅ Better testing - can validate with SQL queries
✅ Built-in deduplication - GROUP BY, DISTINCT, window functions
✅ Excellent analytics - can generate match statistics with SQL
❌ Memory still limited - in-memory mode still loads full dataset
❌ New dependency - adds DuckDB to stack
❌ Deployment complexity - need DuckDB binary in container
- Max dataset size: ~500k-1M rows (in-memory limit)
- Memory usage: 500MB-2GB (depends on column count)
- Processing speed: 10,000-50,000 rows/sec (much faster than CSV)
- Bottleneck: Memory capacity
Option 3: DuckDB Persistent (Hybrid, Recommended)
- Upload Phase: Client uploads all files at once (CRM + enrichments)
- Import Phase: Server streams CSVs into DuckDB persistent database file
- Processing Phase: Use SQL JOINs on disk-backed tables (spills to disk if needed)
- Export Phase: Stream results back to CSV and upload to S3
- Cleanup: Delete DuckDB file after job completes
✅ Unlimited dataset size - DuckDB automatically spills to disk
✅ Memory efficient - only keeps working set in memory
✅ SQL-based logic - simple, declarative, maintainable
✅ Incremental processing - can checkpoint progress
✅ Crash recovery - can resume from last checkpoint
✅ Fast - columnar storage + vectorized execution
✅ Easy to test - can inspect intermediate tables
✅ Built-in optimizations - query planner, indexes, statistics
✅ Future-proof - can add complex analytics without rewriting core logic
❌ Disk I/O required - needs temp storage (but ephemeral disk is cheap)
❌ New dependency - DuckDB binary (~50MB)
❌ Slightly more complex - need to manage temp database files
```js
// Pseudo-code for the DuckDB approach. The actual `duckdb` npm package exposes
// a callback-style Database/Connection API; a thin promise wrapper (execute,
// stream, close) is assumed here for readability.
async function processCRMMerge(jobData) {
  const dbPath = `/tmp/crm-merge-${jobData.jobId}.duckdb`;
  const db = new DuckDB(dbPath);
  try {
    // Phase 1: Import all files (streaming). Reading straight from S3 requires
    // DuckDB's httpfs extension (INSTALL httpfs; LOAD httpfs) plus credentials.
    await db.execute(`
      CREATE TABLE original AS
      SELECT * FROM read_csv_auto('${jobData.originalFile.s3Url}')
    `);
    for (const enrichedFile of jobData.enrichedFiles) {
      await db.execute(`
        CREATE TABLE enriched_${enrichedFile.id} AS
        SELECT * FROM read_csv_auto('${enrichedFile.s3Url}')
      `);
    }

    // Phase 2: Consolidate enriched files (SQL handles deduplication).
    // Join the third table on the coalesced key so rows present only in
    // enriched_1 still match, and aggregate to one row per email.
    await db.execute(`
      CREATE TABLE master_enriched AS
      SELECT
        COALESCE(e1.email, e2.email, e3.email) AS email,
        ANY_VALUE(COALESCE(e1.phone, e2.phone, e3.phone)) AS phone
        -- ... conflict resolution logic in SQL
      FROM enriched_1 e1
      FULL OUTER JOIN enriched_2 e2 ON e1.email = e2.email
      FULL OUTER JOIN enriched_3 e3 ON COALESCE(e1.email, e2.email) = e3.email
      GROUP BY COALESCE(e1.email, e2.email, e3.email) -- deduplication
    `);

    // Phase 3: Match and merge (single SQL query!)
    await db.execute(`
      CREATE TABLE merged AS
      SELECT
        o.*,
        e.enriched_data
      FROM original o
      LEFT JOIN master_enriched e
        ON o.email = e.email
        OR o.phone = e.phone
    `);

    // Phase 4: Export results (streaming). Alternatively, DuckDB can write the
    // CSV itself: COPY merged TO '/tmp/out.csv' (HEADER, DELIMITER ',').
    const outputStream = db.stream(`SELECT * FROM merged`);
    const csvContent = await streamToCSV(outputStream);
    await uploadToS3(csvContent, jobData.outputKey);
  } finally {
    db.close();
    fs.unlinkSync(dbPath); // Cleanup temp database file
  }
}
```
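The crash-recovery claim relies on the persistent file: each phase's output table survives a process restart, so a rerun can skip phases that already finished. A sketch, assuming the same wrapper also exposes an `all()` method and that the phase bodies above are factored into helpers (`consolidateEnrichedFiles` and `matchAndMerge` are hypothetical names):

```js
// True if a phase's output table is already present in the job's DuckDB file,
// i.e. that phase completed before a previous crash.
async function tableExists(db, name) {
  const rows = await db.all(
    `SELECT 1 FROM information_schema.tables WHERE table_name = '${name}'`
  );
  return rows.length > 0;
}

// Inside processCRMMerge: only re-run phases that never finished.
if (!(await tableExists(db, 'master_enriched'))) {
  await consolidateEnrichedFiles(db); // Phase 2
}
if (!(await tableExists(db, 'merged'))) {
  await matchAndMerge(db); // Phase 3
}
```

For resume to actually help, the temp-file deletion would need to run only on success (or be left to the periodic cleanup job) rather than unconditionally in the finally block.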
- Max dataset size: Effectively unlimited (DuckDB handles 100M+ rows on disk)
- Memory usage: 100-500MB (only working set in memory)
- Processing speed: 50,000-200,000 rows/sec (columnar + vectorized)
- Bottleneck: Disk I/O (but much faster than network I/O for S3)
Maturity: ✅ Excellent - production-ready, battle-tested, actively maintained
Option 4: PostgreSQL Temp Tables
- Use existing PostgreSQL database
- CREATE TEMP TABLE for each file
- Use SQL JOINs for matching
- Stream results back to CSV
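A sketch of the temp-table flow with node-postgres; the column lists are placeholders, and bulk-loading the CSVs (e.g. COPY ... FROM STDIN via pg-copy-streams) is left out:

```js
const { Client } = require('pg');

async function mergeViaPostgres(connectionString) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    // Temp tables are private to this session and dropped when it ends.
    await client.query(`CREATE TEMP TABLE original (email text, phone text, name text)`);
    await client.query(`CREATE TEMP TABLE enriched (email text, phone text, company text)`);
    // ... bulk-load both temp tables here ...

    const { rows } = await client.query(`
      SELECT o.*, e.company
      FROM original o
      LEFT JOIN enriched e ON lower(o.email) = lower(e.email)
    `);
    return rows; // real code would stream this back out as CSV instead of buffering
  } finally {
    await client.end();
  }
}
```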
✅ No new dependencies - already have PostgreSQL
✅ SQL-based logic - same benefits as DuckDB
✅ Transactional - ACID guarantees
✅ Familiar - team already knows PostgreSQL
❌ Slower than DuckDB - row-based storage, not optimized for analytics
❌ Connection overhead - network latency to database
❌ Resource contention - shares resources with production app
❌ Temp table cleanup - need to manage temp table lifecycle
❌ Not designed for bulk analytics - PostgreSQL is OLTP, not OLAP
- Max dataset size: ~1M rows (before impacting production)
- Memory usage: Depends on shared_buffers config
- Processing speed: 5,000-20,000 rows/sec (slower than DuckDB)
- Bottleneck: Network I/O, row-based storage
Option 5: Arrow/Parquet
- Convert CSVs to Parquet format
- Use Arrow for in-memory columnar processing
- Stream processing with Arrow Flight
✅ Extremely fast - zero-copy, columnar, vectorized
✅ Efficient serialization - Parquet is highly compressed
✅ Language-agnostic - Arrow has bindings for many languages
❌ Complex - steep learning curve, lots of boilerplate
❌ No SQL - must write imperative code for joins
❌ Immature ecosystem - fewer examples, harder to debug
❌ Overkill - Arrow Flight targets distributed data services, not single-node batch jobs
- Max dataset size: Very large (10M+ rows)
- Memory usage: Low (columnar + compression)
- Processing speed: Very fast (100k+ rows/sec)
- Bottleneck: Development complexity
Option 6: BullMQ Distributed Queue
- Break job into small tasks (1000 rows each)
- Queue tasks in Redis (BullMQ)
- Process tasks in parallel workers
- Aggregate results
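A sketch of the queue variant with BullMQ, assuming a local Redis connection and a hypothetical `mergeChunk()` that merges one 1,000-row slice:

```js
const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379 };
const queue = new Queue('crm-merge', { connection });

// Producer: split the job into 1,000-row tasks.
async function enqueueMergeTasks(jobId, totalRows) {
  const tasks = [];
  for (let offset = 0; offset < totalRows; offset += 1000) {
    tasks.push({ name: 'chunk', data: { jobId, offset, limit: 1000 } });
  }
  await queue.addBulk(tasks);
}

// Consumers: workers process tasks in parallel; failed tasks are retried by BullMQ.
new Worker(
  'crm-merge',
  async (job) => mergeChunk(job.data), // hypothetical per-chunk merge
  { connection, concurrency: 4 }
);
```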
✅ Horizontal scaling - can add more workers
✅ Fault tolerance - tasks can retry on failure
✅ Progress tracking - built-in job progress
✅ Rate limiting - can throttle to avoid overload
❌ Complexity - distributed system complexity (race conditions, ordering, aggregation)
❌ Redis dependency - need Redis for queue
❌ Result aggregation - hard to merge partial results correctly
❌ Ordering issues - output may not preserve row order
❌ Overhead - serialization/deserialization for each task
- Max dataset size: Large (distributed processing)
- Memory usage: Per-worker (still need to solve memory issue)
- Processing speed: Depends on worker count
- Bottleneck: Coordination overhead, result aggregation
| Criteria | CSV Stream | DuckDB In-Mem | DuckDB Persistent | PostgreSQL | Arrow/Parquet | BullMQ Queue |
|---|---|---|---|---|---|---|
| Max Dataset | 1M rows | 1M rows | Unlimited | 1M rows | 10M+ rows | Large |
| Memory Usage | 200-500MB | 500MB-2GB | 100-500MB | Varies | Low | Per-worker |
| Speed | 1-5k/sec | 10-50k/sec | 50-200k/sec | 5-20k/sec | 100k+/sec | Varies |
| Complexity | High | Low | Low | Medium | Very High | High |
| Maintenance | High | Low | Low | Medium | Very High | Medium |
| Crash Recovery | ❌ None | ❌ None | ✅ Checkpoint | ✅ Retry | | |
| SQL Support | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
| New Dependencies | None | DuckDB | DuckDB | None | Arrow | Redis |
| Operational Risk | Medium | Low | Low | High | High | Medium |
| Future-Proof | ❌ Low | | ✅ Excellent | | | |
DuckDB is specifically designed for this exact use case:
- Embedded OLAP database - no server, no network overhead
- Columnar storage - optimized for analytics queries (JOINs, aggregations)
- Automatic spilling - seamlessly uses disk when memory is full (bounded by the settings sketched after this list)
- SQL interface - simple, declarative, easy to maintain
- Production-ready - battle-tested in managed offerings such as MotherDuck
- Active development - backed by DuckDB Labs, frequent releases
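The spilling behavior can also be bounded explicitly. A sketch using the same assumed promise wrapper as the pseudo-code above; both settings are standard DuckDB options, and the values shown are illustrative:

```js
// Cap DuckDB's working memory and point its spill files at ephemeral disk.
await db.execute(`SET memory_limit = '500MB'`);
await db.execute(`SET temp_directory = '/tmp/duckdb-spill'`);
```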
Implementation plan:
- Phase 1: Single Upload UX (1-2 days)
  - Update UI to accept all files at once (CRM + enrichments)
  - Upload all files to S3 in parallel
  - Create job with all file metadata
- Phase 2: DuckDB Integration (2-3 days)
  - Install the `duckdb` npm package
  - Create DuckDB service wrapper
  - Implement CSV import using `read_csv_auto()`
  - Implement SQL-based matching logic
  - Implement streaming export
- Phase 3: Testing & Optimization (1-2 days)
  - Test with 200k row dataset
  - Monitor memory usage
  - Optimize SQL queries
  - Add progress tracking
- Phase 4: Production Deployment (1 day)
  - Update Docker container with DuckDB binary
  - Add temp disk cleanup cron job (see the sketch after this list)
  - Monitor production metrics
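For the cleanup task, a sketch of a periodic sweep for orphaned job databases; the path pattern matches the pseudo-code above, and the 2-hour threshold is an assumption:

```js
const fs = require('fs');
const path = require('path');

// Delete DuckDB job files (and their .wal sidecars) older than maxAgeMs.
// Run from cron or on an interval inside the worker process.
function cleanupStaleJobFiles(dir = '/tmp', maxAgeMs = 2 * 60 * 60 * 1000) {
  const now = Date.now();
  for (const name of fs.readdirSync(dir)) {
    if (!/^crm-merge-.*\.duckdb(\.wal)?$/.test(name)) continue;
    const fullPath = path.join(dir, name);
    if (now - fs.statSync(fullPath).mtimeMs > maxAgeMs) {
      fs.unlinkSync(fullPath);
    }
  }
}
```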
Backward compatibility:
- Keep existing CSV upload flow for now
- Add new "Batch Upload" option
- Gradually migrate users to new flow
- Deprecate old flow after 3 months
Development time: 5-8 days
Ongoing maintenance: Low (SQL is stable, DuckDB is mature)
Infrastructure cost: $0 (uses ephemeral disk, no new services)
Risk: Low (can roll back to the old flow if issues arise)
Q: Will this scale beyond the immediate 200k-row problem?
A: Yes, significantly. DuckDB can handle 100M+ rows on disk, while CSV streaming tops out around 1M rows.
Q: Is this easier to maintain long-term than custom streaming code?
A: Yes. SQL-based logic is easier to maintain and optimize than custom streaming code, and DuckDB's query planner optimizes the queries automatically, so the merge logic can evolve without hand-tuned performance work.
Q: What are the blind spots of this recommendation?
A: Main blind spots:
- Disk space management - need to clean up temp files (mitigated with cron job)
- DuckDB version compatibility - need to pin the version (mitigated by pinning in package.json and the lockfile)
- Binary distribution - need DuckDB in container (mitigated with Dockerfile)
Q: What are the potential downsides?
A: Potential negatives:
- Learning curve - team needs to learn DuckDB-specific SQL (but it's 95% standard SQL)
- Debugging - harder to debug SQL queries than imperative code (mitigated with EXPLAIN ANALYZE)
- Vendor lock-in - if DuckDB project dies, need to migrate (unlikely - project is well-funded and growing)
Q: Is there a better option we haven't considered?
A: For this specific use case (CRM data merging with 200k+ rows), DuckDB is the best option. The only "better" alternative would be a fully managed service like Snowflake or BigQuery, but that adds significant cost and operational complexity.
Implement DuckDB Persistent (Option 3) for the following reasons:
- ✅ Solves the immediate problem - no more memory crashes
- ✅ Future-proof - handles unlimited dataset sizes
- ✅ Low complexity - SQL is simpler than streaming logic
- ✅ Fast - 10-40x faster than CSV streaming
- ✅ Maintainable - easy to add features, easy to debug
- ✅ Low risk - can rollback to old flow if needed
- ✅ No infrastructure changes - just add one npm package
Next Steps:
- Get user approval on this approach
- Implement Phase 1 (single upload UX)
- Implement Phase 2 (DuckDB integration)
- Test with 200k row dataset
- Deploy to production
- Monitor and optimize