
Add skipDuplicates() scope guard to createDocuments #852

Open
premtsd-code wants to merge 21 commits into main from csv-import-upsert-v2
Conversation

Contributor

@premtsd-code premtsd-code commented Apr 8, 2026

Summary

Adds a Database::skipDuplicates() scope guard so callers can wrap a createDocuments() call and have duplicate-by-id rows silently skipped instead of throwing DuplicateException. Used by CSV import / re-import flows where re-running the same batch should be idempotent rather than fatal.

The shape follows the existing silent(), skipRelationships(), skipValidation() pattern — instance-level scope guard, not a flag parameter — so callers compose naturally:

$database->skipDuplicates(fn () => $database->createDocuments($collection, $batch));

Design

The orchestrator does the work; adapters just need a way to not blow up on duplicates.

Database::createDocuments in skipDuplicates mode:

  1. Intra-batch dedup by tenant-aware id (first occurrence wins, prevents ACL drift)
  2. Pre-fetch existing ids via fetchExistingByIds() — one batched find keyed by tenant
  3. Filter $documents upfront so only new rows reach encode / validate / relationship-create
  4. Each chunk goes to the adapter, which inserts with its native duplicate-tolerant primitive
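The scope-guard mechanics behind the steps above can be sketched roughly as follows. This is a minimal, illustrative stand-in (class and method names simplified), not the actual library implementation:

```php
<?php
// Minimal sketch of the scope-guard pattern the PR reuses from silent(),
// skipRelationships() and skipValidation(): save the flag, run the
// callback, restore the flag even on throw so nested scopes compose.
class Db
{
    private bool $skipDuplicates = false;

    public function skipDuplicates(callable $callback): mixed
    {
        $previous = $this->skipDuplicates;
        $this->skipDuplicates = true;
        try {
            return $callback();
        } finally {
            $this->skipDuplicates = $previous; // restore even if $callback throws
        }
    }

    public function isSkippingDuplicates(): bool
    {
        return $this->skipDuplicates;
    }
}
```

Because the previous value is captured and restored in a finally block, nested `skipDuplicates()` calls and exceptions both leave the flag in a consistent state, which is what the nested-scope test below exercises.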

Implementation per adapter

  • MariaDB / MySQL: INSERT IGNORE INTO
  • PostgreSQL: ON CONFLICT (_uid[, _tenant]) DO NOTHING
  • SQLite: INSERT OR IGNORE INTO
  • MongoDB: upsert + $setOnInsert (atomic, no insertMany transaction abort), wrapped to skip the surrounding transaction so snapshot isolation doesn't cause a WriteConflict
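The per-dialect SQL above can hang off a pair of overridable hooks. A hedged sketch (the hook names mirror the PR's getInsertKeyword/getInsertSuffix, but these classes are simplified stand-ins, not the real adapters):

```php
<?php
// Simplified stand-ins: the base class builds a plain INSERT; each
// dialect overrides one hook to make it duplicate-tolerant.
abstract class SqlAdapterSketch
{
    protected function getInsertKeyword(bool $ignore): string
    {
        return 'INSERT INTO'; // plain insert by default
    }

    protected function getInsertSuffix(bool $ignore): string
    {
        return '';
    }

    public function buildInsert(string $table, bool $ignore): string
    {
        $sql = "{$this->getInsertKeyword($ignore)} {$table} (_uid) VALUES (:uid)";
        $suffix = $this->getInsertSuffix($ignore);
        return $suffix === '' ? $sql : "{$sql} {$suffix}";
    }
}

final class MariaDbSketch extends SqlAdapterSketch
{
    protected function getInsertKeyword(bool $ignore): string
    {
        return $ignore ? 'INSERT IGNORE INTO' : 'INSERT INTO';
    }
}

final class PostgresSketch extends SqlAdapterSketch
{
    protected function getInsertSuffix(bool $ignore): string
    {
        return $ignore ? 'ON CONFLICT (_uid) DO NOTHING' : '';
    }
}

final class SqliteSketch extends SqlAdapterSketch
{
    protected function getInsertKeyword(bool $ignore): string
    {
        return $ignore ? 'INSERT OR IGNORE INTO' : 'INSERT INTO';
    }
}
```

Note that MariaDB's INSERT IGNORE suppresses more error classes than just duplicate keys, which is the concern CodeRabbit raises later in this thread.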

Mirror

Mirror::createDocuments previously forwarded the original input batch to the destination. In skipDuplicates mode that let the destination diverge from the source whenever the source no-op'd a write — and triggered an "All documents must have an sequence if one is set" throw inside the destination's createDocuments whenever a partially-skipped batch arrived (the source assigns a sequence to inserted docs but not to skipped ones).

Mirror now wraps the source's onNext callback to capture which docs the source actually inserted, and forwards only those to the destination. Skipped docs are never mirrored.
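The capture-and-forward idea can be sketched like this; the callable signatures are simplified assumptions (the real Mirror API passes Document objects, not ids):

```php
<?php
// Sketch of the Mirror fix: wrap the source's onNext so only documents
// the source actually inserted are forwarded to the destination.
// $sourceCreate is assumed to invoke $onNext once per inserted id.
function mirrorCreateDocuments(array $batch, callable $sourceCreate, callable $destCreate): int
{
    $inserted = [];
    $capture = function (string $id) use (&$inserted): void {
        $inserted[] = $id; // fires only for docs the source inserted
    };
    $sourceCreate($batch, $capture);
    $destCreate($inserted); // skipped duplicates never reach the destination
    return \count($inserted);
}
```

This only holds if the source truly fires onNext exclusively for inserted rows; the Greptile review below argues that assumption breaks for SQL adapters, which return the full input batch.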

Also in this PR

  • upsertDocuments perf: replaced N×getDocument calls with one batched find keyed by tenant — ~2× speedup for typical batches
  • Helpers extracted: tenantKey(Document) and fetchExistingByIds() deduplicate the tenant-aware grouping logic between createDocuments and upsertDocuments
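The two helpers could look roughly like this. The tuple shape and the $findByIds callable are illustrative assumptions; the real helpers operate on Document objects:

```php
<?php
// Hedged sketch of the extracted helpers: a tenant-aware key, and a
// batched existence lookup (one find per tenant) replacing N individual
// getDocument() calls.
function tenantKey(?int $tenant, string $id): string
{
    return ($tenant ?? '') . ':' . $id;
}

/**
 * @param array<array{0: ?int, 1: string}> $docs        [tenant, id] tuples
 * @param callable(?int, array<string>): array<string> $findByIds
 * @return array<string, true> set of tenant-aware keys that already exist
 */
function fetchExistingByIds(array $docs, callable $findByIds): array
{
    // Group ids per tenant so each tenant needs exactly one find().
    $byTenant = [];
    foreach ($docs as [$tenant, $id]) {
        $byTenant[$tenant ?? -1][0] = $tenant;
        $byTenant[$tenant ?? -1][1][] = $id;
    }

    $existing = [];
    foreach ($byTenant as [$tenant, $ids]) {
        foreach ($findByIds($tenant, $ids) as $foundId) {
            $existing[tenantKey($tenant, $foundId)] = true;
        }
    }
    return $existing;
}
```

Keying by tenant:id rather than bare id is what lets shared-table mode distinguish two documents that legitimately share an id across tenants.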

New tests

End-to-end coverage exercised across all 13 adapter/mode variants (MariaDB / MySQL / Postgres / SQLite / MongoDB × Pool / Mirror / SharedTables / Schemaless):

  • testCreateDocumentsIgnoreDuplicates — mix of existing + new docs
  • testCreateDocumentsIgnoreIntraBatchDuplicates — same id twice in one batch, first occurrence wins, no ACL drift
  • testCreateDocumentsIgnoreAllDuplicates — every doc already exists, nothing inserted
  • testCreateDocumentsSkipDuplicatesEmptyBatch — empty batch is a no-op
  • testCreateDocumentsSkipDuplicatesNestedScope — scope guard saves & restores correctly through nested calls
  • testCreateDocumentsSkipDuplicatesLargeBatch — 300 docs / 50 pre-existing, exercises the chunk loop and verifies pre-existing rows aren't overwritten
  • testCreateDocumentsSkipDuplicatesSecondCallSkipsAll — identical second call preserves the first call's values
  • testCreateDocumentsSkipDuplicatesRelationships — one-to-many relationship + ignored parent must not produce orphan child rows
  • testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination (Mirror) — source-skipped docs are not forwarded to the destination

Test plan

  • All ignore / skipDuplicates tests pass on every adapter variant
  • Mirror regression test verified failing without the fix and passing with it
  • PHPStan level 7 clean
  • No regressions in existing testCreateDocuments* and testUpsert* suites

Summary by CodeRabbit

  • New Features

    • Optional "ignore duplicates" mode for document creation that silently skips existing entries and returns only newly inserted counts
    • Automatic intra-batch deduplication (first occurrence wins) to prevent duplicate writes and ACL drift
    • Batch-aware pre-checks to skip inserts when items already exist; mirror/storage layers honor ignore mode
    • Relationship creation deferred to avoid applying relations for skipped documents
  • Tests

    • Added end-to-end tests for mixed, intra-batch, and all-duplicate scenarios with ignore enabled

Contributor

coderabbitai bot commented Apr 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

📝 Walkthrough

Adds an optional bool $ignore = false across Database.createDocuments, adapters, and mirror to skip duplicate inserts: implements intra-batch deduplication, pre-fetch filtering of existing IDs, adapter-specific INSERT/IGNORE/ON CONFLICT hooks, deferred relationship writes for newly inserted docs, and e2e tests for ignore-mode behavior.

Changes

  • Core adapter interface (src/Database/Adapter.php): Abstract createDocuments signature extended with bool $ignore = false; docblock updated.
  • Adapter delegation / pool (src/Database/Adapter/Pool.php): createDocuments(..., bool $ignore = false) signature added; delegation forwards the new argument via delegate(__FUNCTION__, \func_get_args()).
  • Mongo adapter (src/Database/Adapter/Mongo.php): createDocuments(..., bool $ignore = false) now pre-queries existing _uid values (tenant-aware) and filters out existing docs before insertMany when $ignore is true; preserves transaction context.
  • SQL adapters & helpers (src/Database/Adapter/SQL.php, src/Database/Adapter/SQLite.php, src/Database/Adapter/Postgres.php): SQL insert flow updated to accept $ignore; new protected hooks (getInsertKeyword, getInsertSuffix, getInsertPermissionsSuffix) provide adapter-specific duplicate-handling SQL (e.g., INSERT OR IGNORE, ON CONFLICT DO NOTHING); post-insert reconciliation added where needed.
  • Database orchestration (src/Database/Database.php): createDocuments(..., bool $ignore = false) implements intra-batch dedup (id or tenant:id), chunked pre-fetch of existing IDs, skips relationship creation for pre-existing keys, defers relationship writes for actually-inserted docs, filters chunks before adapter calls, and passes $ignore to adapters. upsertDocumentsWithIncrease() refactored to batch-find existing docs.
  • Mirroring (src/Database/Mirror.php): createDocuments(..., bool $ignore = false) added and forwarded to source/destination (including the preserve-dates path).
  • Tests (tests/e2e/Adapter/Scopes/DocumentTests.php): Added 3 e2e tests verifying createDocuments(..., ignore: true) for mixed new/duplicate, intra-batch duplicates (first occurrence wins), and all-duplicate batches (zero inserts, no emissions).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Database
    participant Adapter
    participant Storage as DB

    rect rgba(200,100,150,0.5)
    Note over Client,Storage: Existing flow (ignore: false)
    Client->>Database: createDocuments(docs, ignore: false)
    Database->>Database: prepare documents, no pre-filter
    Database->>Adapter: createDocuments(docs, ignore: false)
    Adapter->>DB: INSERT INTO ...
    DB-->>Adapter: ✗ duplicate error (may throw)
    Adapter-->>Database: throws DuplicateException
    Database-->>Client: exception propagated
    end
sequenceDiagram
    participant Client
    participant Database
    participant Adapter
    participant Storage as DB

    rect rgba(100,180,120,0.5)
    Note over Client,Storage: New flow (ignore: true)
    Client->>Database: createDocuments(docs, ignore: true)
    Database->>Database: intra-batch dedup by id / tenant:id
    Database->>Adapter: find(existing ids) [chunked]
    Adapter->>DB: SELECT _uid WHERE _uid IN (...)
    DB-->>Adapter: existing ids
    Adapter-->>Database: $preExistingIds
    Database->>Database: filter out pre-existing docs, strip relationships
    Database->>Adapter: createDocuments(filtered_docs, ignore: true)
    Adapter->>DB: INSERT IGNORE / ON CONFLICT DO NOTHING
    DB-->>Adapter: ✓ success (duplicates skipped)
    Adapter-->>Database: inserted rows info
    Database->>Database: create relationships for actually inserted docs
    Database-->>Client: count (new inserts only)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • fogelito
  • abnegate
  • ArnabChatterjee20k

Poem

🐇 I hopped through rows and checked each id,
I nudged the duplicates gently to the side,
With ignore set true, only newcomers stay,
The first one keeps the carrot — the rest hop away.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 69.23%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check ⚠️ Warning: The PR title mentions 'skipDuplicates() scope guard', but the implementation adds an optional $ignore parameter to createDocuments, not a scope guard method. Resolution: update the title to reflect the change, e.g. 'Add $ignore parameter to createDocuments for silent duplicate handling'.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.



greptile-apps bot commented Apr 8, 2026

Greptile Summary

Adds a Database::skipDuplicates() scope guard that threads down through all adapters so createDocuments uses INSERT IGNORE / ON CONFLICT DO NOTHING / upsert-$setOnInsert instead of throwing on duplicate keys. The architecture (save/restore flag, Pool propagation, Postgres/SQLite/Mongo overrides) is clean and consistent with existing scope guards.

  • Mirror divergence for SQL adapters: Mirror::captureOnNext assumes onNext fires only for actually-inserted documents, but the SQL adapter always returns — and therefore fires onNext for — its entire input batch including silently-skipped rows. Skipped source documents are buffered and inserted on the destination, producing the divergence the PR aimed to prevent. The testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination test runs on MariaDB but is missing the assertion that would catch this (destination->getDocument('dup')->isEmpty()).
  • testCreateDocumentsIgnoreIntraBatchDuplicates is listed in the PR description as an added test but is absent from the diff.

Confidence Score: 3/5

  • Not safe to merge — the Mirror skipDuplicates path silently diverges source and destination on SQL adapters, and the test covering it is missing its key assertion.
  • Two P1 findings: (1) Mirror::captureOnNext forwards skipped SQL documents to the destination because the SQL adapter fires onNext for the full input batch; the divergence-prevention test passes only because it omits the assertion that would fail. (2) The consequence is real data inconsistency between source and destination whenever a SQL-backed Mirror is used with skipDuplicates. The adapter-level INSERT IGNORE / ON CONFLICT DO NOTHING mechanics and the Pool/Postgres/SQLite overrides are all correct, so the score is 3 rather than lower.
  • src/Database/Mirror.php and tests/e2e/Adapter/MirrorTest.php require the most attention — the fix and its test are both incomplete for the SQL (MariaDB/Postgres/SQLite) case.

Important Files Changed

  • src/Database/Mirror.php: New skipDuplicates branch captures ALL onNext-fired documents into the destination buffer, but for SQL-backed sources onNext fires for skipped duplicates too, causing divergence; the test is missing the key assertion.
  • src/Database/Database.php: skipDuplicates scope guard, tenantKey helper, and upsertDocuments batch-fetch added cleanly; createDocumentRelationships still runs for all documents including duplicates, which can trigger child updates even for skipped parents.
  • src/Database/Adapter/SQL.php: INSERT IGNORE / ON CONFLICT DO NOTHING wired cleanly via getInsertKeyword/getInsertSuffix hooks; the adapter still returns the full input batch including silently-skipped rows.
  • src/Database/Adapter/Mongo.php: The upsert + $setOnInsert path avoids transaction-aborting insertMany on duplicates; returns the full input batch rather than only inserted documents, same as the SQL adapter.
  • src/Database/Adapter/Pool.php: Correctly propagates the skipDuplicates flag to pinned and non-pinned inner adapters in all three delegation paths.
  • tests/e2e/Adapter/MirrorTest.php: New Mirror divergence test is missing the critical assertion verifying the destination's skipped document is NOT inserted.

Comments Outside Diff (2)

  1. tests/e2e/Adapter/MirrorTest.php, line 572-577 (link)

    P1 Missing assertion lets divergence pass silently

    The test never checks what happened to 'dup' on the destination. For a SQL-backed Mirror (MariaDB here), the SQL adapter's createDocuments returns the full input batch regardless of which rows were actually skipped — so onNext fires for both 'dup' and 'fresh'. Mirror::captureOnNext puts both in the buffer and flushes them to the destination. Since the destination has neither document, it inserts both, ending up with dup(WouldBe) while the source keeps dup(Original) — a silent divergence that this test is supposed to prevent.

    Adding one assertion would expose the failure:

  2. src/Database/Mirror.php, line 460-484 (link)

    P1 captureOnNext captures skipped documents for SQL adapters

    The buffer-and-forward design assumes onNext is called only for documents that the source actually inserted. That holds for MongoDB (upsert path), but not for SQL adapters: SQL::createDocuments always returns $documents (the full input array), so Database::createDocuments calls onNext for every document — including those silently skipped by INSERT IGNORE / ON CONFLICT DO NOTHING.

    In the MariaDB Mirror test, this means:

    1. Source skips 'dup' (already exists) but fires onNext with the batch value (WouldBe).
    2. captureOnNext adds dup(WouldBe) to the buffer.
    3. The flush sends it to the destination, which doesn't have 'dup' and inserts it.
    4. Source has dup(Original); destination has dup(WouldBe) — exactly the divergence the PR aimed to fix.

    The root cause is that the SQL adapter has no reliable way to distinguish inserted from skipped rows post-INSERT IGNORE. A targeted fix here would be to compare each buffered document's ID against the pre-call set of documents known to exist on the source before forwarding to the destination.
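That targeted fix amounts to a set difference before the flush. A hedged sketch (function and parameter names are hypothetical; $preExistingIds would come from the orchestrator's pre-call prefetch):

```php
<?php
// Sketch of Greptile's suggested fix: before flushing the buffer to the
// destination, drop any document whose id already existed on the source
// before the call, since INSERT IGNORE would have skipped it.
function filterActuallyInserted(array $bufferedIds, array $preExistingIds): array
{
    $existing = \array_flip($preExistingIds); // O(1) membership checks
    return \array_values(\array_filter(
        $bufferedIds,
        fn (string $id): bool => !isset($existing[$id])
    ));
}
```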

Reviews (24): Last reviewed commit: "skipDuplicates: drop intra-batch dedup, ..." | Re-trigger Greptile

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/e2e/Adapter/Scopes/DocumentTests.php (1)

7815-7818: Prefer __FUNCTION__ for collection naming consistency.

Using __FUNCTION__ here (as in adjacent tests) improves isolation and avoids accidental name reuse across reruns.

♻️ Suggested tweak
-        $col = 'createDocsIgnoreIntraBatch';
+        $col = __FUNCTION__;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/Adapter/Scopes/DocumentTests.php` around lines 7815 - 7818, The
collection name is hard-coded as 'createDocsIgnoreIntraBatch' which can collide
across test runs; change the $col assignment to use the current function name
(replace the literal with __FUNCTION__) so the calls to
$database->createCollection($col) and $database->createAttribute($col, 'name',
Database::VAR_STRING, 128, true) use a unique, test-local collection name
consistent with adjacent tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/Mongo.php`:
- Around line 1501-1524: The current dedupe prefetch uses only _uid and a single
tenant filter, causing cross-tenant collisions for shared tables; update the
logic that builds existingUids (and the prefetch loop around
$this->client->find/$this->client->getMore) to be tenant-aware by using a
composite key (e.g., "$tenant:$uid") when populating $existingUids, include the
tenant in the $filters returned by $this->getTenantFilters($collection->getId())
per chunk, and prefetch in per-tenant chunks using array_chunk($uidValues,
max(1, $this->maxQueryValues)) (use the chunk size when calling getMore instead
of count($uidValues)); then when filtering $records check the composite key
($record['_tenant'] . ':' . ($record['_uid'] ?? '')) against $existingUids so
duplicate detection respects tenant boundaries.
- Around line 1504-1516: The prefetch block calls $this->client->find() and
$this->client->getMore() without exception mapping; wrap each driver call in a
try/catch that catches \Throwable and rethrows the normalized exception via
processException($e) (same pattern used elsewhere in this class), i.e. surround
the $this->client->find(...) call and the $this->client->getMore(...) call with
try { ... } catch (\Throwable $e) { throw $this->processException($e); } so any
raw driver errors from the $response/$more handling are mapped consistently.

In `@src/Database/Database.php`:
- Around line 7210-7255: The upsertDocumentsWithIncrease() flow builds
$existingDocs keyed as tenant:id when $upsertTenantPerDocument is true, but
later duplicate-input detection still uses only $document->getId(), causing
false DuplicateException across tenants; update the duplicate-input check and
any intermediate maps (e.g., $seenIds/$inputKeys) to use the same composite
lookup key (use the existing $lookupKey logic: $tenant . ':' .
$document->getId() when $upsertTenantPerDocument is true) so the tenant-aware
key is reused consistently across the whole upsert path.

---

Nitpick comments:
In `@tests/e2e/Adapter/Scopes/DocumentTests.php`:
- Around line 7815-7818: The collection name is hard-coded as
'createDocsIgnoreIntraBatch' which can collide across test runs; change the $col
assignment to use the current function name (replace the literal with
__FUNCTION__) so the calls to $database->createCollection($col) and
$database->createAttribute($col, 'name', Database::VAR_STRING, 128, true) use a
unique, test-local collection name consistent with adjacent tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9f1924e-3ed0-43a8-8859-6e6c299aadb5

📥 Commits

Reviewing files that changed from the base of the PR and between 72ee161 and b9c7f4d.

📒 Files selected for processing (9)
  • src/Database/Adapter.php
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/Pool.php
  • src/Database/Adapter/Postgres.php
  • src/Database/Adapter/SQL.php
  • src/Database/Adapter/SQLite.php
  • src/Database/Database.php
  • src/Database/Mirror.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
src/Database/Database.php (2)

5682-5726: ⚠️ Potential issue | 🟠 Major

Don’t let the outer prefetch snapshot suppress writes.

The adapters already do the authoritative duplicate filtering inside the transaction/session, but this code still drops $chunk entries from a pre-transaction find(). If a matching row is deleted after that snapshot but before the insert transaction starts, ignore: true will incorrectly skip a document that should now insert, and $modified / $onNext will be low.

Also applies to: 5805-5815

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5682 - 5726, The current prefetch
logic builds $preExistingIds (when $ignore is true, in the tenantPerDocument and
non-tenant branches using find()/authorization->skip/etc.) and later uses that
snapshot to skip writes, which can wrongly suppress inserts if a matching row
was deleted after the prefetch; remove or stop using this prefetch-based
suppression. Instead, eliminate reliance on $preExistingIds to skip document
inserts/relationship writes (and the related checks that reference it around the
relationship-resolution code and at the later block noted ~5805-5815), and
either rely on the adapters’ authoritative duplicate filtering inside the
transaction/session or re-check existence inside the transaction. Keep the
prefetch only if it’s used for optimizations that do not affect whether an
insert occurs, or remove the prefetch entirely if it’s unused for
non-authoritative optimizations.

5788-5790: ⚠️ Potential issue | 🟠 Major

Use the same tenant-aware key for deferred relationship writes.

In tenant-per-document mode, two inserted documents can legitimately share the same $id. This map is keyed only by $id, so the later payload overwrites the earlier one, and unset() after the first insert prevents the second tenant’s relationship write entirely.

♻️ Proposed fix
                 if (!empty($relationshipData)) {
-                    $deferredRelationships[$document->getId()] = $relationshipData;
+                    $relationshipKey = $tenantPerDocument
+                        ? $document->getTenant() . ':' . $document->getId()
+                        : $document->getId();
+                    $deferredRelationships[$relationshipKey] = $relationshipData;
                 }
@@
                 foreach ($batch as $insertedDoc) {
-                    $docId = $insertedDoc->getId();
-                    if (\array_key_exists($docId, $deferredRelationships)) {
+                    $relationshipKey = $tenantPerDocument
+                        ? $insertedDoc->getTenant() . ':' . $insertedDoc->getId()
+                        : $insertedDoc->getId();
+                    if (\array_key_exists($relationshipKey, $deferredRelationships)) {
                         $relDoc = clone $insertedDoc;
-                        foreach ($deferredRelationships[$docId] as $key => $value) {
+                        foreach ($deferredRelationships[$relationshipKey] as $key => $value) {
                             $relDoc->setAttribute($key, $value);
                         }
                         $this->silent(fn () => $this->createDocumentRelationships($collection, $relDoc));
-                        unset($deferredRelationships[$docId]);
+                        unset($deferredRelationships[$relationshipKey]);
                     }
                 }

Also applies to: 5823-5832

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5788 - 5790, The deferred
relationship map ($deferredRelationships) is keyed only by $document->getId(),
which causes collisions in tenant-per-document mode; change the key to be
tenant-aware (e.g., combine tenant id and document id or use the document's
tenant identifier method) wherever entries are set and removed so each
tenant/document pair is unique—update the code that assigns
$deferredRelationships[$document->getId()] = $relationshipData; and the matching
unset() logic (and the same pattern around the related block at lines referenced
5823-5832) to use the composite tenant-aware key instead of just getId().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2482-2527: The pre-filter SELECT that checks for existing _uid
rows (the block building $sql, preparing via $this->getPDO()->prepare, binding
values and calling $stmt->execute()/fetchAll()) can throw a raw PDOException
because it runs outside the method's main try/catch; wrap the entire
prepare/execute/fetch block in the adapter's processException(...) wrapper so
exceptions are mapped consistently (i.e., call $this->processException(...)
around the closure that prepares, binds, executes and fetches, returning $rows),
keeping the logic inside createDocuments() and referencing the existing $sql,
$binds, $stmt and $rows variables.
- Around line 2665-2713: The reconciliation currently keys timestamps only by
_uid and uses a single :_vfy_tenant bind when $this->sharedTables is true, which
corrupts mixed-tenant batches; update the verification to be tenant-aware by
binding the tenant per document and keying by a composite tenant+uid.
Concretely, in the block that builds $verifyPlaceholders/$verifyBinds and
$verifySql (the code using $verifyPlaceholders, $verifyBinds, $verifySql and
$verifyStmt), include the tenant value for each document (use the document's
tenant when tenant-per-document applies), change the WHERE to match pairs (e.g.
use row-value tuples or explicit (_uid = :uX AND _tenant = :tX) clauses), and
then build $actualTimestamps keyed by "$tenant|$uid" (and compare against
$expectedTimestamps keyed the same way) so the subsequent loop that populates
$insertedDocs/$insertedUids only matches rows with the correct tenant+uid.
- Around line 2650-2652: Replace the use of "INSERT IGNORE" (returned by
getInsertKeyword) with a plain "INSERT" and remove any IGNORE-specific behavior
in getInsertSuffix so that constraint violations are not globally suppressed;
instead let the DB raise errors and update the insert/execute path that catches
SQL exceptions to inspect the database error/constraint name (e.g., index or key
name such as _uid) and map errors to DuplicateException only when the violated
constraint corresponds to the _uid duplicate, otherwise map to UniqueException
(or other specific exceptions) for PRIMARY/UNIQUE/NOT NULL/truncation
violations—use the existing functions getInsertKeyword and getInsertSuffix to
locate changes and adjust the exception-mapping in the code that performs the
insert and error handling.

---

Duplicate comments:
In `@src/Database/Database.php`:
- Around line 5682-5726: The current prefetch logic builds $preExistingIds (when
$ignore is true, in the tenantPerDocument and non-tenant branches using
find()/authorization->skip/etc.) and later uses that snapshot to skip writes,
which can wrongly suppress inserts if a matching row was deleted after the
prefetch; remove or stop using this prefetch-based suppression. Instead,
eliminate reliance on $preExistingIds to skip document inserts/relationship
writes (and the related checks that reference it around the
relationship-resolution code and at the later block noted ~5805-5815), and
either rely on the adapters’ authoritative duplicate filtering inside the
transaction/session or re-check existence inside the transaction. Keep the
prefetch only if it’s used for optimizations that do not affect whether an
insert occurs, or remove the prefetch entirely if it’s unused for
non-authoritative optimizations.
- Around line 5788-5790: The deferred relationship map ($deferredRelationships)
is keyed only by $document->getId(), which causes collisions in
tenant-per-document mode; change the key to be tenant-aware (e.g., combine
tenant id and document id or use the document's tenant identifier method)
wherever entries are set and removed so each tenant/document pair is
unique—update the code that assigns $deferredRelationships[$document->getId()] =
$relationshipData; and the matching unset() logic (and the same pattern around
the related block at lines referenced 5823-5832) to use the composite
tenant-aware key instead of just getId().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d2ab610b-bb64-4dcb-979c-f3880d9e99df

📥 Commits

Reviewing files that changed from the base of the PR and between 20846d3 and 242c9a8.

📒 Files selected for processing (3)
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/SQL.php
  • src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Adapter/Mongo.php

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
src/Database/Adapter/SQL.php (3)

2482-2527: ⚠️ Potential issue | 🟠 Major

Wrap the ignore-mode pre-filter in processException() handling.

This query runs before the method’s main try/catch, so SQL or connection failures here still escape as raw PDOException while the rest of createDocuments() is adapter-mapped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2482 - 2527, The pre-filter block
in createDocuments() that prepares and executes the SELECT (using getSQLTable(),
getPDO(), getPDOType(), preparing $sql and executing $stmt) should be wrapped
inside a call to processException() so any PDOException is adapter-mapped
instead of escaping; move the existing logic that builds $placeholders/$binds,
$tenantFilter, $sql, $stmt->bindValue(), $stmt->execute(), $rows =
$stmt->fetchAll() and $stmt->closeCursor() into a closure passed to
processException(), and return/assign the result from that closure (or rethrow
via processException) so behavior matches the rest of createDocuments().

2782-2785: ⚠️ Potential issue | 🟠 Major

Don’t use INSERT IGNORE as the base implementation for ignore=true.

For adapters that inherit this default, ignore=true stops meaning “skip duplicate document IDs” and starts suppressing a broader class of insert failures on both the document and permission insert paths. The base hook should stay plain INSERT INTO, and adapter-specific duplicate handling should be expressed with targeted conflict clauses or constraint-aware exception mapping.

Does MariaDB/MySQL `INSERT IGNORE` suppress constraint violations beyond duplicate-key conflicts, such as PRIMARY KEY, other UNIQUE, NOT NULL, truncation, or conversion errors?

Based on learnings, in src/Database/Adapter/MariaDB.php only duplicate _uid violations should throw DuplicateException; other unique or PRIMARY collisions should still throw UniqueException.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2782 - 2785, The base method
getInsertKeyword in class SQL should not return 'INSERT IGNORE INTO' for
ignore=true; change it to always return the plain 'INSERT INTO' and move any
ignore semantics into adapter-specific code (e.g., Adapter/MariaDB.php) by using
targeted conflict clauses or by mapping constraint/SQLSTATE errors to the
correct exceptions; specifically, in MariaDB adapter ensure only duplicate
violations on the document _uid map to DuplicateException while other
UNIQUE/PRIMARY/constraint failures map to UniqueException (and do not rely on
INSERT IGNORE in SQL::getInsertKeyword).

2664-2728: ⚠️ Potential issue | 🔴 Critical

Keep expectedTimestamps tenant-aware too.

$actualTimestamps is keyed by tenant:uid now, but $expectedTimestamps is still keyed only by _uid. In tenantPerDocument mode, two documents with the same _uid in different tenants can overwrite each other here, causing an inserted document to be dropped from $documents and miss its permission insert.

💡 Proposed fix
-                $expectedTimestamps = [];
-                foreach ($documents as $doc) {
-                    $expectedTimestamps[$doc->getId()] = $doc->getCreatedAt();
-                }
+                $expectedTimestamps = [];
+                foreach ($documents as $doc) {
+                    $key = ($this->sharedTables && $this->tenantPerDocument)
+                        ? $doc->getTenant() . ':' . $doc->getId()
+                        : $doc->getId();
+                    $expectedTimestamps[$key] = $doc->getCreatedAt();
+                }
@@
-                    if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$doc->getId()]) {
+                    if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$key]) {
                         $insertedDocs[] = $doc;
                     }
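A minimal sketch of the collision (plain Python dicts standing in for the timestamp maps; field names are illustrative):

```python
# Two tenants insert documents with the same _uid in a shared table.
docs = [
    {"tenant": 1, "uid": "doc_a", "created_at": "2026-04-08T10:00:00"},
    {"tenant": 2, "uid": "doc_a", "created_at": "2026-04-08T10:00:01"},
]

# Keyed by uid only: tenant 1's entry is overwritten by tenant 2's.
by_uid = {d["uid"]: d["created_at"] for d in docs}

# Keyed by tenant:uid, matching $actualTimestamps: both entries survive.
by_tenant_uid = {f'{d["tenant"]}:{d["uid"]}': d["created_at"] for d in docs}

print(len(by_uid))         # 1 -- one tenant's timestamp was silently lost
print(len(by_tenant_uid))  # 2
```

With the uid-only map, tenant 1's document fails the timestamp comparison and is dropped from the inserted set, exactly the missed-permission-insert failure described above.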
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2664 - 2728, The
expectedTimestamps map is not tenant-aware while actualTimestamps is, so when
sharedTables && tenantPerDocument you must key expectedTimestamps the same way
as actualTimestamps (tenant + ':' + uid) during the initial foreach over
$documents and also use that key when comparing later (instead of
$expectedTimestamps[$doc->getId()]); update the loop that builds
$expectedTimestamps and the comparison that checks equality to construct/look up
keys as ($doc->getTenant() . ':' . $doc->getId()) when $this->sharedTables &&
$this->tenantPerDocument so entries don't collide across tenants.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 892638bd-ac20-414b-9059-7f7e20295028

📥 Commits

Reviewing files that changed from the base of the PR and between 242c9a8 and 455da70.

📒 Files selected for processing (2)
  • src/Database/Adapter/SQL.php
  • src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Database.php

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2676-2682: The verification query is binding composite keys from
$expectedTimestamps into $verifyBinds/placeholders and then comparing them
against the _uid column, but when tenantPerDocument is enabled _uid was set to
the raw document ID ($attributes['_uid'] = $document->getId()), so the WHERE
_uid IN (...) must use the document IDs not the composite keys. Fix by
extracting the document ID portion from each composite key before creating $ph
and adding to $verifyBinds (or if keys are already plain IDs, leave unchanged);
update the loop that builds $verifyPlaceholders/$verifyBinds (the foreach over
array_values(array_unique(array_keys($expectedTimestamps)))) to derive $uid =
<documentIdFromComposite($uid)> and bind that instead so the verification query
matches rows by actual _uid values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 43af96a2-d479-4354-86c9-7d68abad7b56

📥 Commits

Reviewing files that changed from the base of the PR and between 455da70 and deedfda.

📒 Files selected for processing (1)
  • src/Database/Adapter/SQL.php

Adapters:
- MariaDB/MySQL: INSERT IGNORE INTO
- PostgreSQL: ON CONFLICT DO NOTHING
- SQLite: INSERT OR IGNORE INTO
- MongoDB: session-scoped pre-filter before insertMany

Database.php:
- Intra-batch dedup by ID (tenant-aware, first occurrence wins)
- Pre-fetch existing IDs to skip known duplicates
- Deferred relationship creation for ignore mode (no orphans)
- Race-condition reconciliation via _createdAt timestamp verification

upsertDocuments:
- Batch-fetch existing docs with find() instead of per-row getDocument()
- Tenant-aware composite keys for seenIds duplicate check

All paths are tenant-per-document aware with null-safe array_filter.
- testCreateDocumentsIgnoreDuplicates: mixed batch with onNext assertions
- testCreateDocumentsIgnoreIntraBatchDuplicates: first-wins, no ACL drift
- testCreateDocumentsIgnoreAllDuplicates: zero inserts, empty onNext
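The row-level idempotency these commits describe can be sketched with SQLite's `INSERT OR IGNORE` (a Python stand-in for the adapter's SQL; table and values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (_uid TEXT PRIMARY KEY, name TEXT)")

# First import inserts both rows.
batch = [("a", "First"), ("b", "First")]
con.executemany("INSERT OR IGNORE INTO docs VALUES (?, ?)", batch)

# Re-running the import with a changed value plus a new row: the existing
# row is a per-row no-op, the new row goes through.
retry = [("a", "Second"), ("c", "New")]
con.executemany("INSERT OR IGNORE INTO docs VALUES (?, ?)", retry)

print(con.execute("SELECT _uid, name FROM docs ORDER BY _uid").fetchall())
# [('a', 'First'), ('b', 'First'), ('c', 'New')]
```

Existing rows keep their original values (no ACL/data drift), which is the contract the mixed-batch and all-duplicates tests assert.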
@premtsd-code premtsd-code force-pushed the csv-import-upsert-v2 branch from f2b08dc to bee42d4 Compare April 9, 2026 10:28
…x _uid mapping

- Pass ignoreDuplicates option to insertMany for race-condition safety
- Extract _uid from raw array before replaceChars (excluded from transformation)
- Map inserted records back to original Document objects by _uid
- Add skipDuplicates(callable) scope guard on Database, following
  existing pattern (skipRelationships, skipValidation, etc.)
- Remove bool $ignore parameter from createDocuments signature
- Mirror propagates skipDuplicates state to source and destination
- Update tests to use $database->skipDuplicates(function() { ... })
- Add $skipDuplicates property + skipDuplicates() scope guard to Adapter
- Remove bool $ignore parameter from all adapter createDocuments signatures
  (Adapter, Pool, Mongo, SQL)
- Remove bool $ignore from SQL helper methods (getInsertKeyword,
  getInsertSuffix, getInsertPermissionsSuffix) and Postgres/SQLite overrides
- Pool delegate() and withTransaction() propagate skipDuplicates state to
  pooled/pinned adapter via the scope guard
- Database::createDocuments() wraps adapter call in adapter->skipDuplicates()
  when the flag is set, drops the local $ignore variable
- skipDuplicates() takes optional $enable so Pool/Mirror/Database can
  forward state in a single call instead of branching
- Database: tenantKey(Document) helper replaces ~8 inline ternaries
- Database: fetchExistingByIds() helper replaces the tenant-grouped
  batched find() in createDocuments and upsertDocuments pre-fetch
- Database: drop dead guard on non-defer relationship path
- Mongo: findExistingUids() helper replaces the duplicated cursor walk
  in the skipDuplicates pre-filter (tenant and non-tenant variants)
- SQL: buildUidTenantLookup() helper replaces the duplicated
  placeholder-and-binds builder shared by pre-filter and reconciliation
When a batch has only null-ID documents but non-null tenants, the
previous code emitted WHERE _uid IN () AND _tenant IN (...) — invalid
SQL that the !empty(binds) call-site guard failed to catch because the
tenant binds kept the binds array non-empty.

Return 'where' => '1=0' with empty binds when there are no UIDs, so
the query matches nothing and leaves no dangling placeholders.
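A sketch of the guard in Python (hypothetical helper mirroring the shape of the PHP `buildUidTenantLookup`; the real implementation lives in SQL.php):

```python
def build_uid_lookup(uids, tenants):
    """No UIDs means the lookup must match nothing, even when tenant
    binds would otherwise keep the binds array non-empty."""
    if not uids:
        return {"where": "1=0", "binds": []}
    where = "_uid IN ({})".format(", ".join("?" * len(uids)))
    binds = list(uids)
    if tenants:
        where += " AND _tenant IN ({})".format(", ".join("?" * len(tenants)))
        binds += list(tenants)
    return {"where": where, "binds": binds}

print(build_uid_lookup([], [42]))
# {'where': '1=0', 'binds': []} -- no dangling IN () placeholder
print(build_uid_lookup(["a", "b"], [42]))
# {'where': '_uid IN (?, ?) AND _tenant IN (?)', 'binds': ['a', 'b', 42]}
```

Without the early return, the empty-UID case would emit `WHERE _uid IN () AND _tenant IN (?)`, and a `!empty($binds)` guard at the call site cannot catch it because the tenant bind keeps the array non-empty.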
- testCreateDocumentsSkipDuplicatesEmptyBatch: no-op with empty array
- testCreateDocumentsSkipDuplicatesNestedScope: scope guard save/restore
  through nested skipDuplicates() calls
- testCreateDocumentsSkipDuplicatesLargeBatch: 300-doc batch with 50
  pre-existing, exercising the chunk loop
- testCreateDocumentsSkipDuplicatesSecondCallSkipsAll: second identical
  batch should preserve First values, not overwrite with Second
- testCreateDocumentsSkipDuplicatesRelationships: skipDuplicates with a
  one-to-many relationship — verifies deferred relationship logic does
  not create orphan child rows for ignored parents
INSERT IGNORE / ON CONFLICT DO NOTHING / INSERT OR IGNORE handle
duplicates at the DB level, and the existing reconciliation block
identifies actually-inserted rows via the _createdAt timestamp. The
pre-filter was a wire-data optimization that ran a SELECT before the
INSERT, then trimmed the batch — but the same SELECT (with timestamp
comparison) already happens in the reconciliation path on demand.

Net: ~45 lines removed, single code path for skipDuplicates instead
of pre-filter + fallback. Race-condition handling is unchanged
(reconciliation path covers it).
Previously the Database orchestrator carried a pre-fetch + chunk-filter
pass to compensate for Mongo's plain upsert() returning matched +
upserted combined ('n') — without that compensation, $modified would
over-count the number of pre-existing docs in the batch.

This pushes the responsibility into the adapter, where it belongs:
- composer bump to utopia-php/mongo dev-upsert-return-upserted-count
  (PR #37, adds Client::upsertWithCounts() returning matched/modified/
  upserted[] separately)
- Mongo adapter calls upsertWithCounts() and maps the response's
  upserted[].index entries back to the originating Document objects,
  returning ONLY the actually-inserted docs (race-safe)
- Database::createDocuments drops $preExistingIds pre-fetch and the
  chunk filter; the adapter is now responsible for accurate counts

Net effect on Database.php: ~17 lines removed. Net effect on Mongo.php:
~80 lines net deletion (replaced the inline pre-filter with a single
upsertWithCounts call + index mapping).
Previously Mirror::createDocuments built its destination payload from
the original input ($documents), not from the docs the source actually
persisted. In skipDuplicates mode that lets the destination diverge from
the source whenever the source no-ops a write:

  source    has  id=a, name='Original'
  dest      missing id=a
  caller    skipDuplicates(createDocuments([{id: a, name: 'Duplicate'}]))

The source skips a (already present) but the Mirror still forwards
{id: a, name: 'Duplicate'} to the destination, which inserts it — the
two replicas now hold different values for the same row.

The same code path also throws "All documents must have an sequence if
one is set" inside Database::createDocuments when a partially-skipped
batch reaches the destination, because the inserted doc has a sequence
assigned by the source adapter while the skipped doc does not. Mirror
catches and silently logs that exception, so the destination ends up
missing both rows.

Fix: wrap the user's onNext to capture the docs the source actually
persisted, then forward only those to the destination. Skipped docs
are now correctly never mirrored.

Adds testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination —
seeds a row in the source only (bypassing the mirror), then calls
skipDuplicates(createDocuments(...)) with a different value for that
id plus a new row. Asserts source keeps its original value, destination
ends up with only the new row, and the skipped value never reaches
either side.
@premtsd-code premtsd-code changed the title Add ignore param to createDocuments for silent duplicate handling Add skipDuplicates() scope guard to createDocuments Apr 13, 2026
The SQL adapter previously carried a pre-filter SELECT plus a _createdAt
reconciliation pass to compensate for INSERT IGNORE not telling us which
rows were actually inserted vs skipped. The Mongo adapter pulled the same
trick via Client::upsertWithCounts() (mongo PR #37). Both layers were
solving the same problem twice.

Move the pre-filter back to Database::createDocuments where it ran
originally, and let each adapter's INSERT IGNORE / ON CONFLICT DO NOTHING
/ upsert+$setOnInsert serve only as a race-safety net. The adapter just
returns its input as the inserted set.

- Database.php: restore $preExistingIds pre-fetch + chunk filter
- SQL.php: drop pre-filter, drop _createdAt reconciliation, drop
  buildUidTenantLookup helper (-173 lines)
- Mongo.php: revert to plain Client::upsert(), return input docs
  (withTransaction skipDuplicates bypass stays — still needed to avoid
  WriteConflict under snapshot isolation)
- composer.json/lock: back to stable utopia-php/mongo 1.*

Trade-off: in the race window between the orchestrator's pre-fetch
SELECT and the adapter INSERT, $modified over-reports by N and onNext
fires for N skipped docs. Common case stays correct.
…en comments

The deferred-relationships mechanism was useful when the adapter could
report which docs were actually inserted (via reconciliation/
upsertWithCounts). With those gone, $batch always equals $chunk, so the
defer-then-match dance is dead weight. The orchestrator already pre-
filters known duplicates upfront, so any survivor is expected to insert
and relationships can be created inline like the non-skipDuplicates path.

- Database.php: pre-filter $documents once, before encode/validate,
  instead of filtering each chunk inline. Drop $deferredRelationships
  scaffolding and the per-chunk match loop. Collapse if/elseif into
  one inline createDocumentRelationships call.
- Mongo.php: tighten withTransaction skipDuplicates comment (the
  "relationship writes are deferred" rationale is no longer accurate).
- Mirror.php: tighten the captureOnNext comment.
- composer.lock: revert unrelated symfony-polyfill-php85 / mongo 1.0.0→
  1.0.2 bumps that crept in from an earlier composer update.

Race window unchanged: a doc that passes pre-filter but loses to a
concurrent insert still produces an orphan relationship and an
over-counted $modified — same as before this commit.
 * @param bool $enable
 * @return T
 */
public function skipDuplicates(callable $callback, bool $enable = true): mixed
Member

Let's remove the $enable flag and branch on $skipDuplicates at the call site instead, otherwise we add the overhead of invoking an extra callable when it's false

* @param bool $idsOnly When true, uses Query::select(['$id']) for a lighter fetch
* @return array<string, Document>
*/
private function fetchExistingByIds(Document $collection, array $documents, bool $idsOnly = false): array
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use find

Comment on lines +5751 to +5776
        // skipDuplicates: dedupe intra-batch (first wins) then drop already-existing rows.
        // Adapter INSERT IGNORE / ON CONFLICT / upsert is the race-safety net.
        if ($this->skipDuplicates) {
            $seenIds = [];
            $deduplicated = [];
            foreach ($documents as $document) {
                if ($document->getId() !== '') {
                    $dedupeKey = $this->tenantKey($document);
                    if (isset($seenIds[$dedupeKey])) {
                        continue;
                    }
                    $seenIds[$dedupeKey] = true;
                }
                $deduplicated[] = $document;
            }
            $documents = $deduplicated;

            $preExistingIds = $this->fetchExistingByIds($collection, $documents, idsOnly: true);
            if (!empty($preExistingIds)) {
                $documents = \array_values(\array_filter(
                    $documents,
                    fn (Document $doc) => !isset($preExistingIds[$this->tenantKey($doc)])
                ));
            }
        }

Member

I don't think we need to do this if using insert ignore

Contributor Author

@abnegate The child collection gets orphan rows when a batch contains multiple rows with the same parent but different children: the parent insert is ignored, but the child inserts are not.

if ($this->resolveRelationships) {
    $document = $this->silent(fn () => $this->createDocumentRelationships($collection, $document));
}

Comment on lines +607 to +613
$insertedFromSource = [];
$captureOnNext = function (Document $doc) use (&$insertedFromSource, $onNext): void {
    $insertedFromSource[] = $doc;
    if ($onNext !== null) {
        $onNext($doc);
    }
};
Member

We can never capture like this internally; if there were many docs, this accumulation could easily OOM

Comment on lines +2614 to +2640
/**
 * Returns the INSERT keyword, optionally with IGNORE for duplicate handling.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertKeyword(): string
{
    return $this->skipDuplicates ? 'INSERT IGNORE INTO' : 'INSERT INTO';
}

/**
 * Returns a suffix appended after VALUES clause for duplicate handling.
 * Override in adapter subclasses (e.g., Postgres uses ON CONFLICT DO NOTHING).
 */
protected function getInsertSuffix(string $table): string
{
    return '';
}

/**
 * Returns a suffix for the permissions INSERT statement when ignoring duplicates.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertPermissionsSuffix(): string
{
    return '';
}
Member

Let's group logically with other getter methods

#1 Drop $enable flag on skipDuplicates() scope guard
The $enable param made every non-skipDuplicates createDocuments call pay
for a closure allocation + extra function call. Branch at the call site
instead so the cost only applies when the flag is actually set.

  - Adapter::skipDuplicates(callable, bool) → skipDuplicates(callable)
  - Database::skipDuplicates(callable, bool) → skipDuplicates(callable)
  - Database::createDocuments, Mirror::createDocuments, Pool::delegate,
    Pool::withTransaction now branch inline.

#2 Drop fetchExistingByIds helper, inline find()
The helper's per-tenant grouping defended a hypothetical multi-tenant
batching scenario that no caller exercises (relationships are intra-
tenant, callers always batch per tenant). Existing patterns in the same
file (refetchDocuments, relationship loading) just call find() directly.
Match that idiom and drop ~70 lines.

#4 Mirror: only capture inserted docs in skipDuplicates mode
The captureOnNext accumulator paid the cost (closure + per-doc array
push) on every createDocuments call, including the common non-skip path.
Branch at the entry of Mirror::createDocuments so the capture only
happens when skipDuplicates is set; the non-skip path passes through
to source/destination unchanged.

#5 Move getInsertKeyword/Suffix/PermissionsSuffix to getters cluster
Were sitting next to createDocuments(); moved to the getSupport*
cluster around line 1030 where other adapter-capability shims live.

Not addressed:
- #2 partial: the existing patterns (refetchDocuments etc.) don't handle
  tenant-per-document multi-tenant batches either, so this is consistent.
- #3 (drop the pre-filter): rejected. createDocumentRelationships runs
  in the encoding loop BEFORE the adapter's INSERT IGNORE no-ops the
  parent, so dropping the pre-filter would deterministically duplicate
  child rows on every CSV re-import of a collection with relationships
  (not a race window — every call). The relationships test verifies
  this. Reverting would require reintroducing the deferred-relationships
  scaffolding we just removed, and the adapter still couldn't tell us
  which parents were actually inserted (SQL INSERT IGNORE has no per-row
  reporting). Pre-filter stays.
Per Jake's #4 review comment, the captureOnNext accumulator could grow
unbounded with the full inserted set. Replace it with a bounded buffer
that flushes to the destination every $batchSize docs as the source
streams them through onNext, plus a tail flush after the source returns.

Memory peak in skipDuplicates mode is now O($batchSize) (~1000 docs)
regardless of how many docs the source inserts. Same destination batching
efficiency — destination still gets multi-doc createDocuments calls,
just one per chunk instead of one giant call at the end.

Eligibility checks (SOURCE_ONLY_COLLECTIONS, upgrade status) moved
upfront so the flush closure can capture the decision and short-circuit
cleanly. Non-skip path is unchanged: direct passthrough, no buffer.

The flush-on-fill is load-bearing for the memory bound — see the
in-method comment.
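The flush strategy can be sketched in a few lines (Python stand-in; `flush` plays the role of the destination's createDocuments call, and the iteration stands in for the source streaming docs through onNext):

```python
def stream_with_bounded_buffer(source_docs, flush, batch_size=3):
    """Buffer inserted docs as the source streams them, flush to the
    destination every batch_size docs, then do a tail flush. Peak
    memory is O(batch_size) regardless of total inserted count."""
    buffer = []
    for doc in source_docs:          # source's onNext stream
        buffer.append(doc)
        if len(buffer) >= batch_size:
            flush(buffer)            # load-bearing for the memory bound
            buffer = []
    if buffer:                       # tail flush after the source returns
        flush(buffer)

flushed = []
stream_with_bounded_buffer(range(7), lambda chunk: flushed.append(list(chunk)))
print(flushed)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The destination still receives multi-doc batches, just one per chunk instead of one unbounded call at the end, which is the trade the commit describes.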
…erDocument

When inlining find() per Jake's review, the per-tenant grouping the
deleted fetchExistingByIds helper used to do was lost. In tenantPerDocument
mode with a multi-tenant batch, the inlined find() runs under the session
tenant context (often null for platform workers) and silently misses rows
belonging to other tenants — every input doc looks "new" and goes down the
create path instead of the update path.

This actively breaks appwrite's StatsUsage worker, which accumulates stats
across many projects (each tagged with its own $tenant) and flushes them
via $dbForLogs->setTenant(null)->setTenantPerDocument(true) followed by
upsertDocumentsWithIncrease(...). With the old per-doc getDocument loop,
each lookup ran under withTenant($doc->getTenant()) and was correct. The
batch find() inlining lost that scoping.

Fix per Greptile's suggestion: branch on tenantPerDocument mode and run
one find() per unique tenant value under withTenant, merging results into
the same lookup hash. K queries (K = unique tenants in the batch) instead
of N (per-doc) or 1 (broken). Common single-tenant case still hits the
fast batched path.

Applied at both call sites:
- Database::createDocuments skipDuplicates pre-filter
- Database::upsertDocumentsWithIncrease existing-docs lookup

Fixes greptile flag at PR #852 discussion r3077481595.
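A Python sketch of the K-queries-per-batch shape (`find_for_tenant` stands in for find() running under withTenant; all names are illustrative, not the PR's PHP):

```python
def fetch_existing_by_tenant(docs, find_for_tenant):
    """One lookup per unique tenant in the batch, instead of one per
    document (slow) or one un-scoped query (misses other tenants' rows)."""
    by_tenant = {}
    for d in docs:
        by_tenant.setdefault(d["tenant"], []).append(d["uid"])
    lookup = {}
    for tenant, uids in by_tenant.items():
        # each call runs scoped to its tenant, like withTenant($tenant)
        for uid in find_for_tenant(tenant, uids):
            lookup[f"{tenant}:{uid}"] = True
    return lookup

# Simulated store: tenant 1 already has 'a', tenant 2 already has 'b'.
existing = {1: {"a"}, 2: {"b"}}
docs = [{"tenant": 1, "uid": "a"}, {"tenant": 1, "uid": "x"}, {"tenant": 2, "uid": "b"}]
result = fetch_existing_by_tenant(docs, lambda t, uids: existing[t] & set(uids))
print(result)  # {'1:a': True, '2:b': True} -- 'x' correctly reads as new
```

K queries for K unique tenants; the common single-tenant batch still hits one batched lookup.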
Per Jake's review feedback, the pre-filter in Database::createDocuments
was redundant with the adapter's INSERT IGNORE / ON CONFLICT DO NOTHING
/ upsert + $setOnInsert path. The adapter already handles duplicates
atomically; the upfront SELECT duplicated that work.

The semantic is now row-level dedup:
- pre-existing rows are silently no-op'd by the adapter
- new rows go through
- retries are idempotent
- source data is never overwritten

Tests updated to match:
- 5 skipDuplicates tests adjusted to expect full-chunk counts, with
  data-level assertions unchanged (pre-existing values preserved, new
  rows inserted).
- testCreateDocumentsSkipDuplicatesRelationships encodes the retry
  scenario: missing nested child gets created via relateDocuments'
  idempotent path when the parent already exists.
- testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination relaxed
  to source-authoritative semantics only. Destination may transiently
  hold would-be values during the migration window; backfill reconciles.

All 105 skipDuplicates tests across 11 adapter variants pass, full
Mirror suite (636 tests) passes, PHPStan level 7 clean.