Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions scripts/mem-repro/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
results/
77 changes: 77 additions & 0 deletions scripts/mem-repro/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# Passthrough memory reproduction harness

Tooling used to reproduce and root-cause the passthrough memory leak (BUG-300):
Proxy leaks ~1 KB per statement when running in **passthrough with an empty
encrypt config** (no encryption configured), until the pod is OOM-killed.

See [`summary.md`](summary.md) for the write-up, root cause, and fix.

## Layout

| Path | What |
| --- | --- |
| `soak.sh` | Sustained load + RSS sampling — the script that surfaces the leak |
| `run-docker.sh` | One load run + cgroup sampling against the `proxy` container |
| `run.sh` | Same, but for a proxy running as a host process |
| `sample-rss.sh` | Standalone RSS sampler (host process) |
| `schema.sql` | Plaintext `credit_data_order_v2` table (all traffic is passthrough) |
| `loadgen/` | Go load generator (parameterised INSERTs via pgx) |
| `prepleak/` | Go probe for the named-prepared-statement retention path |
| `go.mod` | Go module covering `loadgen/` and `prepleak/` |

Raw run outputs land in `results/` (git-ignored).

## Prerequisites

- Postgres + a built proxy container (`mise run postgres:up`, `mise run proxy:up`).
- Apply the plaintext schema once: `mise run postgres:psql < scripts/mem-repro/schema.sql`.
- **No local Go needed** — `run-docker.sh` / `soak.sh` run the generator in a
`golang` container. (`run.sh` is the host-process variant and does need Go.)

## Reproducing the leak

The trigger is an **empty encrypt config**. Make the config empty, then soak:

```bash
# 1. Empty the encrypt config (unconfigured passthrough)
mise run postgres:psql -c "UPDATE eql_v2_configuration SET state='inactive' WHERE state='active';"
# 2. (re)start the proxy so it loads the empty config, then drive sustained load
scripts/mem-repro/soak.sh empty-config 6 50000 16
```

`soak.sh <label> [rounds] [count] [workers]` runs N rounds of inserts and samples
the proxy container's cgroup `memory.current` throughout (plus a cooldown to see
whether RSS is released). On an affected build RSS climbs monotonically and does
not drop; on a fixed build it plateaus.

## Variants

Compare by restarting the proxy with different env / config, then re-running the soak:

| Variant | How |
| --- | --- |
| baseline (glibc) | empty config, default settings |
| `MALLOC_ARENA_MAX=2` | set the env on the proxy — tests glibc arena bloat (ruled out) |
| `DISABLE_MAPPING` | `CS_DEVELOPMENT__DISABLE_MAPPING=true` — confirmed leak-free (the workaround) |
| config present (control) | restore an `active` row in `eql_v2_configuration` — no leak |

> An allocator comparison (jemalloc) was also run during the investigation and
> showed no difference (the leak is logical retention, not fragmentation). The
> jemalloc global-allocator feature is **not** included here — it is tracked as a
> separate decision.

## Named-statement probe

```bash
# inside the golang container, or locally from this dir:
go run ./prepleak -count 100000 -dsn '<dsn>'
```
Exercises the separate (bounded, freed-on-disconnect) named-prepared-statement
retention noted in `analysis.md`.

## Notes

- Run on Linux for production-faithful numbers (glibc arena behaviour, cgroup
memory). macOS is directional only.
- Default proxy log level is `debug`, which is allocation-heavy; set
`CS_LOG__LEVEL=warn` for clean memory measurements.
19 changes: 19 additions & 0 deletions scripts/mem-repro/go.mod
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
// Go module for the memory-reproduction load generators (loadgen/, prepleak/).
// Not part of the Rust build; used only by scripts/mem-repro.
module cipherstash-proxy-mem-repro

go 1.22

require (
github.com/google/uuid v1.6.0
github.com/jackc/pgx/v5 v5.7.2
)

require (
github.com/jackc/pgpassfile v1.0.0 // indirect
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 // indirect
github.com/jackc/puddle/v2 v2.2.2 // indirect
golang.org/x/crypto v0.31.0 // indirect
golang.org/x/sync v0.10.0 // indirect
golang.org/x/text v0.21.0 // indirect
)
30 changes: 30 additions & 0 deletions scripts/mem-repro/go.sum
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM=
github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg=
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 h1:iCEnooe7UlwOQYpKFhBabPMi4aNAfoODPEFNiAnClxo=
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM=
github.com/jackc/pgx/v5 v5.7.2 h1:mLoDLV6sonKlvjIEsV56SkWNCnuNv531l94GaIzO+XI=
github.com/jackc/pgx/v5 v5.7.2/go.mod h1:ncY89UGWxg82EykZUwSpUKEfccBGGYq1xjrOpsbsfGQ=
github.com/jackc/puddle/v2 v2.2.2 h1:PR8nw+E/1w0GLuRFSmiioY6UooMp6KJv0/61nB7icHo=
github.com/jackc/puddle/v2 v2.2.2/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKsk=
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
golang.org/x/crypto v0.31.0 h1:ihbySMvVjLAeSH1IbfcRTkD/iNscyz8rGzjF/E5hV6U=
golang.org/x/crypto v0.31.0/go.mod h1:kDsLvtWBEx7MV9tJOj9bnXsPbxwJQ6csT/x4KIN4Ssk=
golang.org/x/sync v0.10.0 h1:3NQrjDixjgGwUOCaF8w2+VYHv0Ve/vGYSbdkTa98gmQ=
golang.org/x/sync v0.10.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=
golang.org/x/text v0.21.0 h1:zyQAAkrwaneQ066sspRyJaG9VNi/YJ1NfzcGB3hZ/qo=
golang.org/x/text v0.21.0/go.mod h1:4IBbMaMmOPCJ8SecivzSH54+73PCFmPWxNTLm+vZkEQ=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
209 changes: 209 additions & 0 deletions scripts/mem-repro/loadgen/main.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,209 @@
package main

import (
"context"
"flag"
"fmt"
"log"
"sync"
"sync/atomic"
"time"

"github.com/google/uuid"
"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
var (
dsn = flag.String("dsn", "postgres://bloomuser:password@127.0.0.1:6432/pdts_db?sslmode=disable", "Database connection string (DSN)")
count = flag.Int("count", 10000, "Number of inserts to perform")
workers = flag.Int("workers", 10, "Number of concurrent workers")
)
flag.Parse()

if *dsn == "" {
log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
Comment on lines +17 to +25
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate -count and -workers as positive values.

Right now Line 143 will panic for a negative -count, and -workers=0 falls through to a successful Done! without inserting anything. This harness should fail fast on invalid inputs instead of producing misleading no-op runs.

Proposed fix
 	flag.Parse()
 
 	if *dsn == "" {
 		log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
 	}
+	if *count <= 0 {
+		log.Fatal("-count must be > 0")
+	}
+	if *workers <= 0 {
+		log.Fatal("-workers must be > 0")
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var (
dsn = flag.String("dsn", "postgres://bloomuser:password@127.0.0.1:6432/pdts_db?sslmode=disable", "Database connection string (DSN)")
count = flag.Int("count", 10000, "Number of inserts to perform")
workers = flag.Int("workers", 10, "Number of concurrent workers")
)
flag.Parse()
if *dsn == "" {
log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
var (
dsn = flag.String("dsn", "postgres://bloomuser:password@127.0.0.1:6432/pdts_db?sslmode=disable", "Database connection string (DSN)")
count = flag.Int("count", 10000, "Number of inserts to perform")
workers = flag.Int("workers", 10, "Number of concurrent workers")
)
flag.Parse()
if *dsn == "" {
log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
}
if *count <= 0 {
log.Fatal("-count must be > 0")
}
if *workers <= 0 {
log.Fatal("-workers must be > 0")
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/loadgen/main.go` around lines 17 - 25, The flags validation
is incomplete: after flag.Parse() add checks that *count and *workers are
positive (>0) and fail fast; specifically, validate the variables count and
workers (and keep the existing dsn check) and call log.Fatal with clear messages
like "count must be > 0" or "workers must be > 0" when invalid so main (and the
insert loop that uses *count and worker goroutines that rely on *workers) never
proceeds with negative or zero values.

}

ctx := context.Background()

pool, err := pgxpool.New(ctx, *dsn)
if err != nil {
log.Fatal("Failed to create pgx pool:", err)
}
defer pool.Close()

if err := pool.Ping(ctx); err != nil {
log.Fatal("Failed to ping database:", err)
}

log.Printf("Connected to database, inserting %d records with %d workers...", *count, *workers)

orgID := "539008ae-e1ff-42ed-8a58-e3588befea9d"
testJSON := []byte(`{
"reportId": "RPT-2024-001234567890",
"generatedAt": "2024-01-15T14:30:00Z",
"consumer": {
"firstName": "John",
"lastName": "Doe",
"dateOfBirth": "1985-06-15",
"ssn": "XXX-XX-1234",
"addresses": [
{
"type": "current",
"street": "123 Main Street",
"city": "Springfield",
"state": "IL",
"zipCode": "62701",
"since": "2020-03-01"
},
{
"type": "previous",
"street": "456 Oak Avenue",
"city": "Chicago",
"state": "IL",
"zipCode": "60601",
"since": "2015-08-15"
}
],
"employment": {
"employer": "Acme Corporation",
"position": "Software Engineer",
"income": 95000,
"since": "2019-01-15"
}
},
"creditScore": {
"value": 742,
"model": "FICO8",
"range": {"min": 300, "max": 850},
"factors": [
{"code": "01", "description": "Length of time accounts have been established"},
{"code": "14", "description": "Number of accounts with delinquency"},
{"code": "07", "description": "Too many inquiries last 12 months"}
]
},
"accounts": [
{
"accountNumber": "XXXX-XXXX-XXXX-4567",
"creditor": "First National Bank",
"type": "creditCard",
"status": "open",
"openDate": "2018-05-20",
"creditLimit": 15000,
"balance": 3250,
"monthlyPayment": 150,
"paymentHistory": ["OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK"]
},
{
"accountNumber": "LOAN-789012",
"creditor": "Auto Finance LLC",
"type": "autoLoan",
"status": "open",
"openDate": "2022-02-10",
"originalAmount": 28000,
"balance": 18500,
"monthlyPayment": 485,
"paymentHistory": ["OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK"]
},
{
"accountNumber": "MTG-555666777",
"creditor": "Home Mortgage Corp",
"type": "mortgage",
"status": "open",
"openDate": "2020-03-15",
"originalAmount": 320000,
"balance": 295000,
"monthlyPayment": 1850,
"paymentHistory": ["OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK"]
}
],
"inquiries": [
{"date": "2024-01-10", "creditor": "Capital One", "type": "hard"},
{"date": "2023-11-05", "creditor": "Chase Bank", "type": "hard"},
{"date": "2023-09-20", "creditor": "American Express", "type": "soft"}
],
"publicRecords": [],
"collections": [],
"metadata": {
"version": "2.1.0",
"provider": "TestProvider",
"requestId": "REQ-2024-ABCDEF123456",
"processingTimeMs": 245
}
}`)

query := `
INSERT INTO credit_data_order_v2 (id, organization_id, order_id, account_review, full_report, raw_report, created_at, updated_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
`

var completed int64
var wg sync.WaitGroup
jobs := make(chan int, *count)

// Start workers
for w := 0; w < *workers; w++ {
wg.Add(1)
go func() {
defer wg.Done()
for range jobs {
id := uuid.New().String()
orderID := uuid.New().String()
now := time.Now()

tx, err := pool.Begin(ctx)
if err != nil {
log.Printf("Insert failed to begin tx: %v", err)
atomic.AddInt64(&completed, 1)
continue
}

// NOTE: The original repro issued `SET CIPHERSTASH.KEYSET_NAME`
// here to mirror repo.setKeysetNameOnTx. For the passthrough
// (no-encryption) memory test that SET is omitted: this proxy has
// a default keyset configured, which rejects the SET, and it is
// irrelevant to the per-statement type-check churn we measure.

_, err = tx.Exec(ctx, query,
id,
orgID,
orderID,
false,
testJSON,
testJSON,
now,
now,
)
if err != nil {
tx.Rollback(ctx)
log.Printf("Insert failed: %v", err)
atomic.AddInt64(&completed, 1)
continue
}

if err := tx.Commit(ctx); err != nil {
log.Printf("Insert failed to commit: %v", err)
atomic.AddInt64(&completed, 1)
continue
Comment on lines +155 to +188
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the run when a worker hits a database error.

Lines 156-188 currently count begin/exec/commit failures as completed work and still let main() exit 0 with Done!. That makes a broken repro look successful and can invalidate any RSS series collected around it. Please propagate the first worker error back to main and return a non-zero exit instead of continuing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/loadgen/main.go` around lines 155 - 188, The worker
currently treats Begin/Exec/Commit failures as "completed"
(atomic.AddInt64(&completed, 1)) and continues, causing main() to finish with
exit 0 even on DB errors; change the worker to propagate the first error back to
main by sending the error (with context cancel or a buffered errCh and
sync.Once) whenever pool.Begin(ctx), tx.Exec(ctx, ...), or tx.Commit(ctx) fail
(and still rollback where appropriate), stop incrementing completed for failed
attempts so completed only counts successful commits, and have main() listen for
that error and return/exit non-zero on the first reported error.

}
Comment on lines +185 to +189
Comment on lines +185 to +189

c := atomic.AddInt64(&completed, 1)
if c%100 == 0 || c == int64(*count) {
log.Printf("Inserted %d/%d records", c, *count)
}
}
}()
}

// Send jobs
for i := 0; i < *count; i++ {
jobs <- i
}
close(jobs)

// Wait for all workers to finish
wg.Wait()

fmt.Println("Done!")
}
43 changes: 43 additions & 0 deletions scripts/mem-repro/prepleak/main.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
// Tests the one unbounded code path identified in the proxy: named prepared
// statements are only removed from the per-connection statements map on an
// explicit Close/Deallocate. A client that prepares many uniquely-named
// statements on a single long-lived connection (and never deallocates) makes
// that map grow without bound.
//
// This uses a SINGLE connection and prepares -count uniquely-named statements.
package main

import (
"context"
"flag"
"fmt"
"log"

"github.com/jackc/pgx/v5"
)

func main() {
dsn := flag.String("dsn", "postgres://cipherstash:password@proxy:6432/cipherstash?sslmode=disable", "DSN")
count := flag.Int("count", 100000, "number of uniquely-named prepared statements")
flag.Parse()

ctx := context.Background()
conn, err := pgx.Connect(ctx, *dsn)
if err != nil {
log.Fatal("connect:", err)
}
defer conn.Close(ctx)

for i := 0; i < *count; i++ {
name := fmt.Sprintf("stmt_%d", i)
// Unique name, same SQL. Never deallocated -> proxy retains one entry per name.
_, err := conn.Prepare(ctx, name, "SELECT id, full_report FROM credit_data_order_v2 WHERE id = $1")
if err != nil {
log.Fatalf("prepare %d: %v", i, err)
}
if (i+1)%10000 == 0 {
log.Printf("prepared %d/%d", i+1, *count)
}
}
log.Printf("Done: %d uniquely-named prepared statements on one connection", *count)
}
Loading
Loading