-
Notifications
You must be signed in to change notification settings - Fork 1
test(mem-repro): passthrough memory-leak reproduction harness #396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| results/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| # Passthrough memory reproduction harness | ||
|
|
||
| Tooling used to reproduce and root-cause the passthrough memory leak (BUG-300): | ||
| Proxy leaks ~1 KB per statement when running in **passthrough with an empty | ||
| encrypt config** (no encryption configured), until the pod is OOM-killed. | ||
|
|
||
| See [`summary.md`](summary.md) for the write-up, root cause, and fix. | ||
|
|
||
| ## Layout | ||
|
|
||
| | Path | What | | ||
| | --- | --- | | ||
| | `soak.sh` | Sustained load + RSS sampling — the script that surfaces the leak | | ||
| | `run-docker.sh` | One load run + cgroup sampling against the `proxy` container | | ||
| | `run.sh` | Same, but for a proxy running as a host process | | ||
| | `sample-rss.sh` | Standalone RSS sampler (host process) | | ||
| | `schema.sql` | Plaintext `credit_data_order_v2` table (all traffic is passthrough) | | ||
| | `loadgen/` | Go load generator (parameterised INSERTs via pgx) | | ||
| | `prepleak/` | Go probe for the named-prepared-statement retention path | | ||
| | `go.mod` | Go module covering `loadgen/` and `prepleak/` | | ||
|
|
||
| Raw run outputs land in `results/` (git-ignored). | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Postgres + a built proxy container (`mise run postgres:up`, `mise run proxy:up`). | ||
| - Apply the plaintext schema once: `mise run postgres:psql < scripts/mem-repro/schema.sql`. | ||
| - **No local Go needed** — `run-docker.sh` / `soak.sh` run the generator in a | ||
| `golang` container. (`run.sh` is the host-process variant and does need Go.) | ||
|
|
||
| ## Reproducing the leak | ||
|
|
||
| The trigger is an **empty encrypt config**. Make the config empty, then soak: | ||
|
|
||
| ```bash | ||
| # 1. Empty the encrypt config (unconfigured passthrough) | ||
| mise run postgres:psql -c "UPDATE eql_v2_configuration SET state='inactive' WHERE state='active';" | ||
| # 2. (re)start the proxy so it loads the empty config, then drive sustained load | ||
| scripts/mem-repro/soak.sh empty-config 6 50000 16 | ||
| ``` | ||
|
|
||
| `soak.sh <label> [rounds] [count] [workers]` runs N rounds of inserts and samples | ||
| the proxy container's cgroup `memory.current` throughout (plus a cooldown to see | ||
| whether RSS is released). On an affected build RSS climbs monotonically and does | ||
| not drop; on a fixed build it plateaus. | ||
|
|
||
| ## Variants | ||
|
|
||
| Compare by restarting the proxy with different env / config, then re-running the soak: | ||
|
|
||
| | Variant | How | | ||
| | --- | --- | | ||
| | baseline (glibc) | empty config, default settings | | ||
| | `MALLOC_ARENA_MAX=2` | set the env on the proxy — tests glibc arena bloat (ruled out) | | ||
| | `DISABLE_MAPPING` | `CS_DEVELOPMENT__DISABLE_MAPPING=true` — confirmed leak-free (the workaround) | | ||
| | config present (control) | restore an `active` row in `eql_v2_configuration` — no leak | | ||
|
|
||
| > An allocator comparison (jemalloc) was also run during the investigation and | ||
| > showed no difference (the leak is logical retention, not fragmentation). The | ||
| > jemalloc global-allocator feature is **not** included here — it is tracked as a | ||
| > separate decision. | ||
|
|
||
| ## Named-statement probe | ||
|
|
||
| ```bash | ||
| # inside the golang container, or locally from this dir: | ||
| go run ./prepleak -count 100000 -dsn '<dsn>' | ||
| ``` | ||
| Exercises the separate (bounded, freed-on-disconnect) named-prepared-statement | ||
| retention noted in `analysis.md`. | ||
|
|
||
| ## Notes | ||
|
|
||
| - Run on Linux for production-faithful numbers (glibc arena behaviour, cgroup | ||
| memory). macOS is directional only. | ||
| - Default proxy log level is `debug`, which is allocation-heavy; set | ||
| `CS_LOG__LEVEL=warn` for clean memory measurements. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| // Go module for the memory-reproduction load generators (loadgen/, prepleak/). | ||
| // Not part of the Rust build; used only by scripts/mem-repro. | ||
| module cipherstash-proxy-mem-repro | ||
|
|
||
| go 1.22 | ||
|
|
||
| require ( | ||
| github.com/google/uuid v1.6.0 | ||
| github.com/jackc/pgx/v5 v5.7.2 | ||
| ) | ||
|
|
||
| require ( | ||
| github.com/jackc/pgpassfile v1.0.0 // indirect | ||
| github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 // indirect | ||
| github.com/jackc/puddle/v2 v2.2.2 // indirect | ||
| golang.org/x/crypto v0.31.0 // indirect | ||
| golang.org/x/sync v0.10.0 // indirect | ||
| golang.org/x/text v0.21.0 // indirect | ||
| ) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= | ||
| github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= | ||
| github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= | ||
| github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= | ||
| github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= | ||
| github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM= | ||
| github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg= | ||
| github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 h1:iCEnooe7UlwOQYpKFhBabPMi4aNAfoODPEFNiAnClxo= | ||
| github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM= | ||
| github.com/jackc/pgx/v5 v5.7.2 h1:mLoDLV6sonKlvjIEsV56SkWNCnuNv531l94GaIzO+XI= | ||
| github.com/jackc/pgx/v5 v5.7.2/go.mod h1:ncY89UGWxg82EykZUwSpUKEfccBGGYq1xjrOpsbsfGQ= | ||
| github.com/jackc/puddle/v2 v2.2.2 h1:PR8nw+E/1w0GLuRFSmiioY6UooMp6KJv0/61nB7icHo= | ||
| github.com/jackc/puddle/v2 v2.2.2/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4= | ||
| github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= | ||
| github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= | ||
| github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= | ||
| github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= | ||
| github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= | ||
| github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKsk= | ||
| github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4= | ||
| golang.org/x/crypto v0.31.0 h1:ihbySMvVjLAeSH1IbfcRTkD/iNscyz8rGzjF/E5hV6U= | ||
| golang.org/x/crypto v0.31.0/go.mod h1:kDsLvtWBEx7MV9tJOj9bnXsPbxwJQ6csT/x4KIN4Ssk= | ||
| golang.org/x/sync v0.10.0 h1:3NQrjDixjgGwUOCaF8w2+VYHv0Ve/vGYSbdkTa98gmQ= | ||
| golang.org/x/sync v0.10.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk= | ||
| golang.org/x/text v0.21.0 h1:zyQAAkrwaneQ066sspRyJaG9VNi/YJ1NfzcGB3hZ/qo= | ||
| golang.org/x/text v0.21.0/go.mod h1:4IBbMaMmOPCJ8SecivzSH54+73PCFmPWxNTLm+vZkEQ= | ||
| gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= | ||
| gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= | ||
| gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= | ||
| gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,209 @@ | ||
| package main | ||
|
|
||
| import ( | ||
| "context" | ||
| "flag" | ||
| "fmt" | ||
| "log" | ||
| "sync" | ||
| "sync/atomic" | ||
| "time" | ||
|
|
||
| "github.com/google/uuid" | ||
| "github.com/jackc/pgx/v5/pgxpool" | ||
| ) | ||
|
|
||
| func main() { | ||
| var ( | ||
| dsn = flag.String("dsn", "postgres://bloomuser:password@127.0.0.1:6432/pdts_db?sslmode=disable", "Database connection string (DSN)") | ||
| count = flag.Int("count", 10000, "Number of inserts to perform") | ||
| workers = flag.Int("workers", 10, "Number of concurrent workers") | ||
| ) | ||
| flag.Parse() | ||
|
|
||
| if *dsn == "" { | ||
| log.Fatal("DSN is required. Use -dsn flag to provide database connection string") | ||
| } | ||
|
|
||
| ctx := context.Background() | ||
|
|
||
| pool, err := pgxpool.New(ctx, *dsn) | ||
| if err != nil { | ||
| log.Fatal("Failed to create pgx pool:", err) | ||
| } | ||
| defer pool.Close() | ||
|
|
||
| if err := pool.Ping(ctx); err != nil { | ||
| log.Fatal("Failed to ping database:", err) | ||
| } | ||
|
|
||
| log.Printf("Connected to database, inserting %d records with %d workers...", *count, *workers) | ||
|
|
||
| orgID := "539008ae-e1ff-42ed-8a58-e3588befea9d" | ||
| testJSON := []byte(`{ | ||
| "reportId": "RPT-2024-001234567890", | ||
| "generatedAt": "2024-01-15T14:30:00Z", | ||
| "consumer": { | ||
| "firstName": "John", | ||
| "lastName": "Doe", | ||
| "dateOfBirth": "1985-06-15", | ||
| "ssn": "XXX-XX-1234", | ||
| "addresses": [ | ||
| { | ||
| "type": "current", | ||
| "street": "123 Main Street", | ||
| "city": "Springfield", | ||
| "state": "IL", | ||
| "zipCode": "62701", | ||
| "since": "2020-03-01" | ||
| }, | ||
| { | ||
| "type": "previous", | ||
| "street": "456 Oak Avenue", | ||
| "city": "Chicago", | ||
| "state": "IL", | ||
| "zipCode": "60601", | ||
| "since": "2015-08-15" | ||
| } | ||
| ], | ||
| "employment": { | ||
| "employer": "Acme Corporation", | ||
| "position": "Software Engineer", | ||
| "income": 95000, | ||
| "since": "2019-01-15" | ||
| } | ||
| }, | ||
| "creditScore": { | ||
| "value": 742, | ||
| "model": "FICO8", | ||
| "range": {"min": 300, "max": 850}, | ||
| "factors": [ | ||
| {"code": "01", "description": "Length of time accounts have been established"}, | ||
| {"code": "14", "description": "Number of accounts with delinquency"}, | ||
| {"code": "07", "description": "Too many inquiries last 12 months"} | ||
| ] | ||
| }, | ||
| "accounts": [ | ||
| { | ||
| "accountNumber": "XXXX-XXXX-XXXX-4567", | ||
| "creditor": "First National Bank", | ||
| "type": "creditCard", | ||
| "status": "open", | ||
| "openDate": "2018-05-20", | ||
| "creditLimit": 15000, | ||
| "balance": 3250, | ||
| "monthlyPayment": 150, | ||
| "paymentHistory": ["OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK"] | ||
| }, | ||
| { | ||
| "accountNumber": "LOAN-789012", | ||
| "creditor": "Auto Finance LLC", | ||
| "type": "autoLoan", | ||
| "status": "open", | ||
| "openDate": "2022-02-10", | ||
| "originalAmount": 28000, | ||
| "balance": 18500, | ||
| "monthlyPayment": 485, | ||
| "paymentHistory": ["OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK"] | ||
| }, | ||
| { | ||
| "accountNumber": "MTG-555666777", | ||
| "creditor": "Home Mortgage Corp", | ||
| "type": "mortgage", | ||
| "status": "open", | ||
| "openDate": "2020-03-15", | ||
| "originalAmount": 320000, | ||
| "balance": 295000, | ||
| "monthlyPayment": 1850, | ||
| "paymentHistory": ["OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK","OK"] | ||
| } | ||
| ], | ||
| "inquiries": [ | ||
| {"date": "2024-01-10", "creditor": "Capital One", "type": "hard"}, | ||
| {"date": "2023-11-05", "creditor": "Chase Bank", "type": "hard"}, | ||
| {"date": "2023-09-20", "creditor": "American Express", "type": "soft"} | ||
| ], | ||
| "publicRecords": [], | ||
| "collections": [], | ||
| "metadata": { | ||
| "version": "2.1.0", | ||
| "provider": "TestProvider", | ||
| "requestId": "REQ-2024-ABCDEF123456", | ||
| "processingTimeMs": 245 | ||
| } | ||
| }`) | ||
|
|
||
| query := ` | ||
| INSERT INTO credit_data_order_v2 (id, organization_id, order_id, account_review, full_report, raw_report, created_at, updated_at) | ||
| VALUES ($1, $2, $3, $4, $5, $6, $7, $8) | ||
| ` | ||
|
|
||
| var completed int64 | ||
| var wg sync.WaitGroup | ||
| jobs := make(chan int, *count) | ||
|
|
||
| // Start workers | ||
| for w := 0; w < *workers; w++ { | ||
| wg.Add(1) | ||
| go func() { | ||
| defer wg.Done() | ||
| for range jobs { | ||
| id := uuid.New().String() | ||
| orderID := uuid.New().String() | ||
| now := time.Now() | ||
|
|
||
| tx, err := pool.Begin(ctx) | ||
| if err != nil { | ||
| log.Printf("Insert failed to begin tx: %v", err) | ||
| atomic.AddInt64(&completed, 1) | ||
| continue | ||
| } | ||
|
|
||
| // NOTE: The original repro issued `SET CIPHERSTASH.KEYSET_NAME` | ||
| // here to mirror repo.setKeysetNameOnTx. For the passthrough | ||
| // (no-encryption) memory test that SET is omitted: this proxy has | ||
| // a default keyset configured, which rejects the SET, and it is | ||
| // irrelevant to the per-statement type-check churn we measure. | ||
|
|
||
| _, err = tx.Exec(ctx, query, | ||
| id, | ||
| orgID, | ||
| orderID, | ||
| false, | ||
| testJSON, | ||
| testJSON, | ||
| now, | ||
| now, | ||
| ) | ||
| if err != nil { | ||
| tx.Rollback(ctx) | ||
| log.Printf("Insert failed: %v", err) | ||
| atomic.AddInt64(&completed, 1) | ||
| continue | ||
| } | ||
|
|
||
| if err := tx.Commit(ctx); err != nil { | ||
| log.Printf("Insert failed to commit: %v", err) | ||
| atomic.AddInt64(&completed, 1) | ||
| continue | ||
|
Comment on lines
+155
to
+188
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fail the run when a worker hits a database error. Lines 156-188 currently count begin/exec/commit failures as completed work and still let 🤖 Prompt for AI Agents |
||
| } | ||
|
Comment on lines
+185
to
+189
Comment on lines
+185
to
+189
|
||
|
|
||
| c := atomic.AddInt64(&completed, 1) | ||
| if c%100 == 0 || c == int64(*count) { | ||
| log.Printf("Inserted %d/%d records", c, *count) | ||
| } | ||
| } | ||
| }() | ||
| } | ||
|
|
||
| // Send jobs | ||
| for i := 0; i < *count; i++ { | ||
| jobs <- i | ||
| } | ||
| close(jobs) | ||
|
|
||
| // Wait for all workers to finish | ||
| wg.Wait() | ||
|
|
||
| fmt.Println("Done!") | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| // Tests the one unbounded code path identified in the proxy: named prepared | ||
| // statements are only removed from the per-connection statements map on an | ||
| // explicit Close/Deallocate. A client that prepares many uniquely-named | ||
| // statements on a single long-lived connection (and never deallocates) makes | ||
| // that map grow without bound. | ||
| // | ||
| // This uses a SINGLE connection and prepares -count uniquely-named statements. | ||
| package main | ||
|
|
||
| import ( | ||
| "context" | ||
| "flag" | ||
| "fmt" | ||
| "log" | ||
|
|
||
| "github.com/jackc/pgx/v5" | ||
| ) | ||
|
|
||
| func main() { | ||
| dsn := flag.String("dsn", "postgres://cipherstash:password@proxy:6432/cipherstash?sslmode=disable", "DSN") | ||
| count := flag.Int("count", 100000, "number of uniquely-named prepared statements") | ||
| flag.Parse() | ||
|
|
||
| ctx := context.Background() | ||
| conn, err := pgx.Connect(ctx, *dsn) | ||
| if err != nil { | ||
| log.Fatal("connect:", err) | ||
| } | ||
| defer conn.Close(ctx) | ||
|
|
||
| for i := 0; i < *count; i++ { | ||
| name := fmt.Sprintf("stmt_%d", i) | ||
| // Unique name, same SQL. Never deallocated -> proxy retains one entry per name. | ||
| _, err := conn.Prepare(ctx, name, "SELECT id, full_report FROM credit_data_order_v2 WHERE id = $1") | ||
| if err != nil { | ||
| log.Fatalf("prepare %d: %v", i, err) | ||
| } | ||
| if (i+1)%10000 == 0 { | ||
| log.Printf("prepared %d/%d", i+1, *count) | ||
| } | ||
| } | ||
| log.Printf("Done: %d uniquely-named prepared statements on one connection", *count) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate
-countand-workersas positive values.Right now Line 143 will panic for a negative
-count, and-workers=0falls through to a successfulDone!without inserting anything. This harness should fail fast on invalid inputs instead of producing misleading no-op runs.Proposed fix
flag.Parse() if *dsn == "" { log.Fatal("DSN is required. Use -dsn flag to provide database connection string") } + if *count <= 0 { + log.Fatal("-count must be > 0") + } + if *workers <= 0 { + log.Fatal("-workers must be > 0") + }📝 Committable suggestion
🤖 Prompt for AI Agents