Fix C client SSL test hang on long-hostname runners (cap cert CN at 64 chars) #147
Open
arkmish wants to merge 1 commit into
…4 chars)

Root cause (confirmed via gdb thread dump on a long-FQDN Linux runner): gencerts.sh derives the X.509 CommonName from `hostname -f`. On hosts with a long FQDN -- e.g. Kubernetes pods / self-hosted CI runners whose names look like rdev-aks-...-cq4mx.corp.rdev.svc.cluster.local (~80 chars) -- the CN exceeds the 64-character ASN.1 limit, so every `openssl req` fails with "string too long:maxsize=64" and NO certificates are produced.

The Java test server then silently skips its secure port (22281 never listens), and Zookeeper_simpleSystem::testSSL (TestClient.cc) issues a synchronous zoo_create against 127.0.0.1:22281 whose session can never establish. Synchronous ops on such a handle have no per-call timeout, so the IO thread spins on connect-refused and the main thread blocks forever in wait_sync_completion -- hanging the entire full-build-java-tests / full-build-cppunit-tests cppunit run.

This is why the hang only reproduced on long-FQDN runners and not on upstream CI (short hostnames), and why it had nothing to do with the C client itself.

Fix: truncate the CN-deriving FQDN to 64 characters in gencerts.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What this changes
One line in `zookeeper-client/zookeeper-client-c/ssl/gencerts.sh`: cap the certificate CommonName at 64 characters.
Root cause
`gencerts.sh` derives the X.509 CommonName from `hostname -f`. On runners with a long FQDN (e.g. Kubernetes pods / self-hosted CI runners whose names look like `rdev-aks-…-cq4mx.corp.rdev.svc.cluster.local`, ~80 chars) the CN exceeds the 64-character ASN.1 limit, so every `openssl req` fails with `string too long:maxsize=64` and no certificates are generated.

The Java test server then silently skips its secure port (22281 never listens), and `Zookeeper_simpleSystem::testSSL` (`TestClient.cc`) issues a synchronous `zoo_create` against `127.0.0.1:22281` whose session can never establish. Synchronous C-client ops have no per-call timeout, so the IO thread spins on connect-refused and the main thread blocks forever in `wait_sync_completion`. That hangs the `full-build-java-tests` and `full-build-cppunit-tests` jobs, which then occupy their runners until the 6h timeout.

This is why it reproduced only on long-FQDN runners and not on upstream CI (short hostnames), and why it had nothing to do with the C client itself.
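The length condition behind this failure can be checked without running `openssl` at all. The following is a minimal sketch, not code from `gencerts.sh`: the helper name `cn_fits`, the `CN_MAX` variable, and the example hostname are all hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gencerts.sh): succeed only if a name
# is short enough to be used as an X.509 CommonName (at most 64 chars).
CN_MAX=64

cn_fits() {
    local name="$1"
    [ "${#name}" -le "$CN_MAX" ]
}

# Made-up name with the same general shape as a long runner FQDN.
example="ci-runner-0123456789abcdef0123456789abcdef-pool-7.build.example.svc.cluster.local"

if cn_fits "$example"; then
    echo "ok: ${#example} chars"
else
    echo "too long: ${#example} chars (limit ${CN_MAX})"
fi
```

A guard of this kind fails fast and visibly, whereas the hang described above only surfaces much later, once the secure port has silently failed to open.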
Fix
Truncate the FQDN from which the CN is derived to 64 characters (`FQDN=${FQDN:0:64}`). The CN isn't used for hostname verification in these tests (the client connects to `127.0.0.1`), so any valid ≤64-char value is sufficient.
Testing Done
- `gdb` thread dump on a long-FQDN Linux runner: main thread blocked in `wait_sync_completion`, IO thread spinning on connect-refused to `127.0.0.1:22281`; `/tmp/certs/gencerts.stderr` showed the `string too long:maxsize=64` openssl failure and `ss -ltn` confirmed 22281 was not listening.
- Without the truncation, `openssl req` fails with the `string too long:maxsize=64` error; with the truncation, the full `gencerts.sh` run completes and generates `root.crt` / `server.crt` / `client.crt` + keystores.
- `full-build-cppunit-tests` (8m22s) and `full-build-java-tests` jobs completed green, the exact jobs that previously hung for hours.

🤖 Generated with Claude Code
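The before/after behaviour of the truncation can also be approximated on a short-hostname machine by feeding in a simulated long name. This is a sketch under assumptions: `cap_cn` is a hypothetical wrapper around the same `${VAR:0:64}` expansion the fix uses, and the long name is invented. It relies on bash substring expansion, so it needs bash rather than a plain POSIX `sh`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the expansion used by the fix; the real
# script applies FQDN=${FQDN:0:64} directly.
cap_cn() {
    local fqdn="$1"
    # Bash substring expansion: keep at most the first 64 characters.
    printf '%s\n' "${fqdn:0:64}"
}

# Invented stand-in for a long runner FQDN (not a real host).
long_name="ci-runner-0123456789abcdef0123456789abcdef-pool-7.build.example.svc.cluster.local"

capped="$(cap_cn "$long_name")"
echo "before: ${#long_name} chars"
echo "after:  ${#capped} chars"
```

Truncating at a fixed offset can cut a name mid-label or leave a trailing dot. That is acceptable here only because, as noted under Fix, the CN is not used for hostname verification in these tests.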