Fix C client SSL test hang on long-hostname runners (cap cert CN at 64 chars)#147

Open
arkmish wants to merge 1 commit into branch-3.6 from armishra/fix-hanging-tests

@arkmish arkmish commented Jun 5, 2026

What this changes

One line in zookeeper-client/zookeeper-client-c/ssl/gencerts.sh: cap the certificate CommonName at 64 characters.

Root cause

gencerts.sh derives the X.509 CommonName from hostname -f. On runners with a long FQDN — e.g. Kubernetes pods / self-hosted CI runners whose names look like rdev-aks-…-cq4mx.corp.rdev.svc.cluster.local (~80 chars) — the CN exceeds the 64-character ASN.1 limit, so every openssl req fails with string too long:maxsize=64 and no certificates are generated.

The Java test server then silently skips its secure port (22281 never listens), and Zookeeper_simpleSystem::testSSL (TestClient.cc) issues a synchronous zoo_create against 127.0.0.1:22281 whose session can never establish. Synchronous C-client ops have no per-call timeout, so the IO thread spins on connect-refused and the main thread blocks forever in wait_sync_completion. This hangs the full-build-java-tests and full-build-cppunit-tests jobs, which occupy their runners until the 6h job timeout.

This is why it reproduced only on long-FQDN runners and not on upstream CI (short hostnames), and why it had nothing to do with the C client itself.

Fix

Truncate the CN-deriving FQDN to 64 characters (FQDN=${FQDN:0:64}). The CN isn't used for hostname verification in these tests (the client connects to 127.0.0.1), so any valid ≤64-char value is sufficient.
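For illustration, the truncation can be sketched in isolation as below. This is a minimal sketch, not the actual gencerts.sh code: the `cap_cn` helper, the `MAX_CN_LEN` variable, and the sample hostname are hypothetical, and only the `${FQDN:0:64}` substring expansion comes from the change itself. It assumes bash, since substring expansion is not POSIX sh.

```shell
#!/usr/bin/env bash
# Sketch only: cap_cn, MAX_CN_LEN and SAMPLE_FQDN are hypothetical names
# used for illustration; they are not taken from gencerts.sh.

MAX_CN_LEN=64   # upper bound on the X.509 CommonName length

cap_cn() {
    # Print at most the first MAX_CN_LEN characters of $1,
    # using bash substring expansion (${var:offset:length}).
    local name=$1
    printf '%s\n' "${name:0:$MAX_CN_LEN}"
}

# Made-up long FQDN standing in for `hostname -f` on a long-hostname runner.
SAMPLE_FQDN="runner-$(printf 'a%.0s' {1..60}).svc.cluster.local"

FQDN=$(cap_cn "$SAMPLE_FQDN")
echo "original length: ${#SAMPLE_FQDN}, capped length: ${#FQDN}"
```

Names that are already 64 characters or shorter pass through unchanged, so short-hostname environments such as upstream CI should see no difference.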

Testing Done

  • Diagnosed via a gdb thread dump on a long-FQDN Linux runner: main thread blocked in wait_sync_completion, IO thread spinning on connect-refused to 127.0.0.1:22281; /tmp/certs/gencerts.stderr showed the string too long:maxsize=64 openssl failure and ss -ltn confirmed 22281 was not listening.
  • Verified locally: the original 80-char CN reproduces the exact string too long:maxsize=64 error; with the truncation, the full gencerts.sh run completes and generates root.crt / server.crt / client.crt + keystores.
  • Verified on CI: with the fix, both full-build-cppunit-tests (8m22s) and full-build-java-tests jobs completed green — the exact jobs that previously hung for hours.
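A quick way to check the symptom from the first bullet without `ss` is to attempt a bounded TCP connect to the secure port. The following is a rough sketch under stated assumptions: `port_is_listening` is a hypothetical helper, not part of the ZooKeeper test scripts, and it relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command being available.

```shell
#!/usr/bin/env bash
# Sketch only: port_is_listening is a hypothetical helper for illustration.

port_is_listening() {
    local host=$1 port=$2
    # Opening /dev/tcp/<host>/<port> makes bash attempt a TCP connect;
    # `timeout` bounds the wait so the check itself cannot hang.
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

if port_is_listening 127.0.0.1 22281; then
    echo "secure port 22281 is listening"
else
    echo "secure port 22281 is NOT listening; check /tmp/certs/gencerts.stderr"
fi
```

If the port is not listening, the gencerts.stderr file mentioned above is the first place to look for the `string too long:maxsize=64` failure.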

🤖 Generated with Claude Code

…4 chars)

Root cause (confirmed via gdb thread dump on a long-FQDN Linux runner):

gencerts.sh derives the X.509 CommonName from `hostname -f`. On hosts with a
long FQDN -- e.g. Kubernetes pods / self-hosted CI runners whose names look like
rdev-aks-...-cq4mx.corp.rdev.svc.cluster.local (~80 chars) -- the CN exceeds the
64-character ASN.1 limit, so every `openssl req` fails with
"string too long:maxsize=64" and NO certificates are produced. The Java test
server then silently skips its secure port (22281 never listens), and
Zookeeper_simpleSystem::testSSL (TestClient.cc) issues a synchronous zoo_create
against 127.0.0.1:22281 whose session can never establish. Synchronous ops on
such a handle have no per-call timeout, so the IO thread spins on connect-refused
and the main thread blocks forever in wait_sync_completion -- hanging the entire
full-build-java-tests / full-build-cppunit-tests jobs.

This is why the hang only reproduced on long-FQDN runners and not on upstream CI
(short hostnames), and why it had nothing to do with the C client itself.

Fix: truncate the CN-deriving FQDN to 64 characters in gencerts.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@arkmish arkmish requested review from Sanju98 and laxman-ch June 5, 2026 02:00
@Sanju98 Sanju98 left a comment

LGTM
