[Bug]: Kafka container fails with 'Container exited with code 126' due to a race between command wait loop and script installation #11682

@teqwve

Description

Module

Kafka

Testcontainers version

1.21.4

Using the latest Testcontainers version?

Yes

Host OS

Linux

Host Arch

x86-64

Docker version

Client:
 Version:           20.10.24+dfsg1
 API version:       1.41
 Go version:        go1.19.8
 Git commit:        297e128
 Built:             Sat Jan  3 00:46:39 2026
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.24+dfsg1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.8
  Git commit:       5d6db84
  Built:            Sat Jan  3 00:46:39 2026
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.20~ds1
  GitCommit:        1.6.20~ds1-1+deb12u2
 runc:
  Version:          1.1.5+ds1
  GitCommit:        1.1.5+ds1-1+deb12u1
 docker-init:
  Version:          0.19.0
  GitCommit:

What happened?

During some CI test runs, Kafka container startup fails with 'Wait strategy failed. Container exited with code 126'. The tests run in parallel on a relatively small virtual machine, which can experience high load during the tests.

The cause seems to be a race between the installation of the startup script and the wait loop in the container command. The command is set to while [ ! -f /tmp/testcontainers_start.sh ]; do sleep 0.1; done; /tmp/testcontainers_start.sh, so the loop only checks that the file exists. The command can therefore try to execute the script before it is fully written (or possibly before its permissions are set correctly, although I'm unsure whether that can happen, as I did not manage to reproduce that exact scenario), which results in exit code 126.

This can be reproduced by removing sleep 0.1 from the loop. When the container is created like this:

KafkaContainer kafkaContainer = new KafkaContainer("apache/kafka:3.8.0")
    .withCommand("sh", "-c", "while [ ! -f /tmp/testcontainers_start.sh ]; do true; done; /tmp/testcontainers_start.sh");

container startup repeatedly fails with:

2026-04-10 21:48:06,243 [main] ERROR tc.apache/kafka:3.8.0 - Could not start container
java.lang.IllegalStateException: Wait strategy failed. Container exited with code 126
	at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:525)
	at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:346)
	at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
	at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:336)
	at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:322)
(...)
Caused by: org.testcontainers.containers.ContainerLaunchException: Timed out waiting for log output matching '.*Transitioning from RECOVERY to RUNNING.*'
	at org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy.waitUntilReady(LogMessageWaitStrategy.java:47)
	at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:52)
	at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:909)
	at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:492)
	... 58 common frames omitted
2026-04-10 21:48:06,254 [main] ERROR tc.apache/kafka:3.8.0 - Log output from the failed container:
sh: /tmp/testcontainers_start.sh: Text file busy

and it seems likely that on busy hosts the sleep 0.1 version can hit the same error from time to time.
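To illustrate why an existence-only check is not enough, here is a minimal, self-contained sketch that runs on the local filesystem rather than inside a container. The class name ExistenceCheckDemo and the temporary directory are made up for illustration and are not part of Testcontainers. It shows that a file already "exists" while the writer still holds it open and has written nothing. On Linux, executing a file that another process still has open for writing fails with 'Text file busy', which matches the log line above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Testcontainers.
class ExistenceCheckDemo {

    /**
     * Opens a file for writing and, before any byte is written, performs the
     * same kind of check as `[ -f file ]`. Returns the size observed at that moment.
     */
    static long sizeSeenByExistenceCheck() {
        try {
            Path dir = Files.createTempDirectory("race-demo");
            Path script = dir.resolve("testcontainers_start.sh");
            long observedSize;
            try (OutputStream out = Files.newOutputStream(script)) {
                // The file was created when the stream was opened, so an
                // existence check already succeeds while the writer is still busy.
                if (!Files.exists(script)) {
                    throw new IllegalStateException("file should exist once it is opened for writing");
                }
                observedSize = Files.size(script);
            }
            // Clean up the temporary files.
            Files.delete(script);
            Files.delete(dir);
            return observedSize;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("size seen while writer is still open: " + sizeSeenByExistenceCheck());
    }
}
```

The existence check passes although the script is still empty and open for writing, which is the window the wait loop can fall into.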

An easy fix is to run a separate 'notify command' step after the script is fully copied. With the following changes:

// Nested helper class; the class name now matches its constructor.
private final class FixedKafkaContainer extends KafkaContainer {
    public FixedKafkaContainer(String imageName) {
        super(imageName);
        // Wait for the marker file instead of the script itself.
        // The busy loop (no sleep) is kept on purpose to provoke the race.
        withCommand("sh", "-c", "while [ ! -f /tmp/startup-done ]; do true; done; /tmp/testcontainers_start.sh");
    }

    @Override
    protected void containerIsStarting(InspectContainerResponse containerInfo) {
        super.containerIsStarting(containerInfo); // copy the script
        try {
            execInContainer("sh", "-c", "touch /tmp/startup-done"); // notify command *after* the copy has finished
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

the container starts repeatedly without any errors.
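The same handshake can be sketched outside of Docker. The following is only an illustration of the marker-file idea on the local filesystem, not the Testcontainers implementation. The class name MarkerFileHandshake, the file names, the two-step write, and the 5 second timeout are all assumptions made for the example. A writer thread writes the script in two steps and creates the marker only afterwards; the waiter polls for the marker, not for the script.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Testcontainers.
class MarkerFileHandshake {

    static final String SCRIPT = "#!/bin/sh\necho started\n";

    /** Returns the script content seen by a waiter that polls for the marker file. */
    static String readAfterMarker() {
        try {
            Path dir = Files.createTempDirectory("marker-demo");
            Path script = dir.resolve("start.sh");
            Path marker = dir.resolve("startup-done");

            Thread writer = new Thread(() -> {
                try {
                    // Write the script in two steps to mimic a slow copy.
                    Files.writeString(script, SCRIPT.substring(0, 10));
                    Thread.sleep(50);
                    Files.writeString(script, SCRIPT.substring(10), StandardOpenOption.APPEND);
                    // Publish the marker only after the script is complete.
                    Files.createFile(marker);
                } catch (IOException | InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            });
            writer.start();

            // Poll for the marker, with a deadline so a failed writer cannot hang the waiter.
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (!Files.exists(marker)) {
                if (System.nanoTime() > deadline) {
                    throw new IllegalStateException("timed out waiting for marker file");
                }
                Thread.sleep(10);
            }

            String seen = Files.readString(script);
            writer.join();

            // Clean up the temporary files.
            Files.delete(marker);
            Files.delete(script);
            Files.delete(dir);
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(readAfterMarker());
    }
}
```

Because the marker is created only after the last write, a waiter that sees the marker always sees the complete script, no matter how slowly the copy runs.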

If you think that this is a good solution I'd be more than happy to create a PR.

Relevant log output

Additional Information

No response
