Skip to content

Retry terraform init on transient 502/503 errors#1840

Open
HuyPhanNguyen wants to merge 6 commits intomainfrom
huy/hpy-1283-terraform-init-retry-502
Open

Retry terraform init on transient 502/503 errors#1840
HuyPhanNguyen wants to merge 6 commits intomainfrom
huy/hpy-1283-terraform-init-retry-502

Conversation

@HuyPhanNguyen
Copy link
Copy Markdown
Contributor

@HuyPhanNguyen HuyPhanNguyen commented Mar 23, 2026

Background

terraform init frequently fails with 502 Bad Gateway when downloading providers from GitHub (e.g., dnsimple/dnsimple). This is a transient CDN/network error that resolves on retry, but currently Calamari runs terraform init exactly once and fails immediately on any non-zero exit code.

Reported by Matt Richardson — the ReleaseBot project on deploy.octopus.app hits this "fairly frequently."

Results

terraform init now retries up to 3 times with exponential backoff (5s, 10s, 20s) when a transient HTTP error (502/503) is detected in stderr. Non-transient errors still fail immediately.

fixes HPY-1283

Before

terraform init fails on first 502 Bad Gateway. Deployment fails with a stack trace. User must manually re-deploy.

After

terraform init detects the transient 502/503 in stderr, logs a warning, backs off, and retries. If the error clears (which it usually does), init succeeds and the deployment continues. If all retries are exhausted, the original error is thrown as before.

@HuyPhanNguyen HuyPhanNguyen force-pushed the huy/hpy-1283-terraform-init-retry-502 branch from 30f05d0 to 65a9b06 Compare March 23, 2026 06:57
@HuyPhanNguyen HuyPhanNguyen marked this pull request as ready for review March 23, 2026 07:04
@wlthomson wlthomson requested a review from a team March 24, 2026 00:05
@hnrkndrssn hnrkndrssn requested a review from Copilot March 25, 2026 21:09
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces retry logic for terraform init to mitigate transient 502/503 provider-download failures, and also refactors ArgoCD manifest update/staging behavior (including path separator handling and purge/removal support).

Changes:

  • Retry terraform init up to 3 times on detected transient 502/503 errors.
  • Refactor ArgoCD manifest templating/update flow (new updater classes) and introduce explicit file removal staging.
  • Normalize/report ArgoCD file paths using POSIX separators in service messages and tests.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
source/Calamari.Terraform/TerraformCliExecutor.cs Adds transient-error detection + retry/backoff around terraform init.
source/Calamari.Terraform.Tests/TerraformCliExecutorFixture.cs Adds unit tests for transient error matching and retry behavior.
source/Calamari/ArgoCD/Git/RepositoryWrapper.cs Replaces staging APIs and tweaks path normalization used for git index operations.
source/Calamari/ArgoCD/Git/RepositoryAdapter.cs Uses new staging APIs and supports staging deletions; changes “has changes” detection.
source/Calamari/ArgoCD/Conventions/UpdateImageTag/FileUpdateResult.cs Adds FilesRemoved and HasChanges() to represent deletions.
source/Calamari/ArgoCD/Conventions/UpdateImageTag/BaseUpdater.cs Stops forcing POSIX paths during patch generation; now relies on later normalization.
source/Calamari/ArgoCD/Conventions/UpdateImageTag/AbstractHelmUpdater.cs Same path handling adjustment + new FileUpdateResult shape.
source/Calamari/ArgoCD/Conventions/UpdateImageTag/DirectoryUpdater.cs Updates FileUpdateResult construction for new signature.
source/Calamari/ArgoCD/Conventions/UpdateImageTag/HelmUpdater.cs Updates FileUpdateResult construction for new signature.
source/Calamari/ArgoCD/Conventions/UpdateImageTag/KustomizeUpdater.cs Updates FileUpdateResult construction for new signature.
source/Calamari/ArgoCD/Conventions/UpdateImageTag/BaseUpdater.cs Removes earlier Windows path-to-POSIX conversion for patches.
source/Calamari/ArgoCD/Conventions/UpdateArgoCDApplicationManifestsInstallConvention.cs Extracts application/source processing into new ManifestTemplating classes.
source/Calamari/ArgoCD/Conventions/ManifestTemplating/ManifestUpdateResult.cs Adds standalone ManifestUpdateResult record.
source/Calamari/ArgoCD/Conventions/ManifestTemplating/CopyTemplatesSourceUpdater.cs New source updater for copying templates + purge/delete tracking.
source/Calamari/ArgoCD/Conventions/ManifestTemplating/ApplicationUpdater.cs New coordinator for application-level update flow.
source/Calamari/ArgoCD/Conventions/ManifestTemplating/ApplicationSourceUpdater.cs New per-source update + push + output variable wiring.
source/Calamari/ArgoCD/ArgoCDFilesUpdatedReporter.cs Converts reported file paths to POSIX before service message serialization.
source/Calamari.Tests/ArgoCD/Git/RepositoryWrapperTest.cs Updates for new staging APIs and adds path separator coverage.
source/Calamari.Tests/ArgoCD/Git/RepositoryExtensionMethods.cs Normalizes paths for git tree lookups/adds in tests.
source/Calamari.Tests/ArgoCD/Commands/Conventions/UpdateArgoCDApplicationManifestsInstallConventionTests.cs Adjusts purge expectations and asserts directory removal via git semantics.
source/Calamari.Tests/ArgoCD/Commands/Conventions/UpdateArgoCDAppImagesInstallConventionTest.cs Adjusts expected paths for OS-specific separators.
source/Calamari.Tests/ArgoCD/Commands/Conventions/UpdateArgoCDAppImagesInstallConventionHelmTests.cs Adjusts expected paths for OS-specific separators.
source/Calamari.Tests/ArgoCD/ArgoCDFilesUpdatedReporterTests.cs Adds coverage for converting OS-specific paths to POSIX in service messages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

{
public bool HasChanges()
{
return PatchedFileContent.Count > 0 || FilesRemoved.Length > 0;
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HasChanges() assumes FilesRemoved is non-null and will throw if a caller constructs FileUpdateResult with a null FilesRemoved array. Since RepositoryAdapter already treats FilesRemoved as nullable (result.FilesRemoved ?? []), consider making HasChanges null-safe as well (e.g., treat null as empty).

Suggested change
return PatchedFileContent.Count > 0 || FilesRemoved.Length > 0;
return PatchedFileContent.Count > 0 || (FilesRemoved?.Length ?? 0) > 0;

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +2
using System;
using System.Collections.Generic;
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has an unused using System; directive, which will produce CS8019. Since this repository has projects with TreatWarningsAsErrors enabled, this can fail the build. Please remove the unused using (or use the type if it was intended).

Copilot uses AI. Check for mistakes.
return new ManifestUpdateResult(true, sourceUpdateResult.PushResult.CommitSha, sourceUpdateResult.PatchedFiles);
}

log.Info("No changes were commited");
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in log message: "commited" should be "committed".

Suggested change
log.Info("No changes were commited");
log.Info("No changes were committed");

Copilot uses AI. Check for mistakes.
Comment on lines +224 to +236
var retry = new RetryTracker(maxRetries: 3, timeLimit: null, new LimitedExponentialRetryInterval(2000, 30000, 2));
while (retry.Try())
{
var commandResult = ExecuteCommandInternal(new[] { initCommand }, out _, true);
if (commandResult.ExitCode == 0)
return;

if (IsTransientInitError(commandResult.Errors) && retry.CanRetry())
{
log.Warn($"Terraform init failed with a transient error (attempt {retry.CurrentTry} of 4). Retrying after backoff...");
Thread.Sleep(retry.Sleep());
continue;
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This retry loop uses Thread.Sleep(retry.Sleep()), which means callers block for real time on every transient failure. It also makes the new unit tests (e.g., retries exhausted) take tens of seconds (currently ~4s + 8s + 16s) and likely become slow/flaky in CI. Consider injecting a delay/sleeper (or a RetryInterval) so tests can use a no-op/zero delay while production uses real backoff.

Copilot uses AI. Check for mistakes.
Comment on lines +224 to +235
var retry = new RetryTracker(maxRetries: 3, timeLimit: null, new LimitedExponentialRetryInterval(2000, 30000, 2));
while (retry.Try())
{
var commandResult = ExecuteCommandInternal(new[] { initCommand }, out _, true);
if (commandResult.ExitCode == 0)
return;

if (IsTransientInitError(commandResult.Errors) && retry.CanRetry())
{
log.Warn($"Terraform init failed with a transient error (attempt {retry.CurrentTry} of 4). Retrying after backoff...");
Thread.Sleep(retry.Sleep());
continue;
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backoff settings here don't match the PR description (5s, 10s, 20s). With LimitedExponentialRetryInterval(2000, 30000, 2) and RetryTracker.Sleep() using the current try number, the sleeps are 4s, 8s, 16s. Either update the interval parameters to achieve the intended delays, or adjust the PR description/log messaging accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +79 to +82
var excludedPaths = foldersExcludedFromPurge.Select(excludedFolder => Path.Combine(outputDirectory, excludedFolder)).ToList();
var filesToRemove = fileSystem.EnumerateFilesRecursively(outputDirectory)
.Where(f => !excludedPaths.Any(f.StartsWith))
.ToArray();
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purge exclusion logic !excludedPaths.Any(f.StartsWith) will also exclude files like .gitignore or .gitmodules because their full path starts with .../.git. That means a purge of the repo root won't actually remove those files. Consider checking for an exact .git directory match (e.g., path == excluded || path.StartsWith(excluded + separator)) and use an ordinal (and on Windows, case-insensitive) comparison.

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +88
var normalized = path.Replace(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
return normalized.StartsWith($".{Path.AltDirectorySeparatorChar}") ? normalized.Substring(2) : normalized;
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NormalizePath only replaces Path.DirectorySeparatorChar with Path.AltDirectorySeparatorChar. On non-Windows platforms both are '/', so any incoming paths containing backslashes (e.g., from external inputs) won't be normalized and staging/removal may fail. Consider normalizing both separators explicitly (e.g., replace '\' with '/') and then trimming any leading "./".

Suggested change
var normalized = path.Replace(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
return normalized.StartsWith($".{Path.AltDirectorySeparatorChar}") ? normalized.Substring(2) : normalized;
if (string.IsNullOrEmpty(path))
return path;
// Always normalize to forward slashes so paths are consistent across platforms
var normalized = path.Replace('\\', '/');
// Trim a leading "./" to match repository-relative paths expected by LibGit2Sharp
return normalized.StartsWith("./", StringComparison.Ordinal)
? normalized.Substring(2)
: normalized;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@hnrkndrssn hnrkndrssn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants