Retry terraform init on transient 502/503 errors#1840
Retry terraform init on transient 502/503 errors#1840HuyPhanNguyen wants to merge 6 commits intomainfrom
Conversation
30f05d0 to
65a9b06
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces retry logic for terraform init to mitigate transient 502/503 provider-download failures, and also refactors ArgoCD manifest update/staging behavior (including path separator handling and purge/removal support).
Changes:
- Retry
terraform initup to 3 times on detected transient 502/503 errors. - Refactor ArgoCD manifest templating/update flow (new updater classes) and introduce explicit file removal staging.
- Normalize/report ArgoCD file paths using POSIX separators in service messages and tests.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| source/Calamari.Terraform/TerraformCliExecutor.cs | Adds transient-error detection + retry/backoff around terraform init. |
| source/Calamari.Terraform.Tests/TerraformCliExecutorFixture.cs | Adds unit tests for transient error matching and retry behavior. |
| source/Calamari/ArgoCD/Git/RepositoryWrapper.cs | Replaces staging APIs and tweaks path normalization used for git index operations. |
| source/Calamari/ArgoCD/Git/RepositoryAdapter.cs | Uses new staging APIs and supports staging deletions; changes “has changes” detection. |
| source/Calamari/ArgoCD/Conventions/UpdateImageTag/FileUpdateResult.cs | Adds FilesRemoved and HasChanges() to represent deletions. |
| source/Calamari/ArgoCD/Conventions/UpdateImageTag/BaseUpdater.cs | Stops forcing POSIX paths during patch generation; now relies on later normalization. |
| source/Calamari/ArgoCD/Conventions/UpdateImageTag/AbstractHelmUpdater.cs | Same path handling adjustment + new FileUpdateResult shape. |
| source/Calamari/ArgoCD/Conventions/UpdateImageTag/DirectoryUpdater.cs | Updates FileUpdateResult construction for new signature. |
| source/Calamari/ArgoCD/Conventions/UpdateImageTag/HelmUpdater.cs | Updates FileUpdateResult construction for new signature. |
| source/Calamari/ArgoCD/Conventions/UpdateImageTag/KustomizeUpdater.cs | Updates FileUpdateResult construction for new signature. |
| source/Calamari/ArgoCD/Conventions/UpdateImageTag/BaseUpdater.cs | Removes earlier Windows path-to-POSIX conversion for patches. |
| source/Calamari/ArgoCD/Conventions/UpdateArgoCDApplicationManifestsInstallConvention.cs | Extracts application/source processing into new ManifestTemplating classes. |
| source/Calamari/ArgoCD/Conventions/ManifestTemplating/ManifestUpdateResult.cs | Adds standalone ManifestUpdateResult record. |
| source/Calamari/ArgoCD/Conventions/ManifestTemplating/CopyTemplatesSourceUpdater.cs | New source updater for copying templates + purge/delete tracking. |
| source/Calamari/ArgoCD/Conventions/ManifestTemplating/ApplicationUpdater.cs | New coordinator for application-level update flow. |
| source/Calamari/ArgoCD/Conventions/ManifestTemplating/ApplicationSourceUpdater.cs | New per-source update + push + output variable wiring. |
| source/Calamari/ArgoCD/ArgoCDFilesUpdatedReporter.cs | Converts reported file paths to POSIX before service message serialization. |
| source/Calamari.Tests/ArgoCD/Git/RepositoryWrapperTest.cs | Updates for new staging APIs and adds path separator coverage. |
| source/Calamari.Tests/ArgoCD/Git/RepositoryExtensionMethods.cs | Normalizes paths for git tree lookups/adds in tests. |
| source/Calamari.Tests/ArgoCD/Commands/Conventions/UpdateArgoCDApplicationManifestsInstallConventionTests.cs | Adjusts purge expectations and asserts directory removal via git semantics. |
| source/Calamari.Tests/ArgoCD/Commands/Conventions/UpdateArgoCDAppImagesInstallConventionTest.cs | Adjusts expected paths for OS-specific separators. |
| source/Calamari.Tests/ArgoCD/Commands/Conventions/UpdateArgoCDAppImagesInstallConventionHelmTests.cs | Adjusts expected paths for OS-specific separators. |
| source/Calamari.Tests/ArgoCD/ArgoCDFilesUpdatedReporterTests.cs | Adds coverage for converting OS-specific paths to POSIX in service messages. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| { | ||
| public bool HasChanges() | ||
| { | ||
| return PatchedFileContent.Count > 0 || FilesRemoved.Length > 0; |
There was a problem hiding this comment.
HasChanges() assumes FilesRemoved is non-null and will throw if a caller constructs FileUpdateResult with a null FilesRemoved array. Since RepositoryAdapter already treats FilesRemoved as nullable (result.FilesRemoved ?? []), consider making HasChanges null-safe as well (e.g., treat null as empty).
| return PatchedFileContent.Count > 0 || FilesRemoved.Length > 0; | |
| return PatchedFileContent.Count > 0 || (FilesRemoved?.Length ?? 0) > 0; |
| using System; | ||
| using System.Collections.Generic; |
There was a problem hiding this comment.
This file has an unused using System; directive, which will produce CS8019. Since this repository has projects with TreatWarningsAsErrors enabled, this can fail the build. Please remove the unused using (or use the type if it was intended).
| return new ManifestUpdateResult(true, sourceUpdateResult.PushResult.CommitSha, sourceUpdateResult.PatchedFiles); | ||
| } | ||
|
|
||
| log.Info("No changes were commited"); |
There was a problem hiding this comment.
Typo in log message: "commited" should be "committed".
| log.Info("No changes were commited"); | |
| log.Info("No changes were committed"); |
| var retry = new RetryTracker(maxRetries: 3, timeLimit: null, new LimitedExponentialRetryInterval(2000, 30000, 2)); | ||
| while (retry.Try()) | ||
| { | ||
| var commandResult = ExecuteCommandInternal(new[] { initCommand }, out _, true); | ||
| if (commandResult.ExitCode == 0) | ||
| return; | ||
|
|
||
| if (IsTransientInitError(commandResult.Errors) && retry.CanRetry()) | ||
| { | ||
| log.Warn($"Terraform init failed with a transient error (attempt {retry.CurrentTry} of 4). Retrying after backoff..."); | ||
| Thread.Sleep(retry.Sleep()); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
This retry loop uses Thread.Sleep(retry.Sleep()), which means callers block for real time on every transient failure. It also makes the new unit tests (e.g., retries exhausted) take tens of seconds (currently ~4s + 8s + 16s) and likely become slow/flaky in CI. Consider injecting a delay/sleeper (or a RetryInterval) so tests can use a no-op/zero delay while production uses real backoff.
| var retry = new RetryTracker(maxRetries: 3, timeLimit: null, new LimitedExponentialRetryInterval(2000, 30000, 2)); | ||
| while (retry.Try()) | ||
| { | ||
| var commandResult = ExecuteCommandInternal(new[] { initCommand }, out _, true); | ||
| if (commandResult.ExitCode == 0) | ||
| return; | ||
|
|
||
| if (IsTransientInitError(commandResult.Errors) && retry.CanRetry()) | ||
| { | ||
| log.Warn($"Terraform init failed with a transient error (attempt {retry.CurrentTry} of 4). Retrying after backoff..."); | ||
| Thread.Sleep(retry.Sleep()); | ||
| continue; |
There was a problem hiding this comment.
The backoff settings here don't match the PR description (5s, 10s, 20s). With LimitedExponentialRetryInterval(2000, 30000, 2) and RetryTracker.Sleep() using the current try number, the sleeps are 4s, 8s, 16s. Either update the interval parameters to achieve the intended delays, or adjust the PR description/log messaging accordingly.
| var excludedPaths = foldersExcludedFromPurge.Select(excludedFolder => Path.Combine(outputDirectory, excludedFolder)).ToList(); | ||
| var filesToRemove = fileSystem.EnumerateFilesRecursively(outputDirectory) | ||
| .Where(f => !excludedPaths.Any(f.StartsWith)) | ||
| .ToArray(); |
There was a problem hiding this comment.
The purge exclusion logic !excludedPaths.Any(f.StartsWith) will also exclude files like .gitignore or .gitmodules because their full path starts with .../.git. That means a purge of the repo root won't actually remove those files. Consider checking for an exact .git directory match (e.g., path == excluded || path.StartsWith(excluded + separator)) and use an ordinal (and on Windows, case-insensitive) comparison.
| var normalized = path.Replace(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar); | ||
| return normalized.StartsWith($".{Path.AltDirectorySeparatorChar}") ? normalized.Substring(2) : normalized; |
There was a problem hiding this comment.
NormalizePath only replaces Path.DirectorySeparatorChar with Path.AltDirectorySeparatorChar. On non-Windows platforms both are '/', so any incoming paths containing backslashes (e.g., from external inputs) won't be normalized and staging/removal may fail. Consider normalizing both separators explicitly (e.g., replace '\' with '/') and then trimming any leading "./".
| var normalized = path.Replace(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar); | |
| return normalized.StartsWith($".{Path.AltDirectorySeparatorChar}") ? normalized.Substring(2) : normalized; | |
| if (string.IsNullOrEmpty(path)) | |
| return path; | |
| // Always normalize to forward slashes so paths are consistent across platforms | |
| var normalized = path.Replace('\\', '/'); | |
| // Trim a leading "./" to match repository-relative paths expected by LibGit2Sharp | |
| return normalized.StartsWith("./", StringComparison.Ordinal) | |
| ? normalized.Substring(2) | |
| : normalized; |
Background
terraform initfrequently fails with502 Bad Gatewaywhen downloading providers from GitHub (e.g., dnsimple/dnsimple). This is a transient CDN/network error that resolves on retry, but currently Calamari runsterraform initexactly once and fails immediately on any non-zero exit code.Reported by Matt Richardson — the ReleaseBot project on deploy.octopus.app hits this "fairly frequently."
Results
terraform initnow retries up to 3 times with exponential backoff (5s, 10s, 20s) when a transient HTTP error (502/503) is detected in stderr. Non-transient errors still fail immediately.fixes HPY-1283
Before
terraform initfails on first 502 Bad Gateway. Deployment fails with a stack trace. User must manually re-deploy.After
terraform initdetects the transient 502/503 in stderr, logs a warning, backs off, and retries. If the error clears (which it usually does), init succeeds and the deployment continues. If all retries are exhausted, the original error is thrown as before.