Use parallel curl downloads for website link checks#311
Conversation
|
👋 Hello @glenn-jocher, thank you for submitting a
For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀 |
UltralyticsAssistant
left a comment
There was a problem hiding this comment.
🔍 PR Review
Made with ❤️ by Ultralytics Actions
Good simplification overall, and the parallel curl approach should speed this workflow up. I found two regressions to address before merging: bare-root URLs can be written to the wrong local path, and the sitemap fetches lost retry protection.
💬 Posted 2 inline comments
| # Download initial sitemap and process | ||
| echo "Downloading sitemap..." | ||
| SITEMAP=$(wget --compression=auto -qO- "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; } | ||
| SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; } |
There was a problem hiding this comment.
💡 MEDIUM: Switching the sitemap fetches from wget to plain curl removes retry behavior here, so a transient network/TLS/5xx failure now aborts the whole job before the parallel download step even starts. Please add retries to this fetch, and mirror the same change for the sub-sitemap download below.
Suggested change:
| SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; } | |
| SITEMAP=$(curl --compressed --retry 3 --retry-all-errors -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; } |
| if not url: | ||
| continue | ||
| parsed = urlsplit(url) | ||
| path = f"{parsed.netloc}{parsed.path}" |
There was a problem hiding this comment.
💡 MEDIUM: This path builder misclassifies bare-origin URLs like https://docs.ultralytics.com because path becomes just the hostname, and the '.' in Path(path).name check then treats the hostname dots as a file extension. That saves the root page as docs.ultralytics.com instead of docs.ultralytics.com/index.html, which changes the local base path and can throw off relative-link resolution for the homepage.
Suggested change:
| path = f"{parsed.netloc}{parsed.path}" | |
| path = f"{parsed.netloc}{parsed.path or '/'}" |
Summary
Validation
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
This PR speeds up website link checks by replacing sequential
wgetdownloads with parallelcurldownloads in the GitHub Actions link-check workflow. ⚡📊 Key Changes
wgettocurl --compressed -fsSLfor simpler, reliable compressed downloads.wgetwebsite download step and introduces a small Python helper to generate acurlconfig file fromurls.txt.index.htmland extensionless paths to.html.curl --parallel --parallel-max 16, while preserving retries, redirect handling, compression, and directory creation.🎯 Purpose & Impact
ultralytics/docs. ✅