Add buildroot.py to mirror sources.buildroot.net via .mk parsing #204
yaoge123 wants to merge 3 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a self-contained Python sync script to mirror Buildroot package sources without relying on upstream directory listing, by extracting download URLs from the Buildroot git tree and incrementally downloading missing artifacts.
Changes:
- Introduces buildroot.py to clone/update Buildroot git, parse .mk metadata, and download package sources incrementally.
- Adds basic logging + state tracking to skip runs when git HEAD hasn't changed.
- Implements atomic downloads via .tmp → rename with fallback to sources.buildroot.net.
    return f"https://github.com/{repo_org}/{repo_name}/archive/{version}/{repo_name}-{version}.tar.gz"


def emit_gitlab(repo_org: str, repo_name: str, version: str) -> str:
    return f"https://gitlab.com/{repo_org}/{repo_name}/-/archive/{version}/{repo_name}-{version}.tar.gz"
Fixed. emit_github / emit_gitlab now return a URL whose path already ends with the tarball filename, and sync_package no longer re-appends the local filename on top of those URLs (only when the configured SITE is a directory).
    full = url.rstrip("/") + "/" + local_file if url else None
    if full:
Same as the line-151 reply: emit_github / emit_gitlab URLs are now treated as full file URLs in sync_package via url_has_filename().
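The replies above name a url_has_filename() helper but the thread does not show its body. A minimal sketch of how such a check could work is below. The suffix list and the full_url() wrapper are assumptions made for illustration, not the script's actual code.

```python
from urllib.parse import urlsplit

# Assumed list of archive suffixes; the real script may use a different test.
ARCHIVE_SUFFIXES = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".zip")


def url_has_filename(url: str) -> bool:
    """Guess whether the URL path already ends in an archive filename."""
    path = urlsplit(url).path
    return path.lower().endswith(ARCHIVE_SUFFIXES)


def full_url(site_url: str, local_file: str) -> str:
    """Hypothetical helper: append local_file only when SITE is a directory."""
    if url_has_filename(site_url):
        return site_url
    return site_url.rstrip("/") + "/" + local_file
```

With this shape, a URL produced by emit_github or emit_gitlab passes through unchanged, while a directory-style SITE gets the filename appended exactly once.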
    # Skip if local file exists and is non-empty
    if dest.exists() and dest.stat().st_size > 0:
        stats["skipped"] += 1
        return True
Done. bump('total') is now invoked at the very top of sync_package, before any skip return, so the final summary covers every package candidate.
    if stats["total"] % 500 == 0:
        log(f" Progress: {stats['total']}/{len(packages)} — "
Same as the line-317 reply: total is now bumped up front.
def wget(url: str, dest: Path) -> bool:
    """Download a file via wget, follow redirects."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".tmp")
    tmp.unlink(missing_ok=True)
wget() now removes the .tmp file in a finally block, so it is cleaned up on every code path (success, non-zero exit status, exception, or empty body).
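For reference, the cleanup pattern described in this reply could look roughly like the sketch below. It is not the script's actual wget(). The download_atomic name and the injectable build_cmd parameter are assumptions added so the pattern can be exercised without network access.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable, List


def wget_cmd(url: str, out: Path) -> List[str]:
    """Default command line: quiet wget writing to the temporary file."""
    return ["wget", "-q", "-O", str(out), url]


def download_atomic(url: str, dest: Path,
                    build_cmd: Callable[[str, Path], List[str]] = wget_cmd) -> bool:
    """Download to dest.tmp, then rename; the .tmp file never outlives the call."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".tmp")
    tmp.unlink(missing_ok=True)
    try:
        r = subprocess.run(build_cmd(url, tmp), capture_output=True,
                           text=True, timeout=180)
        # Treat a non-zero exit, a missing file, or an empty body as failure.
        if r.returncode != 0 or not tmp.exists() or tmp.stat().st_size == 0:
            return False
        os.replace(tmp, dest)  # atomic within one filesystem
        return True
    except (OSError, subprocess.TimeoutExpired):
        return False
    finally:
        # No-op after a successful rename; removes leftovers otherwise.
        tmp.unlink(missing_ok=True)
```

Because the final unlink uses missing_ok=True, it is harmless after a successful os.replace and still removes partial files on every failure path.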
MAXDELETE = int(os.environ.get("TUNASYNC_BUILDROOT_MAXDELETE", "10000"))
JOBS = int(os.environ.get("TUNASYNC_BUILDROOT_JOBS", "4"))
DRYRUN = os.environ.get("TUNASYNC_BUILDROOT_DRYRUN", "") in ("1", "true", "yes")
Updated: after reflection I scaled this back. JOBS now defaults to 1 (serial) to match what most repo scripts assume — the wget/git invocations underneath aren't friendly to concurrent invocations on the same upstream and the mirror was never meant to be aggressive. Operators who do want parallelism can still set TUNASYNC_BUILDROOT_JOBS=N. The ThreadPoolExecutor + stats_lock plumbing stays so the option is available. Adjusted in fb0bc70.
def clean_stale_files() -> None:
    """Remove local files that are no longer in the current buildroot tree."""
    # Build set of known directories from packages
    known_names = set()
    # We don't actually know what's "stale" without a full scan —
    # so we skip this phase unless the user explicitly wants it.
    # The buildroot mirror is append-only by design.
    log("Cleanup: skipped (buildroot mirror is append-only by default)")
Updated: after reflection I scaled this back. JOBS now defaults to 1 (serial) to match what most repo scripts assume — the wget/git invocations underneath aren't friendly to concurrent invocations on the same upstream and the mirror was never meant to be aggressive. Operators who do want parallelism can still set TUNASYNC_BUILDROOT_JOBS=N. The ThreadPoolExecutor + stats_lock plumbing stays so the option is available. Adjusted in fb0bc70.
    log("Step 3: Downloading new/changed files...")
    for pkg in packages:
        sync_package(pkg)
Updated: after reflection I scaled this back. JOBS now defaults to 1 (serial) to match what most repo scripts assume — the wget/git invocations underneath aren't friendly to concurrent invocations on the same upstream and the mirror was never meant to be aggressive. Operators who do want parallelism can still set TUNASYNC_BUILDROOT_JOBS=N. The ThreadPoolExecutor + stats_lock plumbing stays so the option is available. Adjusted in fb0bc70.
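A rough sketch of the opt-in concurrency described in this reply follows. The names JOBS, stats_lock, bump, and sync_package come from the thread, but the sync_all() wrapper and the exact stats keys are assumptions for illustration.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# Serial by default; operators opt in via TUNASYNC_BUILDROOT_JOBS=N.
JOBS = max(1, int(os.environ.get("TUNASYNC_BUILDROOT_JOBS", "1")))

stats = {"total": 0, "skipped": 0, "downloaded": 0, "failed": 0}
stats_lock = threading.Lock()


def bump(key: str) -> None:
    """Increment a counter under the lock so worker threads do not race."""
    with stats_lock:
        stats[key] += 1


def sync_all(packages, sync_package, jobs: int = JOBS) -> None:
    """Hypothetical driver: plain loop when jobs <= 1, thread pool otherwise."""
    if jobs <= 1:
        for pkg in packages:
            sync_package(pkg)
        return
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() drains the iterator so worker exceptions surface here.
        list(pool.map(sync_package, packages))
```

Keeping the serial path as an ordinary loop means the default behaves exactly like the original code, and the pool is only created when an operator asks for it.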
def run(cmd: list, **kw) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True, timeout=180, **kw)
Done. git fetch, git reset --hard, and git clone all check returncode now and log stderr; if the tree cannot be fetched the script aborts with rc=1 instead of running on an empty package list.
        run(["git", "-C", str(BR_GIT_DIR), "fetch", "--depth=1", "origin", BR_BRANCH])
        run(["git", "-C", str(BR_GIT_DIR), "reset", "--hard", f"origin/{BR_BRANCH}"])
    else:
        r = run(["git", "clone", "--depth=1", "--branch", BR_BRANCH, BR_GIT_URL, str(BR_GIT_DIR)])
        if r.returncode != 0:
            raise Exception("clone failed")
Same as the line-85 reply.
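The return-code checking described in these replies might be structured along the following lines. This is a sketch only: run_checked() and update_tree() are hypothetical names, and the git arguments are copied from the snippet under review rather than from the fixed script.

```python
import subprocess
from pathlib import Path


def run(cmd: list, **kw) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True, timeout=180, **kw)


def run_checked(cmd: list, what: str, log=print) -> bool:
    """Run a command; on a non-zero exit, log its stderr and report failure."""
    r = run(cmd)
    if r.returncode != 0:
        log(f"{what} failed (rc={r.returncode}): {r.stderr.strip()}")
        return False
    return True


def update_tree(git_dir: Path, url: str, branch: str, log=print) -> bool:
    """Fetch and reset an existing clone, or shallow-clone a fresh one."""
    if (git_dir / ".git").is_dir():
        return (
            run_checked(["git", "-C", str(git_dir), "fetch", "--depth=1",
                         "origin", branch], "git fetch", log)
            and run_checked(["git", "-C", str(git_dir), "reset", "--hard",
                             f"origin/{branch}"], "git reset", log)
        )
    return run_checked(["git", "clone", "--depth=1", "--branch", branch,
                        url, str(git_dir)], "git clone", log)
```

A caller would then exit with status 1 when update_tree() returns False, which keeps the script from continuing with an empty package list.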
sources.buildroot.net (the official Buildroot backup site) has permanently disabled directory listing (Cloudflare 403). Neither rsync, tsumugu, nor wget -m can discover what files exist on the server.

This script clones the buildroot git tree, extracts the download URL for every package directly from the .mk files, and downloads only new or changed files. Existing local data is preserved (no deletion unless explicitly enabled).

Steps:
1. Shallow-clone buildroot master
2. Parse boot/*.mk, linux/*.mk, package/*/*.mk for VERSION, SITE, SOURCE
3. Expand version variables and macro calls (github, gitlab, sourceforge)
4. Compare each candidate URL against the local mirror
5. Download new/changed files atomically (.tmp -> rename)
6. Clean up stale files up to TUNASYNC_BUILDROOT_MAXDELETE

GitHub API rate-limit is avoided by using archive tarball URLs. SourceForge redirects are followed by wget, not by the script.
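Step 3 of the commit message (expanding version variables and macro calls) can be illustrated with the sketch below. The emit_github and emit_gitlab bodies are taken from the diff under review. The regexes, expand_vars(), and expand_site() are assumptions for illustration, and the sourceforge macro is left out because the thread does not show its URL format.

```python
import re
from typing import Dict, Optional


def emit_github(repo_org: str, repo_name: str, version: str) -> str:
    return f"https://github.com/{repo_org}/{repo_name}/archive/{version}/{repo_name}-{version}.tar.gz"


def emit_gitlab(repo_org: str, repo_name: str, version: str) -> str:
    return f"https://gitlab.com/{repo_org}/{repo_name}/-/archive/{version}/{repo_name}-{version}.tar.gz"


EMITTERS = {"github": emit_github, "gitlab": emit_gitlab}
VAR_RE = re.compile(r"\$\(([A-Z0-9_]+)\)")  # plain $(FOO_VERSION) references
CALL_RE = re.compile(r"\$\(call\s+(github|gitlab)\s*,([^()]*)\)")


def expand_vars(text: str, variables: Dict[str, str]) -> str:
    """Replace known $(NAME) references; unknown ones are left untouched."""
    return VAR_RE.sub(lambda m: variables.get(m.group(1), m.group(0)), text)


def expand_site(site: str, variables: Dict[str, str]) -> Optional[str]:
    """Expand variables first, then turn a github/gitlab macro call into a URL."""
    expanded = expand_vars(site, variables)
    m = CALL_RE.fullmatch(expanded.strip())
    if m is None:
        return expanded  # ordinary URL, or a macro that could not be resolved
    args = [a.strip() for a in m.group(2).split(",")]
    if len(args) != 3 or not all(args):
        return None  # malformed call: the emitters need org, repo, version
    return EMITTERS[m.group(1)](*args)
```

For example, with FOO_VERSION set to 1.2.3, the value `$(call github,acme,foo,v$(FOO_VERSION))` would expand to an archive URL ending in foo-v1.2.3.tar.gz.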
8f2e0cb to ff1cbf5
Address review feedback:
- emit_github / emit_gitlab now return URLs that already include the
filename, and sync_package no longer re-appends the local filename
on top of them, fixing broken URLs like '.../tar.gz/repo-1.2.3.tar.gz'.
- bump('total') is called before any skip path, so progress counters
and the final summary include skipped packages.
- wget() now removes the .tmp file in a finally block on every code
path, so failed downloads no longer leave .tmp files behind.
- Variable assignment regex now matches Makefile ':=' immediate
assignment, not just '=', '?=', '+='.
- JOBS is now real concurrency via concurrent.futures.ThreadPoolExecutor,
and stats updates go through a lock.
- TUNASYNC_BUILDROOT_CLEANUP enables stale-file cleanup; deletions are
refused when they would exceed TUNASYNC_BUILDROOT_MAXDELETE.
- git fetch / reset / clone return codes are checked; if the git tree
cannot be fetched the script aborts (return code 1) instead of
silently running on an empty package list.
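The assignment-regex item in the list above could be realized along these lines. The pattern and the parse_assignments() helper are illustrative assumptions, not the script's actual regex; the handling of `?=` and `+=` follows standard Makefile semantics in a simplified form.

```python
import re
from typing import Dict

# Longer operators come first so ':=' is not read as a bare '='.
ASSIGN_RE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*(:=|\?=|\+=|=)\s*(.*?)\s*$")


def parse_assignments(text: str) -> Dict[str, str]:
    """Collect NAME <op> value lines from .mk text (simplified semantics)."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        m = ASSIGN_RE.match(line)
        if m is None:
            continue  # comments, conditionals, $(eval ...) lines, recipes
        name, op, value = m.groups()
        if op == "?=":
            result.setdefault(name, value)  # only set when not yet defined
        elif op == "+=":
            result[name] = (result.get(name, "") + " " + value).strip()
        else:  # '=' and ':=' both overwrite in this sketch
            result[name] = value
    return result
```

This sketch ignores line continuations and deferred versus immediate expansion, which a real parser of Buildroot .mk files would likely need to consider.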
Per Copilot review feedback: parallel sync of buildroot can amplify upstream load when many tunasync mirrors deploy this script. Default to serial (JOBS=1) and let operators opt into concurrency via TUNASYNC_BUILDROOT_JOBS.
Summary
Add buildroot.py to mirror sources.buildroot.net by extracting download URLs straight out of the buildroot source tree.
Why
sources.buildroot.net, the official Buildroot package source backup, has permanently disabled directory listing (Cloudflare 403 on every path that isn't a known file). Conventional approaches all fail: neither rsync, tsumugu, nor wget -m can discover what files exist on the server.
The Buildroot maintainers have stated on the mailing list that they will not enable rsync. So a mirror has to derive the file list from somewhere else.
How it works
1. Shallow-clone buildroot master
2. Parse boot/*.mk, linux/*.mk, and package/*/*.mk for VERSION, SITE, SOURCE
3. Expand version variables and macro calls ($(call github,...), $(call gitlab,...), $(call sourceforge,...))
4. Compare each candidate URL against the local mirror (Content-Length)
5. Download new/changed files atomically (.tmp + rename)
6. Clean up stale files up to TUNASYNC_BUILDROOT_MAXDELETE
Notes
- The VERSION_OVERRIDES dict in the script handles known awkward cases
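The guarded cleanup mentioned in the description (opt-in via TUNASYNC_BUILDROOT_CLEANUP, refused above TUNASYNC_BUILDROOT_MAXDELETE) could be sketched as follows. The clean_stale() signature, the accepted values for the CLEANUP variable, and the way wanted files are represented are all assumptions; the actual script may differ.

```python
import os
from pathlib import Path
from typing import Set

MAXDELETE = int(os.environ.get("TUNASYNC_BUILDROOT_MAXDELETE", "10000"))
# Assumed truthy values, mirroring the DRYRUN parsing shown in the review.
CLEANUP = os.environ.get("TUNASYNC_BUILDROOT_CLEANUP", "") in ("1", "true", "yes")


def clean_stale(mirror_dir: Path, wanted: Set[str], max_delete: int = MAXDELETE,
                enabled: bool = CLEANUP, log=print) -> int:
    """Delete files not in `wanted` (POSIX paths relative to mirror_dir).

    Returns the number of files removed; removes nothing when cleanup is
    disabled or when the stale count exceeds max_delete.
    """
    if not enabled:
        log("Cleanup: skipped (mirror is append-only by default)")
        return 0
    stale = sorted(
        p for p in mirror_dir.rglob("*")
        if p.is_file() and p.relative_to(mirror_dir).as_posix() not in wanted
    )
    if len(stale) > max_delete:
        log(f"Cleanup: refusing to delete {len(stale)} files (limit {max_delete})")
        return 0
    for p in stale:
        p.unlink()
    return len(stale)
```

Refusing the whole batch, rather than deleting up to the limit, is a deliberate choice in this sketch: a sudden spike in stale files more likely signals a parsing failure than real upstream churn.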