Add buildroot.py to mirror sources.buildroot.net via .mk parsing #204

Open
yaoge123 wants to merge 3 commits into tuna:master from yaoge123:add-buildroot-py

yaoge123 (Contributor) commented May 24, 2026

Summary

Add buildroot.py to mirror sources.buildroot.net by extracting download URLs straight out of the buildroot source tree.

Why

sources.buildroot.net, the official Buildroot package source backup, has permanently disabled directory listing (Cloudflare 403 on every path that isn't a known file). Conventional approaches all fail:

| Tool | Result |
| --- | --- |
| rsync | Upstream offers no rsync module |
| tsumugu / wget --mirror | Cannot enumerate, every directory request 403 |
| recursive HTML scraping | No HTML index exists |

The Buildroot maintainers have stated on the mailing list that they will not enable rsync. So a mirror has to derive the file list from somewhere else.

How it works

  1. Shallow-clone the buildroot master branch
  2. Parse boot/*.mk, linux/*.mk, and package/*/*.mk for VERSION, SITE, SOURCE
  3. Expand version variables and the upstream macros ($(call github,...), $(call gitlab,...), $(call sourceforge,...))
  4. For each derived URL, compare against the local mirror by Content-Length
  5. Download new/changed files atomically (.tmp + rename)
  6. Optionally clean up stale files up to TUNASYNC_BUILDROOT_MAXDELETE
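
Steps 2 and 3 above could be sketched roughly as follows. This is a minimal illustration, not the code in buildroot.py: the regexes and the names `parse_mk`, `expand`, and `expand_site` are hypothetical, only flat assignments and the github/gitlab macros are handled (sourceforge is omitted), and the URL shapes are the ones returned by `emit_github` / `emit_gitlab` as quoted in the review threads.

```python
import re

# Assumption: matches "NAME = v", "NAME := v", "NAME ?= v", "NAME += v".
ASSIGN_RE = re.compile(r"^([A-Z0-9_]+)\s*(:=|\?=|\+=|=)\s*(.*)$")
VAR_RE = re.compile(r"\$\(([A-Z0-9_]+)\)")
CALL_RE = re.compile(r"\$\(call (github|gitlab),([^,]+),([^,]+),([^)]+)\)")


def parse_mk(text: str) -> dict:
    """Collect simple variable assignments from the body of a .mk file."""
    variables = {}
    for line in text.splitlines():
        m = ASSIGN_RE.match(line.strip())
        if not m:
            continue
        name, op, value = m.group(1), m.group(2), m.group(3).strip()
        if op == "+=" and name in variables:
            variables[name] += " " + value   # append to the earlier value
        elif op == "?=" and name in variables:
            continue                          # conditional: keep earlier value
        else:
            variables[name] = value
    return variables


def expand(value: str, variables: dict, depth: int = 10) -> str:
    """Substitute $(VAR) references repeatedly; unknown names stay as-is."""
    for _ in range(depth):
        new = VAR_RE.sub(lambda m: variables.get(m.group(1), m.group(0)), value)
        if new == value:
            break
        value = new
    return value


def expand_site(site: str, variables: dict) -> str:
    """Turn $(call github,...) / $(call gitlab,...) into a full tarball URL."""
    site = expand(site, variables)
    m = CALL_RE.fullmatch(site)
    if not m:
        return site  # plain SITE value, not a recognised macro call
    host, org, repo, version = (g.strip() for g in m.groups())
    if host == "github":
        return f"https://github.com/{org}/{repo}/archive/{version}/{repo}-{version}.tar.gz"
    return f"https://gitlab.com/{org}/{repo}/-/archive/{version}/{repo}-{version}.tar.gz"
```

A real .mk file can contain conditionals, multi-line values, and computed names that a line-based parser like this cannot see, which is presumably why the description calls the substitution heuristic.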

Notes

  • Existing local data is preserved by default
  • GitHub API rate limits are avoided by downloading archive tarball URLs rather than calling the API
  • Version-variable substitution is heuristic; a small VERSION_OVERRIDES dict in the script handles known awkward cases
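
The Content-Length comparison from step 4 of "How it works", together with the "existing local data is preserved" note, might look like the sketch below. The function names are hypothetical and the policy for a missing Content-Length header (keep the local file) is an assumption, not something the PR states.

```python
import urllib.request
from pathlib import Path
from typing import Optional


def remote_content_length(url: str, timeout: float = 30.0) -> Optional[int]:
    """Return the Content-Length reported for a HEAD request, or None if absent."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        value = resp.headers.get("Content-Length")
    return int(value) if value is not None and value.isdigit() else None


def needs_download(dest: Path, remote_length: Optional[int]) -> bool:
    """Decide whether dest must be (re)fetched, based on size only."""
    if not dest.exists() or dest.stat().st_size == 0:
        return True                # missing or empty local file
    if remote_length is None:
        return False               # assumption: size unknown, keep local copy
    return dest.stat().st_size != remote_length
```

Comparing sizes is cheap but cannot detect a changed file of identical length; a checksum comparison would be needed for that.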

Copilot AI review requested due to automatic review settings May 24, 2026 09:21

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a self-contained Python sync script to mirror Buildroot package sources without relying on upstream directory listing, by extracting download URLs from the Buildroot git tree and incrementally downloading missing artifacts.

Changes:

  • Introduces buildroot.py to clone/update Buildroot git, parse .mk metadata, and download package sources incrementally.
  • Adds basic logging + state tracking to skip runs when git HEAD hasn’t changed.
  • Implements atomic downloads via .tmp → rename with fallback to sources.buildroot.net.

Comment thread buildroot.py Outdated
Comment on lines +150 to +154
    return f"https://github.com/{repo_org}/{repo_name}/archive/{version}/{repo_name}-{version}.tar.gz"


def emit_gitlab(repo_org: str, repo_name: str, version: str) -> str:
    return f"https://gitlab.com/{repo_org}/{repo_name}/-/archive/{version}/{repo_name}-{version}.tar.gz"
Contributor Author

Fixed. emit_github / emit_gitlab now return a URL whose path already ends with the tarball filename, and sync_package no longer re-appends the local filename on top of those URLs; the filename is appended only when the configured SITE is a directory.

Comment thread buildroot.py Outdated
Comment on lines +324 to +325
full = url.rstrip("/") + "/" + local_file if url else None
if full:
Contributor Author

Same as the line-151 reply: emit_github / emit_gitlab URLs are now treated as full file URLs in sync_package via url_has_filename().
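
For readers following the thread, the fix could be sketched like this. The real `url_has_filename()` is not shown in the conversation, so the suffix-based heuristic, the suffix list, and the helper `full_url` are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumption: a URL "has a filename" if its path ends in a known archive suffix.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".tar.zst", ".zip")


def url_has_filename(url: str) -> bool:
    """Guess whether the URL path already ends in an archive filename."""
    return urlparse(url).path.lower().endswith(ARCHIVE_SUFFIXES)


def full_url(site: str, local_file: str) -> str:
    """Append local_file only when site looks like a directory URL."""
    if url_has_filename(site):
        return site
    return site.rstrip("/") + "/" + local_file
```

With this shape, a directory-style SITE still gets the filename appended, while a macro-derived archive URL is passed through unchanged.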

Comment thread buildroot.py Outdated
Comment on lines +316 to +320
# Skip if local file exists and is non-empty
if dest.exists() and dest.stat().st_size > 0:
    stats["skipped"] += 1
    return True

Contributor Author

Done. bump('total') is now invoked at the very top of sync_package, before any skip return, so the final summary covers every package candidate.

Comment thread buildroot.py Outdated
Comment on lines +410 to +411
if stats["total"] % 500 == 0:
log(f" Progress: {stats['total']}/{len(packages)} — "
Contributor Author

Same as the line-317 reply: total is now bumped up front.

Comment thread buildroot.py
Comment on lines +91 to +95
def wget(url: str, dest: Path) -> bool:
    """Download a file via wget, follow redirects."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".tmp")
    tmp.unlink(missing_ok=True)
Contributor Author

wget() now removes the .tmp file in a finally block, so it is cleaned up on every code path (success, non-zero exit from the downloader, exception, or empty body).
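
A sketch of that pattern, under stated assumptions: the `download` parameter and the names `run_wget` and `atomic_fetch` are hypothetical, introduced so the cleanup logic can be exercised without a network, and the wget flags (`-q`, `-O`) are a plausible invocation rather than the one in the script.

```python
import subprocess
from pathlib import Path
from typing import Callable


def run_wget(url: str, tmp: Path) -> bool:
    """Fetch url into tmp with wget; True when wget exits with status 0."""
    r = subprocess.run(["wget", "-q", "-O", str(tmp), url],
                       capture_output=True, text=True, timeout=180)
    return r.returncode == 0


def atomic_fetch(url: str, dest: Path,
                 download: Callable[[str, Path], bool] = run_wget) -> bool:
    """Download to dest.tmp, then rename; never leave the .tmp file behind."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".tmp")
    tmp.unlink(missing_ok=True)
    try:
        if not download(url, tmp):
            return False                      # downloader reported failure
        if not tmp.exists() or tmp.stat().st_size == 0:
            return False                      # empty body
        tmp.replace(dest)                     # atomic on the same filesystem
        return True
    finally:
        tmp.unlink(missing_ok=True)           # no-op after a successful rename
```

Because the cleanup sits in `finally`, it also runs when `download` raises, which is the case a plain `if`/`else` would miss.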

Comment thread buildroot.py
Comment on lines +56 to +58
MAXDELETE = int(os.environ.get("TUNASYNC_BUILDROOT_MAXDELETE", "10000"))
JOBS = int(os.environ.get("TUNASYNC_BUILDROOT_JOBS", "4"))
DRYRUN = os.environ.get("TUNASYNC_BUILDROOT_DRYRUN", "") in ("1", "true", "yes")
Contributor Author

@yaoge123 yaoge123 May 24, 2026

Updated: after reflection I scaled this back. JOBS now defaults to 1 (serial) to match what most repo scripts assume — the wget/git invocations underneath aren't friendly to concurrent invocations on the same upstream and the mirror was never meant to be aggressive. Operators who do want parallelism can still set TUNASYNC_BUILDROOT_JOBS=N. The ThreadPoolExecutor + stats_lock plumbing stays so the option is available. Adjusted in fb0bc70.
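
The arrangement described in this reply (serial by default, opt-in threads, lock-guarded counters) might be wired up as below. `bump`, `stats`, `stats_lock`, and the environment variable are named in the conversation; `sync_all`, the stats keys other than `total` and `skipped`, and the exact structure are assumptions.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# Serial unless the operator opts in via the environment.
JOBS = max(1, int(os.environ.get("TUNASYNC_BUILDROOT_JOBS", "1")))

stats = {"total": 0, "skipped": 0, "downloaded": 0, "failed": 0}
stats_lock = threading.Lock()


def bump(key: str, n: int = 1) -> None:
    """Increment a counter under the lock so worker threads do not race."""
    with stats_lock:
        stats[key] += n


def sync_all(packages, sync_one, jobs: int = JOBS) -> None:
    """Run sync_one over packages; plain loop when jobs == 1."""
    if jobs <= 1:
        for pkg in packages:
            sync_one(pkg)
        return
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() drains the iterator so worker exceptions surface here.
        list(pool.map(sync_one, packages))
```

A `sync_one` that calls `bump("total")` as its first statement, before any skip return, gives the behaviour described in the earlier reply about the final summary.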

Comment thread buildroot.py Outdated
Comment on lines +348 to +356
def clean_stale_files() -> None:
    """Remove local files that are no longer in the current buildroot tree."""
    # Build set of known directories from packages
    known_names = set()
    # We don't actually know what's "stale" without a full scan —
    # so we skip this phase unless the user explicitly wants it.
    # The buildroot mirror is append-only by design.
    log("Cleanup: skipped (buildroot mirror is append-only by default)")

Contributor Author

@yaoge123 yaoge123 May 24, 2026

Updated: after reflection I scaled this back. JOBS now defaults to 1 (serial) to match what most repo scripts assume — the wget/git invocations underneath aren't friendly to concurrent invocations on the same upstream and the mirror was never meant to be aggressive. Operators who do want parallelism can still set TUNASYNC_BUILDROOT_JOBS=N. The ThreadPoolExecutor + stats_lock plumbing stays so the option is available. Adjusted in fb0bc70.

Comment thread buildroot.py Outdated
Comment on lines +407 to +409
log("Step 3: Downloading new/changed files...")
for pkg in packages:
    sync_package(pkg)
Contributor Author

@yaoge123 yaoge123 May 24, 2026

Updated: after reflection I scaled this back. JOBS now defaults to 1 (serial) to match what most repo scripts assume — the wget/git invocations underneath aren't friendly to concurrent invocations on the same upstream and the mirror was never meant to be aggressive. Operators who do want parallelism can still set TUNASYNC_BUILDROOT_JOBS=N. The ThreadPoolExecutor + stats_lock plumbing stays so the option is available. Adjusted in fb0bc70.

Comment thread buildroot.py
Comment on lines +87 to +88
def run(cmd: list, **kw) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True, timeout=180, **kw)
Contributor Author

Done. git fetch, git reset --hard, and git clone all check returncode now and log stderr; if the tree cannot be fetched the script aborts with rc=1 instead of running on an empty package list.
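
One way the return-code checks could be structured is sketched here. `run` is copied from the hunk under review; `checked`, `update_tree`, and the injectable `runner` parameter are hypothetical, and logging goes to stderr via `print` because the script's `log` helper is not shown in full.

```python
import subprocess
import sys
from pathlib import Path


def run(cmd: list, **kw) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True, timeout=180, **kw)


def checked(cmd: list, runner=run) -> bool:
    """Run cmd; on a non-zero exit, log stderr and report failure."""
    r = runner(cmd)
    if r.returncode != 0:
        print(f"FAILED ({r.returncode}): {' '.join(cmd)}\n{r.stderr.strip()}",
              file=sys.stderr)
        return False
    return True


def update_tree(git_dir: Path, url: str, branch: str, runner=run) -> bool:
    """Fetch and reset an existing shallow clone, or create one."""
    if (git_dir / ".git").is_dir():
        return (
            checked(["git", "-C", str(git_dir), "fetch", "--depth=1", "origin", branch], runner)
            and checked(["git", "-C", str(git_dir), "reset", "--hard", f"origin/{branch}"], runner)
        )
    return checked(["git", "clone", "--depth=1", "--branch", branch, url, str(git_dir)], runner)

# Caller side (assumption): abort instead of syncing an empty package list.
#     if not update_tree(BR_GIT_DIR, BR_GIT_URL, BR_BRANCH):
#         sys.exit(1)
```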

Comment thread buildroot.py Outdated
Comment on lines +373 to +378
    run(["git", "-C", str(BR_GIT_DIR), "fetch", "--depth=1", "origin", BR_BRANCH])
    run(["git", "-C", str(BR_GIT_DIR), "reset", "--hard", f"origin/{BR_BRANCH}"])
else:
    r = run(["git", "clone", "--depth=1", "--branch", BR_BRANCH, BR_GIT_URL, str(BR_GIT_DIR)])
    if r.returncode != 0:
        raise Exception("clone failed")
Contributor Author

Same as the line-85 reply.

sources.buildroot.net (the official Buildroot backup site) has
permanently disabled directory listing (Cloudflare 403). None of rsync,
tsumugu, or wget -m can discover what files exist on the server.

This script clones the buildroot git tree, extracts the download URL
for every package directly from the .mk files, and downloads only new
or changed files. Existing local data is preserved (no deletion unless
explicitly enabled).

Steps:
  1. Shallow-clone buildroot master
  2. Parse boot/*.mk, linux/*.mk, package/*/*.mk for VERSION, SITE, SOURCE
  3. Expand version variables and macro calls (github, gitlab, sourceforge)
  4. Compare each candidate URL against the local mirror
  5. Download new/changed files atomically (.tmp -> rename)
  6. Optionally clean up stale files, up to TUNASYNC_BUILDROOT_MAXDELETE deletions

GitHub API rate-limit is avoided by using archive tarball URLs.
SourceForge redirects are followed by wget, not by the script.

yaoge123 force-pushed the add-buildroot-py branch from 8f2e0cb to ff1cbf5 May 24, 2026 09:25

yaoge123 added 2 commits May 24, 2026 17:47

Address review feedback:
- emit_github / emit_gitlab now return URLs that already include the
  filename, and sync_package no longer re-appends the local filename
  on top of them, fixing broken URLs like '.../tar.gz/repo-1.2.3.tar.gz'.
- bump('total') is called before any skip path, so progress counters
  and the final summary include skipped packages.
- wget() now removes the .tmp file in a finally block on every code
  path, so failed downloads no longer leave .tmp files behind.
- Variable assignment regex now matches Makefile ':=' immediate
  assignment, not just '=', '?=', '+='.
- JOBS is now real concurrency via concurrent.futures.ThreadPoolExecutor,
  and stats updates go through a lock.
- TUNASYNC_BUILDROOT_CLEANUP enables stale-file cleanup; deletions are
  refused when they would exceed TUNASYNC_BUILDROOT_MAXDELETE.
- git fetch / reset / clone return codes are checked; if the git tree
  cannot be fetched the script aborts (return code 1) instead of
  silently running on an empty package list.

Per Copilot review feedback: parallel sync of buildroot can amplify
upstream load when many tunasync mirrors deploy this script. Default
to serial (JOBS=1) and let operators opt into concurrency via
TUNASYNC_BUILDROOT_JOBS.
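
The cleanup guard mentioned in the review-feedback commit (deletions refused beyond TUNASYNC_BUILDROOT_MAXDELETE) could be sketched as below. `plan_cleanup` is a hypothetical helper, and the mirror layout assumed here (paths relative to the mirror root, such as `foo/foo-1.0.tar.gz`) is an illustration only.

```python
from pathlib import Path
from typing import Iterable, List


def plan_cleanup(mirror_root: Path, expected: Iterable[str],
                 max_delete: int) -> List[Path]:
    """Return stale files to delete, or raise if there are too many."""
    keep = set(expected)  # relative POSIX paths derived from the .mk files
    stale = sorted(
        p for p in mirror_root.rglob("*")
        if p.is_file() and p.relative_to(mirror_root).as_posix() not in keep
    )
    if len(stale) > max_delete:
        # A large stale set more likely signals a parsing failure than
        # genuine upstream removals, so refuse rather than delete.
        raise RuntimeError(
            f"refusing to delete {len(stale)} files (limit {max_delete})")
    return stale
```

Separating the planning step from the actual `unlink` calls keeps the limit check in one place and makes a dry run trivial: log the returned list and stop.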