Rewrite proxmox.sh: replace apt-sync+lftp with tsumugu + /iso parser#206
Rewrite proxmox.sh: replace apt-sync+lftp with tsumugu + /iso parser#206yaoge123 wants to merge 2 commits into
Conversation
The current proxmox.sh runs apt-sync.py for the deb repos and lftp for
the ISO directory. This rewrite replaces both with a single tsumugu run
plus a small Python parser for /iso, which is necessary because:
1. download.proxmox.com is no longer a single Apache directory tree;
/iso/ now serves a custom HTML page (not autoindex).
2. tsumugu correctly handles both nginx-style autoindex and the custom
HTML, given a parser hint, so we get a single coherent walk over
/debian, /pmg, /pve, /devel and /iso instead of two tools.
3. Splitting deb sync (apt-sync) from ISO sync (lftp) means partial
failures on one side leave the mirror in inconsistent state.
The Python /iso parser:
- extracts <a href> entries, ignoring anchors and navigation
- HEADs every candidate file, downloads only when size differs
- writes via .tmp + os.replace (atomic)
- bounds deletes by TUNASYNC_PROXMOX_ISO_MAXDELETE (default 100)
Behavior preserved:
- same workdir layout (/debian, /pmg, /pve, /devel, /iso siblings)
- same logging style
- tunasync size reporting via REPO_SIZE_FILE / size-sum.sh
This is a full replacement, not an add-on; the old script does not run
correctly against today's /iso layout.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates the Proxmox mirror sync wrapper to safely synchronize /debian and /images via tsumugu, while handling the non-indexed /iso page with a custom HTML parser/downloader to avoid incorrect parsing and unsafe deletions.
Changes:
- Replace prior apt-sync/lftp approach with
tsumugu syncscoped per subtree for safe cleanup behavior. - Add an embedded Python
/isosync stage that discovers artifacts from the custom HTML page, performs HEAD metadata checks, downloads atomically, and deletes stale files with a bounded limit. - Add extensive in-script rationale documenting why
/isomust be handled separately.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # requires: wget, timeout | ||
| set -e | ||
| set -o pipefail | ||
| set -eu |
There was a problem hiding this comment.
Done. Switched 'set -eu' to 'set -euo pipefail' so a failure inside the USERAGENT pipeline (or any future pipeline) is no longer hidden by a successful tail.
| WORKDIR=${TUNASYNC_WORKING_DIR:?TUNASYNC_WORKING_DIR is required} | ||
| THREADS=${TUNASYNC_TSUMUGU_THREADS:-1} | ||
| MAXDELETE=${TUNASYNC_TSUMUGU_MAXDELETE:-10000} | ||
| USERAGENT=${TUNASYNC_TSUMUGU_USERAGENT:-"tsumugu/$(tsumugu --version | tail -n1 | cut -d' ' -f2)"} |
There was a problem hiding this comment.
Same as the line-2 reply: pipefail is now on.
| WORKDIR=${TUNASYNC_WORKING_DIR:?TUNASYNC_WORKING_DIR is required} | ||
| THREADS=${TUNASYNC_TSUMUGU_THREADS:-1} | ||
| MAXDELETE=${TUNASYNC_TSUMUGU_MAXDELETE:-10000} | ||
| USERAGENT=${TUNASYNC_TSUMUGU_USERAGENT:-"tsumugu/$(tsumugu --version | tail -n1 | cut -d' ' -f2)"} |
There was a problem hiding this comment.
Fixed. After computing USERAGENT we 'export TUNASYNC_TSUMUGU_USERAGENT="$USERAGENT"' so the Python /iso stage inherits the same value. The tsumugu invocation continues to pass --user-agent "$USERAGENT". Both stages now agree on UA.
| base = os.environ.get('TUNASYNC_UPSTREAM_URL', 'http://download.proxmox.com/').rstrip('/') + '/iso/' | ||
| work = Path(os.environ['TUNASYNC_WORKING_DIR']) / 'iso' | ||
| work.mkdir(parents=True, exist_ok=True) | ||
| user_agent = os.environ.get('TUNASYNC_TSUMUGU_USERAGENT', 'tsumugu') |
There was a problem hiding this comment.
Same as the line-35 reply: USERAGENT is exported into the Python stage's environment.
| # separately reviewed after implementation and all official *.sha256 files | ||
| # matched their local *.iso payloads. Future reviews can repeat that check. | ||
| socket.setdefaulttimeout(60) | ||
| base = os.environ.get('TUNASYNC_UPSTREAM_URL', 'http://download.proxmox.com/').rstrip('/') + '/iso/' |
There was a problem hiding this comment.
Fixed. The Python stage now derives 'allowed_netloc = urllib.parse.urlparse(base).netloc' from TUNASYNC_UPSTREAM_URL and validates each candidate URL against that. The hardcoded 'download.proxmox.com' string is gone.
| url = urllib.parse.urljoin(base, href) | ||
| parsed = urllib.parse.urlparse(url) | ||
| if parsed.scheme not in ('http', 'https') or parsed.netloc != 'download.proxmox.com': | ||
| continue |
There was a problem hiding this comment.
Same as the line-107 reply.
| head_req = urllib.request.Request(url, method='HEAD', headers={'User-Agent': user_agent}) | ||
| with urllib.request.urlopen(head_req, timeout=30) as resp: | ||
| size = int(resp.headers.get('Content-Length', '-1')) | ||
| lm = resp.headers.get('Last-Modified') |
There was a problem hiding this comment.
Added a Range: bytes=0-0 GET fallback. When HEAD raises an exception we issue 'Range: bytes=0-0', extract Content-Range/Last-Modified, and proceed. Existing files are kept on transient failure; missing files still produce a non-zero exit at the end of the run.
| except Exception as e: | ||
| # A transient HEAD failure for a file we already have should not delete | ||
| # or re-download the local file. A missing local file plus HEAD failure | ||
| # is recorded as an error so the job exits non-zero after processing. | ||
| if target.exists(): | ||
| print(f'proxmox iso: HEAD failed for existing {name}: {e}; keeping local file', flush=True) | ||
| continue | ||
| print(f'proxmox iso: HEAD failed for missing {name}: {e}', file=sys.stderr, flush=True) | ||
| errors.append(f'{name}: HEAD {e}') | ||
| continue |
There was a problem hiding this comment.
Same as the line-170 reply: GET-with-Range fallback now covers HEAD-rejecting CDNs.
Address review feedback: - Switch from 'set -eu' to 'set -euo pipefail' so failures inside the USERAGENT pipeline (and any future pipelines) are caught. - Export TUNASYNC_TSUMUGU_USERAGENT after computing it so the value passed to tsumugu sync also reaches the Python /iso stage; the two stages can no longer disagree on User-Agent. - Derive the allowed origin for /iso file URLs from TUNASYNC_UPSTREAM_URL via urlparse(...).netloc instead of hardcoding 'download.proxmox.com', so a different upstream tunasync deploys through this script does not get rejected. - Add a Range: bytes=0-0 GET fallback when HEAD fails (some CDNs reject HEAD with 405 or 403). We extract size from Content-Range and Last-Modified from the same response. Existing local files are preserved on transient failure; missing files still raise an error so the run exits non-zero.
Summary
Replace the existing
proxmox.sh(apt-sync + lftp) with a tsumugu-based driver plus a small Python parser for/iso.This is a full rewrite, not an add-on. The old script does not run correctly against the current
download.proxmox.comlayout.Why
download.proxmox.comis no longer a single Apache directory tree./iso/now serves a custom HTML page rather than nginx/Apache autoindex, solftp -mcannot enumerate it./debian,/pmg,/pve,/develand/isoinstead of running two unrelated tools.What changes
proxmox.shis replaced (42 → 219 lines)./debian,/pmg,/pve,/devel,/isosiblings underTUNASYNC_WORKING_DIR.REPO_SIZE_FILEandhelpers/size-sum.sh.BASE_URL/BASE_PATHenv conventions.The Python
/isoparser:<a href>entries from the custom HTML, ignoring anchors and navigation links.Content-Lengthdiffers from local size..tmp+os.replace(atomic).TUNASYNC_PROXMOX_ISO_MAXDELETE(default 100).