Skip to content

Rewrite proxmox.sh: replace apt-sync+lftp with tsumugu + /iso parser#206

Open
yaoge123 wants to merge 2 commits into
tuna:masterfrom
yaoge123:rewrite-proxmox-sh
Open

Rewrite proxmox.sh: replace apt-sync+lftp with tsumugu + /iso parser#206
yaoge123 wants to merge 2 commits into
tuna:masterfrom
yaoge123:rewrite-proxmox-sh

Conversation

@yaoge123
Copy link
Copy Markdown
Contributor

@yaoge123 yaoge123 commented May 24, 2026

Summary

Replace the existing proxmox.sh (apt-sync + lftp) with a tsumugu-based driver plus a small Python parser for /iso.

This is a full rewrite, not an add-on. The old script does not run correctly against the current download.proxmox.com layout.

Why

  1. download.proxmox.com is no longer a single Apache directory tree. /iso/ now serves a custom HTML page rather than nginx/Apache autoindex, so lftp -m cannot enumerate it.
  2. tsumugu correctly handles both nginx-style autoindex and the custom HTML when given an explicit parser hint, so a single coherent walk can cover /debian, /pmg, /pve, /devel and /iso instead of running two unrelated tools.
  3. Splitting the deb sync (apt-sync) from the ISO sync (lftp) means a partial failure on either side leaves the mirror in an inconsistent state.

What changes

  • proxmox.sh is replaced (42 → 219 lines).
  • Behaviour preserved:
    • Same workdir layout: /debian, /pmg, /pve, /devel, /iso siblings under TUNASYNC_WORKING_DIR.
    • Same tunasync size reporting via REPO_SIZE_FILE and helpers/size-sum.sh.
    • Same BASE_URL / BASE_PATH env conventions.

The Python /iso parser:

  • Extracts <a href> entries from the custom HTML, ignoring anchors and navigation links.
  • HEADs every candidate file, downloads only when remote Content-Length differs from local size.
  • Writes via .tmp + os.replace (atomic).
  • Bounds deletes by TUNASYNC_PROXMOX_ISO_MAXDELETE (default 100).

The current proxmox.sh runs apt-sync.py for the deb repos and lftp for
the ISO directory. This rewrite replaces both with a single tsumugu run
plus a small Python parser for /iso, which is necessary because:

  1. download.proxmox.com is no longer a single Apache directory tree;
     /iso/ now serves a custom HTML page (not autoindex).
  2. tsumugu correctly handles both nginx-style autoindex and the custom
     HTML, given a parser hint, so we get a single coherent walk over
     /debian, /pmg, /pve, /devel and /iso instead of two tools.
  3. Splitting deb sync (apt-sync) from ISO sync (lftp) means partial
     failures on one side leave the mirror in inconsistent state.

The Python /iso parser:
  - extracts <a href> entries, ignoring anchors and navigation
  - HEADs every candidate file, downloads only when size differs
  - writes via .tmp + os.replace (atomic)
  - bounds deletes by TUNASYNC_PROXMOX_ISO_MAXDELETE (default 100)

Behavior preserved:
  - same workdir layout (/debian, /pmg, /pve, /devel, /iso siblings)
  - same logging style
  - tunasync size reporting via REPO_SIZE_FILE / size-sum.sh

This is a full replacement, not an add-on; the old script does not run
correctly against today's /iso layout.
Copilot AI review requested due to automatic review settings May 24, 2026 09:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the Proxmox mirror sync wrapper to safely synchronize /debian and /images via tsumugu, while handling the non-indexed /iso page with a custom HTML parser/downloader to avoid incorrect parsing and unsafe deletions.

Changes:

  • Replace prior apt-sync/lftp approach with tsumugu sync scoped per subtree for safe cleanup behavior.
  • Add an embedded Python /iso sync stage that discovers artifacts from the custom HTML page, performs HEAD metadata checks, downloads atomically, and deletes stale files with a bounded limit.
  • Add extensive in-script rationale documenting why /iso must be handled separately.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread proxmox.sh Outdated
# requires: wget, timeout
set -e
set -o pipefail
set -eu
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Switched 'set -eu' to 'set -euo pipefail' so a failure inside the USERAGENT pipeline (or any future pipeline) is no longer hidden by a successful tail.

Comment thread proxmox.sh
WORKDIR=${TUNASYNC_WORKING_DIR:?TUNASYNC_WORKING_DIR is required}
THREADS=${TUNASYNC_TSUMUGU_THREADS:-1}
MAXDELETE=${TUNASYNC_TSUMUGU_MAXDELETE:-10000}
USERAGENT=${TUNASYNC_TSUMUGU_USERAGENT:-"tsumugu/$(tsumugu --version | tail -n1 | cut -d' ' -f2)"}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the line-2 reply: pipefail is now on.

Comment thread proxmox.sh
WORKDIR=${TUNASYNC_WORKING_DIR:?TUNASYNC_WORKING_DIR is required}
THREADS=${TUNASYNC_TSUMUGU_THREADS:-1}
MAXDELETE=${TUNASYNC_TSUMUGU_MAXDELETE:-10000}
USERAGENT=${TUNASYNC_TSUMUGU_USERAGENT:-"tsumugu/$(tsumugu --version | tail -n1 | cut -d' ' -f2)"}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. After computing USERAGENT we 'export TUNASYNC_TSUMUGU_USERAGENT="$USERAGENT"' so the Python /iso stage inherits the same value. The tsumugu invocation continues to pass --user-agent "$USERAGENT". Both stages now agree on UA.

Comment thread proxmox.sh
base = os.environ.get('TUNASYNC_UPSTREAM_URL', 'http://download.proxmox.com/').rstrip('/') + '/iso/'
work = Path(os.environ['TUNASYNC_WORKING_DIR']) / 'iso'
work.mkdir(parents=True, exist_ok=True)
user_agent = os.environ.get('TUNASYNC_TSUMUGU_USERAGENT', 'tsumugu')
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the line-35 reply: USERAGENT is exported into the Python stage's environment.

Comment thread proxmox.sh
# separately reviewed after implementation and all official *.sha256 files
# matched their local *.iso payloads. Future reviews can repeat that check.
socket.setdefaulttimeout(60)
base = os.environ.get('TUNASYNC_UPSTREAM_URL', 'http://download.proxmox.com/').rstrip('/') + '/iso/'
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The Python stage now derives 'allowed_netloc = urllib.parse.urlparse(base).netloc' from TUNASYNC_UPSTREAM_URL and validates each candidate URL against that. The hardcoded 'download.proxmox.com' string is gone.

Comment thread proxmox.sh
Comment on lines +133 to +136
url = urllib.parse.urljoin(base, href)
parsed = urllib.parse.urlparse(url)
if parsed.scheme not in ('http', 'https') or parsed.netloc != 'download.proxmox.com':
continue
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the line-107 reply.

Comment thread proxmox.sh
Comment on lines +167 to +170
head_req = urllib.request.Request(url, method='HEAD', headers={'User-Agent': user_agent})
with urllib.request.urlopen(head_req, timeout=30) as resp:
size = int(resp.headers.get('Content-Length', '-1'))
lm = resp.headers.get('Last-Modified')
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a Range: bytes=0-0 GET fallback. When HEAD raises an exception we issue 'Range: bytes=0-0', extract Content-Range/Last-Modified, and proceed. Existing files are kept on transient failure; missing files still produce a non-zero exit at the end of the run.

Comment thread proxmox.sh Outdated
Comment on lines +175 to +184
except Exception as e:
# A transient HEAD failure for a file we already have should not delete
# or re-download the local file. A missing local file plus HEAD failure
# is recorded as an error so the job exits non-zero after processing.
if target.exists():
print(f'proxmox iso: HEAD failed for existing {name}: {e}; keeping local file', flush=True)
continue
print(f'proxmox iso: HEAD failed for missing {name}: {e}', file=sys.stderr, flush=True)
errors.append(f'{name}: HEAD {e}')
continue
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the line-170 reply: GET-with-Range fallback now covers HEAD-rejecting CDNs.

Address review feedback:
- Switch from 'set -eu' to 'set -euo pipefail' so failures inside the
  USERAGENT pipeline (and any future pipelines) are caught.
- Export TUNASYNC_TSUMUGU_USERAGENT after computing it so the value
  passed to tsumugu sync also reaches the Python /iso stage; the two
  stages can no longer disagree on User-Agent.
- Derive the allowed origin for /iso file URLs from
  TUNASYNC_UPSTREAM_URL via urlparse(...).netloc instead of hardcoding
  'download.proxmox.com', so a different upstream tunasync deploys
  through this script does not get rejected.
- Add a Range: bytes=0-0 GET fallback when HEAD fails (some CDNs
  reject HEAD with 405 or 403). We extract size from Content-Range
  and Last-Modified from the same response. Existing local files are
  preserved on transient failure; missing files still raise an error
  so the run exits non-zero.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants