Add pip-index.py for PEP 503 simple-index style mirrors#205

Open
yaoge123 wants to merge 3 commits into tuna:master from yaoge123:add-pip-index-py

Contributor

@yaoge123 yaoge123 commented May 24, 2026

Summary

Add pip-index.py, a generic mirror script for PEP 503 / PEP 691 "simple" index style upstreams.

Adapted from ustclug/ustcmirror-images pytorch/sync.py (originally pytorch.py). Crawls a simple HTML index recursively and rewrites href attributes so saved index pages point back to this mirror. The same script can drive multiple tunasync jobs by parameterising endpoint discovery and href rewriting via environment variables.

Additions over the upstream script

| Variable | Purpose |
| --- | --- |
| REWRITE_HOSTS | Multi-host href rewrite. Handles upstreams that emit absolute URLs across multiple origins, e.g. download.pytorch.org and download-r2.pytorch.org. |
| EXTRA_REWRITES | Redirect hrefs that point at sibling hosts (e.g. files.pythonhosted.org) to a sibling local mirror prefix (e.g. /pypi/web), so transitive deps stay on the same mirror without forcing users to set --extra-index-url. |
| DEVPI_MODE | Query a devpi server's channel JSON API and crawl only that channel's own projects via .../<channel>/+simple/<project>/ instead of walking the full inherited PyPI namespace. |
| USE_PYTORCH_RELEASES, GET_ALL, NO_NIGHTLY | Retain PyTorch-specific behaviours from the original sync.py. |
| URLBASE, JOBS, TIMEOUT, DRY_RUN | URLBASE defaults to /{TUNASYNC_MIRROR_NAME}/; JOBS and TIMEOUT have script-level defaults so most jobs need no env. |
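
As a rough illustration of what REWRITE_HOSTS and EXTRA_REWRITES are meant to achieve, the sketch below rewrites href attributes in a saved index page. The function name rewrite_hrefs, its parameters, the set/dict shapes, and the regex-based approach are all assumptions made for this example; the script's own rewrite_index() and the actual format of the environment variables may differ.

```python
import re
from typing import Dict, Set
from urllib.parse import urlparse

# Hypothetical helper, not the code from pip-index.py.
HREF_RE = re.compile(r"""href=(["'])(.*?)\1""")


def rewrite_hrefs(
    html: str,
    urlbase: str,
    rewrite_hosts: Set[str],
    extra_rewrites: Dict[str, str],
) -> str:
    """Point absolute hrefs on known hosts back at local mirror prefixes."""

    def fix(match: "re.Match[str]") -> str:
        quote, href = match.group(1), match.group(2)
        parsed = urlparse(href)
        # Keep the query and the fragment: PEP 503 links carry the file
        # hash in the fragment (for example "#sha256=...").
        suffix = parsed.path
        if parsed.query:
            suffix += "?" + parsed.query
        if parsed.fragment:
            suffix += "#" + parsed.fragment
        if parsed.netloc in rewrite_hosts:
            href = urlbase.rstrip("/") + suffix
        elif parsed.netloc in extra_rewrites:
            href = extra_rewrites[parsed.netloc].rstrip("/") + suffix
        # Relative hrefs and unknown hosts are left untouched.
        return f"href={quote}{href}{quote}"

    return HREF_RE.sub(fix, html)
```

With urlbase="/pytorch/", a link to https://download-r2.pytorch.org/whl/torch.whl#sha256=abc would become /pytorch/whl/torch.whl#sha256=abc, while a link to files.pythonhosted.org would be redirected under /pypi/web.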

aiohttp is used instead of httpx, so the script runs inside the existing shared tunathu/tunasync-scripts:latest image with no extra dependencies.

The Issue #109-style fix (rewrite absolute download-r2.pytorch.org URLs) was originally proposed as ustclug/ustcmirror-images#154.

Copilot AI review requested due to automatic review settings May 24, 2026 09:21

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a standalone async mirroring script for PEP 503 / PEP 691 “simple” Python package indexes, including href rewriting and optional devpi channel expansion.

Changes:

  • Introduces pip-index.py crawler/downloader built on aiohttp with concurrency control.
  • Adds configurable href rewriting (REWRITE_HOSTS, EXTRA_REWRITES) to point saved indexes back to local mirrors.
  • Adds optional devpi JSON-based discovery mode (DEVPI_MODE) and optional PyTorch releases URL discovery.

Comment thread pip-index.py Outdated
Comment on lines +268 to +270
path = unquote(urlparse(url).path)
while path.startswith("/"):
    path = path[1:]
Contributor Author

Added safe_local_path() that joins the unquoted URL path under base and verifies the resolved path stays inside base (rejects '..', absolute path components, NUL). recursive_download now uses it for both index pages and downloaded files.
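
For readers who want to see the shape of such a check, here is a minimal sketch of a path-containment helper along the lines described above. It is not the code from the PR: the signature, the error type, and the exact order of checks are assumptions for illustration.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def safe_local_path(base: Path, url: str) -> Path:
    """Map a URL path to a location under base, or raise ValueError."""
    raw = unquote(urlparse(url).path)
    if "\x00" in raw:
        raise ValueError(f"NUL byte in URL path: {url!r}")
    # Splitting on "/" and dropping empty parts removes leading slashes,
    # so an absolute URL path can never replace base when joined.
    parts = [p for p in raw.split("/") if p]
    if ".." in parts:
        raise ValueError(f"parent directory reference in URL path: {url!r}")
    base_resolved = base.resolve()
    candidate = (base_resolved / Path(*parts)).resolve()
    # The resolve() comparison additionally catches symlinks that
    # would lead outside base.
    if candidate != base_resolved and base_resolved not in candidate.parents:
        raise ValueError(f"path escapes mirror root: {url!r}")
    return candidate
```

Percent-encoded traversal such as %2e%2e is decoded by unquote() before the check, so it is rejected as well.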

Comment thread pip-index.py
Comment on lines +271 to +281
if url.endswith("/") or url.endswith(".html"):
    # index.html (current) or torch_stable.html (old)
    async with sem:
        logging.info(f"Getting {url}")
        contents = await get_with_progress(client, url)
    index_resp = contents.decode("utf-8")
    if url.endswith("/"):
        filename = "index.html"
    else:
        filename = url.split("/")[-1]
        assert filename.endswith(".html"), f"Unexpected HTML file: {filename}"
Contributor Author

Fixed. When the URL ends in .html, the local path now drops the .html filename to obtain the parent directory, so we write index_resp to <parent>/<filename> instead of mkdir-ing on top of <full path>.
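
The directory/filename split described in this reply can be sketched as follows. The helper name split_index_target is hypothetical and the function only illustrates the idea; it is not taken from the PR.

```python
from typing import Tuple


def split_index_target(url_path: str) -> Tuple[str, str]:
    """Return (directory, filename) for an index page URL path.

    A path ending in "/" is a directory index saved as index.html.
    A path ending in ".html" names a file, so only its parent is a
    directory and the last component is the filename.
    """
    if url_path.endswith("/"):
        return url_path.strip("/"), "index.html"
    parent, _, filename = url_path.strip("/").rpartition("/")
    return parent, filename
```

For /whl/torch_stable.html this yields the directory whl and the filename torch_stable.html, so the caller creates whl/ rather than a directory named after the HTML file.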

Comment thread pip-index.py
Comment on lines +308 to +312
if not dry_run:
    index_resp = rewrite_index(index_resp)
    os.makedirs(base / path, exist_ok=True)
    with overwrite(base / path / filename, "w") as f:
        f.write(index_resp)
Contributor Author

Same as the line-281 reply.

Comment thread pip-index.py
pass


async def get_with_progress(client: aiohttp.ClientSession, url: str) -> bytes:
Contributor Author

@yaoge123 yaoge123 May 24, 2026

Updated: I rolled the asyncio.to_thread(fh.write, chunk) wrap back. Looking at the rest of the repo, every other downloader (apt-sync.py, adoptium.py, github-release.py, homebrew-bottles.py, docker-ce.py, yum-sync.py, github-raw.py, pub-mirror.py, nixos-images.py, nix-channels.py) just calls f.write(chunk) synchronously. The aiohttp-side chunks are already small (iter_chunked(65536)) and disk writes here are well below the network rate, so handing each write off to a thread adds overhead without solving a real problem. Concurrency comes from the per-file JOBS semaphore on the network side. Kept stream_to_file for the .tmp -> rename atomicity. Reverted in de8d0d5.
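
To make the ".tmp -> rename" behaviour concrete, here is a minimal sketch of a streaming writer with synchronous chunk writes. It is written against a generic async iterator of bytes rather than an aiohttp response, and the signature and return value are assumptions for this example, not the PR's stream_to_file.

```python
import asyncio
import os
from pathlib import Path
from typing import AsyncIterator


async def stream_to_file(chunks: AsyncIterator[bytes], dest: Path) -> int:
    """Write chunks to a sibling .tmp file, then rename it over dest."""
    tmp = dest.with_name(dest.name + ".tmp")
    dest.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    try:
        with open(tmp, "wb") as fh:
            async for chunk in chunks:
                # Plain synchronous write, as in the other downloaders.
                fh.write(chunk)
                written += len(chunk)
        # os.replace is atomic when tmp and dest are on the same
        # filesystem, so readers never observe a partial file at dest.
        os.replace(tmp, dest)
    except BaseException:
        # Remove the partial file on any failure, including cancellation.
        tmp.unlink(missing_ok=True)
        raise
    return written
```

With aiohttp, the chunks argument could be resp.content.iter_chunked(65536), which is an async iterator of bytes.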

Comment thread pip-index.py
Comment on lines +195 to +199
chunks = []
try:
    async for chunk in resp.content.iter_chunked(65536):
        downloaded += len(chunk)
        chunks.append(chunk)
Contributor Author

@yaoge123 yaoge123 May 24, 2026

Updated: I rolled the asyncio.to_thread(fh.write, chunk) wrap back. Looking at the rest of the repo, every other downloader (apt-sync.py, adoptium.py, github-release.py, homebrew-bottles.py, docker-ce.py, yum-sync.py, github-raw.py, pub-mirror.py, nixos-images.py, nix-channels.py) just calls f.write(chunk) synchronously. The aiohttp-side chunks are already small (iter_chunked(65536)) and disk writes here are well below the network rate, so handing each write off to a thread adds overhead without solving a real problem. Concurrency comes from the per-file JOBS semaphore on the network side. Kept stream_to_file for the .tmp -> rename atomicity. Reverted in de8d0d5.

Comment thread pip-index.py
    suburl = urljoin(upstream_base, suburl)
else:
    suburl = urljoin(url, suburl)
tasks.append(asyncio.create_task(recursive_download(client, suburl)))
Contributor Author

Same as the line-292 reply.

Comment thread pip-index.py
Comment on lines +306 to +307
if tasks:
    await asyncio.gather(*tasks)
Contributor Author

Same as the line-292 reply.

Comment thread pip-index.py Outdated
Comment on lines +314 to +315
if (base / path).exists():
    return
Contributor Author

Same as the line-292 reply.

Comment thread pip-index.py Outdated
    return b"".join(chunks)
except Exception as e:
    if attempt == 2:
        raise e
Contributor Author

Switched to bare 'raise' inside the except blocks (both in get_with_progress and recursive_download) so the original traceback is preserved.
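
A retry loop that re-raises with a bare raise might look like the sketch below. The helper name fetch_with_retries and its parameters are hypothetical; only the three-attempt limit mirrors the `attempt == 2` check in the quoted diff.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def fetch_with_retries(
    fetch: Callable[[], Awaitable[T]],
    attempts: int = 3,
    delay: float = 0.0,
) -> T:
    """Call fetch() up to `attempts` times, re-raising the last failure."""
    for attempt in range(attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == attempts - 1:
                # Bare raise re-raises the exception currently being
                # handled without naming it again.
                raise
            await asyncio.sleep(delay)
    raise RuntimeError("attempts must be at least 1")
```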

Comment thread pip-index.py Outdated
if e.status == 403:
    logging.warning(f"Forbidden: {url}, skipping.")
else:
    raise e
Contributor Author

Same as the line-209 reply: bare 'raise' now.

Adapted from ustclug/ustcmirror-images pytorch/sync.py (originally
pytorch.py). Crawls PEP 503 / PEP 691 simple HTML indexes recursively
and rewrites href attributes so saved index pages point back to this
mirror. The same script is used for multiple jobs by parameterising
endpoint discovery and href rewriting via environment variables.

Additions on top of the upstream script:

  - Multi-host href rewrite (REWRITE_HOSTS):
    handles upstreams that emit absolute URLs across multiple origins
    (e.g. download.pytorch.org + download-r2.pytorch.org).

  - EXTRA_REWRITES:
    redirect hrefs that point at sibling hosts (e.g. files.pythonhosted.org)
    to a sibling local mirror prefix (e.g. /pypi/web), so transitive deps
    stay on the same mirror without forcing users to set --extra-index-url.

  - DEVPI_MODE:
    query a devpi server's channel JSON API and crawl only that channel's
    own projects via .../<channel>/+simple/<project>/ instead of walking
    the full inherited PyPI namespace.

  - aiohttp instead of httpx, so it can run inside the existing shared
    tunathu/tunasync-scripts:latest image with no extra dependencies.
@yaoge123 yaoge123 force-pushed the add-pip-index-py branch from c472eb1 to 39aeb9a Compare May 24, 2026 09:25
yaoge123 added 2 commits May 24, 2026 17:50
Address review feedback:
- Add safe_local_path() that resolves URL paths under base and rejects
  any candidate that escapes base (rejects '..', absolute paths, NUL).
- Stop treating '.../torch_stable.html' as a directory: when the URL
  ends with .html, write into <parent>/<filename>, not <full path>/.
- Add stream_to_file(): wheels are now written chunk-by-chunk through
  asyncio.to_thread(fh.write, chunk) into a sibling .tmp file, which is
  unlinked on failure and atomically renamed on success. The previous
  implementation buffered the whole response in memory before writing.
- Maintain a process-wide visited URL set so cross-linked index pages
  do not re-enter recursive_download() for the same URL.
- Replace 'raise e' in get_with_progress and recursive_download with a
  bare 'raise' so the original traceback is preserved.

On reflection after the Copilot review: match the rest of the repo
(apt-sync.py, adoptium.py, github-release.py, etc.), which all do a plain
f.write(chunk) inside the download loop. With the default JOBS=1 there is
no concurrent coroutine to starve, and asyncio.to_thread per chunk adds
~50us of thread scheduling overhead that outweighs any event-loop
unblocking benefit.
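
The process-wide visited set mentioned in the first of these two commits, combined with a JOBS-style semaphore, can be sketched as below. The crawl() function and its fetch_links callback are hypothetical stand-ins for recursive_download() and the real HTTP fetch; this shows the deduplication idea only.

```python
import asyncio
from typing import Awaitable, Callable, List, Set


async def crawl(
    start: str,
    fetch_links: Callable[[str], Awaitable[List[str]]],
    jobs: int = 1,
) -> Set[str]:
    """Visit every reachable URL once, with at most `jobs` fetches in flight."""
    sem = asyncio.Semaphore(jobs)
    visited: Set[str] = set()

    async def visit(url: str) -> None:
        # The check and the add happen without an await in between, so
        # two tasks cannot both pass the check for the same URL.
        if url in visited:
            return
        visited.add(url)
        async with sem:
            links = await fetch_links(url)
        # The semaphore is released before recursing, so child tasks
        # are not blocked by their parent holding a slot.
        await asyncio.gather(*(visit(link) for link in links))

    await visit(start)
    return visited
```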