Add pip-index.py for PEP 503 simple-index style mirrors #205
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a standalone async mirroring script for PEP 503 / PEP 691 “simple” Python package indexes, including href rewriting and optional devpi channel expansion.
Changes:
- Introduces the pip-index.py crawler/downloader, built on aiohttp with concurrency control.
- Adds configurable href rewriting (REWRITE_HOSTS, EXTRA_REWRITES) to point saved indexes back to local mirrors.
- Adds an optional devpi JSON-based discovery mode (DEVPI_MODE) and optional PyTorch releases URL discovery.
| path = unquote(urlparse(url).path) | ||
| while path.startswith("/"): | ||
| path = path[1:] |
Added safe_local_path() that joins the unquoted URL path under base and verifies the resolved path stays inside base (rejects '..', absolute path components, NUL). recursive_download now uses it for both index pages and downloaded files.
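For readers following the thread, here is a minimal sketch of what a helper like safe_local_path() could look like. The signature, the ValueError error type, and the order of the checks are assumptions for illustration and are not taken from the PR; only the behaviour (join under base, reject '..', absolute components, and NUL) comes from the reply above.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def safe_local_path(base: Path, url: str) -> Path:
    """Map the path of `url` to a location under `base`, or raise ValueError.

    Hypothetical signature: the real helper in the PR may differ.
    """
    raw = unquote(urlparse(url).path)
    if "\x00" in raw:
        raise ValueError(f"NUL byte in URL path: {url!r}")

    # Dropping empty components removes the leading "/" (and any doubled
    # slashes), so the URL path can never be treated as absolute.
    parts = [p for p in raw.split("/") if p not in ("", ".")]
    if ".." in parts:
        raise ValueError(f"'..' component in URL path: {url!r}")

    base_resolved = base.resolve()
    candidate = base_resolved.joinpath(*parts).resolve()

    # Final containment check: catches anything the filters above missed,
    # for example a symlink inside base that points somewhere else.
    try:
        candidate.relative_to(base_resolved)
    except ValueError:
        raise ValueError(f"path escapes base: {url!r}") from None
    return candidate
```

The percent-decoding happens before the component check, so an encoded traversal such as `%2e%2e` is rejected the same way as a literal `..`.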
| if url.endswith("/") or url.endswith(".html"): | ||
| # index.html (current) or torch_stable.html (old) | ||
| async with sem: | ||
| logging.info(f"Getting {url}") | ||
| contents = await get_with_progress(client, url) | ||
| index_resp = contents.decode("utf-8") | ||
| if url.endswith("/"): | ||
| filename = "index.html" | ||
| else: | ||
| filename = url.split("/")[-1] | ||
| assert filename.endswith(".html"), f"Unexpected HTML file: {filename}" |
Fixed. When the URL ends in .html, the local path now drops the .html filename to obtain the parent directory, so we write index_resp to <parent>/<filename> instead of mkdir-ing on top of <full path>.
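The directory/filename split described in this reply can be sketched roughly as follows. index_target() is a hypothetical name, and the sketch deliberately leaves out the containment check that safe_local_path() performs; it only shows how a trailing-slash URL and a .html URL map to different local targets.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import unquote, urlparse


def index_target(url: str) -> Tuple[PurePosixPath, str]:
    """Return (directory, filename), both relative to the mirror root.

    Hypothetical helper for illustration only.
    """
    path = PurePosixPath(unquote(urlparse(url).path).lstrip("/"))
    if url.endswith("/"):
        # Directory-style index: the whole path is the directory.
        return path, "index.html"
    if url.endswith(".html"):
        # File-style index (e.g. torch_stable.html): the directory is the
        # parent, so the .html name is never created as a directory.
        return path.parent, path.name
    raise ValueError(f"not an index page URL: {url!r}")
```

With this split, `.../whl/torch_stable.html` is written as a file inside `whl/`, while `.../whl/cpu/` becomes `whl/cpu/index.html`.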
| if not dry_run: | ||
| index_resp = rewrite_index(index_resp) | ||
| os.makedirs(base / path, exist_ok=True) | ||
| with overwrite(base / path / filename, "w") as f: | ||
| f.write(index_resp) |
Same as the line-281 reply.
| pass | ||
| async def get_with_progress(client: aiohttp.ClientSession, url: str) -> bytes: |
Updated: I rolled the asyncio.to_thread(fh.write, chunk) wrap back. Looking at the rest of the repo, every other downloader (apt-sync.py, adoptium.py, github-release.py, homebrew-bottles.py, docker-ce.py, yum-sync.py, github-raw.py, pub-mirror.py, nixos-images.py, nix-channels.py) just calls f.write(chunk) synchronously. The aiohttp-side chunks are already small (iter_chunked(65536)) and disk writes here are well below the network rate, so handing each write off to a thread adds overhead without solving a real problem. Concurrency comes from the per-file JOBS semaphore on the network side. Kept stream_to_file for the .tmp -> rename atomicity. Reverted in de8d0d5.
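A rough sketch of the shape this reply describes: synchronous writes inside the async download loop, with a sibling .tmp file that is renamed on success and unlinked on failure. The signature is an assumption; in particular, taking a generic async iterator of chunks (rather than an aiohttp response) is a choice made here so the example stays self-contained.

```python
import asyncio
import os
from pathlib import Path
from typing import AsyncIterator


async def stream_to_file(chunks: AsyncIterator[bytes], dest: Path) -> int:
    """Write chunks to dest via a sibling .tmp file; return bytes written.

    Hypothetical signature: the PR's stream_to_file may take other arguments.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    try:
        with open(tmp, "wb") as fh:
            async for chunk in chunks:
                fh.write(chunk)  # plain synchronous write, no to_thread
                written += len(chunk)
        # os.replace swaps the finished file into place in one step, so a
        # partially written file is never visible under the final name.
        os.replace(tmp, dest)
    except BaseException:
        # Also covers task cancellation; remove the partial file, then
        # re-raise the active exception unchanged.
        tmp.unlink(missing_ok=True)
        raise
    return written
```

With aiohttp, the chunk source would typically be `resp.content.iter_chunked(65536)`, as in the snippet under review.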
| chunks = [] | ||
| try: | ||
| async for chunk in resp.content.iter_chunked(65536): | ||
| downloaded += len(chunk) | ||
| chunks.append(chunk) |
Updated: I rolled the asyncio.to_thread(fh.write, chunk) wrap back. Looking at the rest of the repo, every other downloader (apt-sync.py, adoptium.py, github-release.py, homebrew-bottles.py, docker-ce.py, yum-sync.py, github-raw.py, pub-mirror.py, nixos-images.py, nix-channels.py) just calls f.write(chunk) synchronously. The aiohttp-side chunks are already small (iter_chunked(65536)) and disk writes here are well below the network rate, so handing each write off to a thread adds overhead without solving a real problem. Concurrency comes from the per-file JOBS semaphore on the network side. Kept stream_to_file for the .tmp -> rename atomicity. Reverted in de8d0d5.
| suburl = urljoin(upstream_base, suburl) | ||
| else: | ||
| suburl = urljoin(url, suburl) | ||
| tasks.append(asyncio.create_task(recursive_download(client, suburl))) |
Same as the line-292 reply.
| if tasks: | ||
| await asyncio.gather(*tasks) |
Same as the line-292 reply.
| if (base / path).exists(): | ||
| return |
Same as the line-292 reply.
| return b"".join(chunks) | ||
| except Exception as e: | ||
| if attempt == 2: | ||
| raise e |
Switched to bare 'raise' inside the except blocks (both in get_with_progress and recursive_download) so the original traceback is preserved.
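As an illustration of the pattern, a retry loop with a bare raise on the final attempt might look like the sketch below. with_retries() and its parameters are hypothetical names; the PR's get_with_progress inlines its own loop, and the attempt count of 3 is inferred only from the `attempt == 2` check in the snippet above.

```python
import asyncio
import logging
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(fetch: Callable[[], Awaitable[T]], attempts: int = 3) -> T:
    """Call fetch() up to `attempts` times, re-raising the last failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == attempts - 1:
                # Bare raise re-raises the exception currently being
                # handled, without binding it to a name first.
                raise
            logging.warning("attempt %d failed, retrying", attempt + 1)
    raise AssertionError("unreachable")
```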
| if e.status == 403: | ||
| logging.warning(f"Forbidden: {url}, skipping.") | ||
| else: | ||
| raise e |
Same as the line-209 reply: bare 'raise' now.
Adapted from ustclug/ustcmirror-images pytorch/sync.py (originally
pytorch.py). Crawls PEP 503 / PEP 691 simple HTML indexes recursively
and rewrites href attributes so saved index pages point back to this
mirror. The same script is used for multiple jobs by parameterising
endpoint discovery and href rewriting via environment variables.
Additions on top of the upstream script:
- Multi-host href rewrite (REWRITE_HOSTS):
handles upstreams that emit absolute URLs across multiple origins
(e.g. download.pytorch.org + download-r2.pytorch.org).
- EXTRA_REWRITES:
redirect hrefs that point at sibling hosts (e.g. files.pythonhosted.org)
to a sibling local mirror prefix (e.g. /pypi/web), so transitive deps
stay on the same mirror without forcing users to set --extra-index-url.
- DEVPI_MODE:
query a devpi server's channel JSON API and crawl only that channel's
own projects via .../<channel>/+simple/<project>/ instead of walking
the full inherited PyPI namespace.
- aiohttp instead of httpx, so it can run inside the existing shared
tunathu/tunasync-scripts:latest image with no extra dependencies.
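To make the two rewrite rules concrete, here is a hedged sketch of how REWRITE_HOSTS-style and EXTRA_REWRITES-style rules could be applied to a saved index page. The function signature, the regular expression, and the way the rules are passed in (a set of hostnames and a prefix-to-prefix mapping) are assumptions; the commit message does not specify the environment variable formats, and the script's actual rewrite_index() takes only the page text.

```python
import re
from typing import Dict, Set
from urllib.parse import urlparse

# Assumes double-quoted href attributes, which is a simplification.
HREF_RE = re.compile(r'href="([^"]*)"')


def rewrite_index(
    html: str,
    rewrite_hosts: Set[str],
    extra_rewrites: Dict[str, str],
    local_prefix: str,
) -> str:
    """Point absolute hrefs back at local mirror prefixes (illustrative)."""

    def fix(match: "re.Match[str]") -> str:
        href = match.group(1)
        # EXTRA_REWRITES-style rule: sibling host -> sibling local prefix.
        for src, dst in extra_rewrites.items():
            if href.startswith(src):
                return f'href="{dst}{href[len(src):]}"'
        # REWRITE_HOSTS-style rule: any listed origin -> this mirror.
        parsed = urlparse(href)
        if parsed.scheme in ("http", "https") and parsed.hostname in rewrite_hosts:
            rest = href.split(parsed.netloc, 1)[1]  # keeps query and #fragment
            return f'href="{local_prefix}{rest}"'
        return match.group(0)  # relative links are left untouched

    return HREF_RE.sub(fix, html)
```

Keeping the fragment matters for simple indexes, since the `#sha256=...` suffix on file links carries the hash that pip verifies.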
c472eb1 to 39aeb9a
Address review feedback:
- Add safe_local_path() that resolves URL paths under base and rejects any candidate that escapes base (rejects '..', absolute paths, NUL).
- Stop treating '.../torch_stable.html' as a directory: when the URL ends with .html, write into <parent>/<filename>, not <full path>/.
- Add stream_to_file(): wheels are now written chunk-by-chunk through asyncio.to_thread(fh.write, chunk) into a sibling .tmp file, which is unlinked on failure and atomically renamed on success. The previous implementation buffered the whole response in memory before writing.
- Maintain a process-wide visited URL set so cross-linked index pages do not re-enter recursive_download() for the same URL.
- Replace 'raise e' in get_with_progress and recursive_download with a bare 'raise' so the original traceback is preserved.
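The visited-set idea from this commit can be sketched as below. crawl() and get_links are hypothetical stand-ins for recursive_download() and its link extraction; the point is only that the membership check and the add happen with no await in between, so two coroutines on the same event loop cannot both claim the same URL.

```python
import asyncio
from typing import Awaitable, Callable, List, Set

# Process-wide set of URLs that have already been claimed by a crawl task.
visited: Set[str] = set()


async def crawl(url: str, get_links: Callable[[str], Awaitable[List[str]]]) -> None:
    """Visit url once, then recurse into its links concurrently."""
    if url in visited:
        return
    visited.add(url)  # claimed before the first await, so no double entry

    links = await get_links(url)
    await asyncio.gather(*(crawl(link, get_links) for link in links))
```

Without the set, index pages that link to each other would keep re-entering the crawl for the same URL.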
Per Copilot review reflection: matching the rest of the repo (apt-sync.py, adoptium.py, github-release.py, etc.) which all do plain f.write(chunk) inside the download loop. With default JOBS=1 there is no concurrent coroutine to starve, and asyncio.to_thread per chunk adds ~50us thread scheduling overhead that outweighs any event-loop unblocking benefit.
Summary
Add pip-index.py, a generic mirror script for PEP 503 / PEP 691 "simple" index style upstreams.
Adapted from ustclug/ustcmirror-images pytorch/sync.py (originally pytorch.py). Crawls a simple HTML index recursively and rewrites href attributes so saved index pages point back to this mirror. The same script can drive multiple tunasync jobs by parameterising endpoint discovery and href rewriting via environment variables.
Additions over the upstream script
- REWRITE_HOSTS: handles upstreams that emit absolute URLs across multiple origins (e.g. download.pytorch.org and download-r2.pytorch.org).
- EXTRA_REWRITES: redirects hrefs that point at sibling hosts (e.g. files.pythonhosted.org) to a sibling local mirror prefix (e.g. /pypi/web), so transitive deps stay on the same mirror without forcing users to set --extra-index-url.
- DEVPI_MODE: queries a devpi server's channel JSON API and crawls only that channel's own projects via .../<channel>/+simple/<project>/ instead of walking the full inherited PyPI namespace.
- USE_PYTORCH_RELEASES, GET_ALL, NO_NIGHTLY
- URLBASE, JOBS, TIMEOUT, DRY_RUN: … /{TUNASYNC_MIRROR_NAME}/; JOBS and TIMEOUT have script-level defaults so most jobs need no env.
- aiohttp is used instead of httpx, so the script runs inside the existing shared tunathu/tunasync-scripts:latest image with no extra dependencies.
The Issue #109-style fix (rewrite absolute download-r2.pytorch.org URLs) was originally proposed as ustclug/ustcmirror-images#154.