Add pip-index.py for PEP 503 simple-index style mirrors by yaoge123 · Pull Request #205 · tuna/tunasync-scripts

yaoge123 · 2026-05-24T09:21:48Z

Summary

Add pip-index.py, a generic mirror script for PEP 503 / PEP 691 "simple" index style upstreams.

Adapted from ustclug/ustcmirror-images pytorch/sync.py (originally pytorch.py). Crawls a simple HTML index recursively and rewrites href attributes so saved index pages point back to this mirror. The same script can drive multiple tunasync jobs by parameterising endpoint discovery and href rewriting via environment variables.

Additions over the upstream script

| Variable | Purpose |
| --- | --- |
| `REWRITE_HOSTS` | Multi-host href rewrite. Handles upstreams that emit absolute URLs across multiple origins, e.g. `download.pytorch.org` and `download-r2.pytorch.org`. |
| `EXTRA_REWRITES` | Redirect hrefs that point at sibling hosts (e.g. `files.pythonhosted.org`) to a sibling local mirror prefix (e.g. `/pypi/web`), so transitive deps stay on the same mirror without forcing users to set `--extra-index-url`. |
| `DEVPI_MODE` | Query a devpi server's channel JSON API and crawl only that channel's own projects via `.../<channel>/+simple/<project>/` instead of walking the full inherited PyPI namespace. |
| `USE_PYTORCH_RELEASES`, `GET_ALL`, `NO_NIGHTLY` | Retain PyTorch-specific behaviours from the original sync.py. |
| `URLBASE`, `JOBS`, `TIMEOUT`, `DRY_RUN` | `URLBASE` defaults to `/{TUNASYNC_MIRROR_NAME}/`; `JOBS` and `TIMEOUT` have script-level defaults, so most jobs need no extra environment variables. |

aiohttp is used instead of httpx, so the script runs inside the existing shared tunathu/tunasync-scripts:latest image with no extra dependencies.

The Issue #109-style fix (rewrite absolute download-r2.pytorch.org URLs) was originally proposed as ustclug/ustcmirror-images#154.
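
The href rewriting described for `REWRITE_HOSTS` and `EXTRA_REWRITES` can be pictured with a minimal sketch. It is not the code from pip-index.py: the function name `rewrite_hrefs`, its parameters, and the regex-based approach are assumptions made for illustration, and the real script reads its configuration from environment variables whose exact format is not shown here.

```python
import re
from urllib.parse import urlparse

# Hypothetical helper, not taken from pip-index.py.
HREF_RE = re.compile(r'href="([^"]*)"')


def rewrite_hrefs(html, urlbase, rewrite_hosts, extra_rewrites):
    """Point absolute hrefs at local mirror prefixes.

    rewrite_hosts:  set of upstream hostnames served under `urlbase`.
    extra_rewrites: dict mapping a sibling hostname to its local prefix.
    """

    def fix(match):
        parsed = urlparse(match.group(1))
        if parsed.netloc in rewrite_hosts:
            prefix = urlbase
        elif parsed.netloc in extra_rewrites:
            prefix = extra_rewrites[parsed.netloc]
        else:
            # Relative links and unknown hosts are left untouched.
            return match.group(0)
        local = prefix.rstrip("/") + parsed.path
        if parsed.query:
            local += "?" + parsed.query
        if parsed.fragment:
            # Keep fragments such as "#sha256=..." that pip uses for hashes.
            local += "#" + parsed.fragment
        return f'href="{local}"'

    return HREF_RE.sub(fix, html)
```

With `urlbase="/pytorch/"`, a link to `https://download-r2.pytorch.org/whl/x.whl#sha256=abc` would become `/pytorch/whl/x.whl#sha256=abc`, while a relative link such as `foo/` is left as it is.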

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a standalone async mirroring script for PEP 503 / PEP 691 “simple” Python package indexes, including href rewriting and optional devpi channel expansion.

Changes:

Introduces pip-index.py crawler/downloader built on aiohttp with concurrency control.
Adds configurable href rewriting (REWRITE_HOSTS, EXTRA_REWRITES) to point saved indexes back to local mirrors.
Adds optional devpi JSON-based discovery mode (DEVPI_MODE) and optional PyTorch releases URL discovery.


yaoge123 · 2026-05-24T09:50:24Z

+    path = unquote(urlparse(url).path)
+    while path.startswith("/"):
+        path = path[1:]


Added safe_local_path() that joins the unquoted URL path under base and verifies the resolved path stays inside base (rejects '..', absolute path components, NUL). recursive_download now uses it for both index pages and downloaded files.
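
A containment check of this kind might look roughly like the sketch below. Only the name `safe_local_path` comes from the reply above; the signature, the use of `Path.resolve()`, and the `ValueError` are assumptions, and the version in the PR may differ.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def safe_local_path(base: Path, url: str) -> Path:
    """Map a URL to a path under `base`, refusing anything that escapes it."""
    rel = unquote(urlparse(url).path).lstrip("/")
    if "\x00" in rel:
        raise ValueError(f"NUL byte in URL path: {url!r}")
    base = base.resolve()
    # resolve() collapses '..' components, so traversal shows up as a
    # candidate that no longer has `base` among its parents.
    candidate = (base / rel).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"URL escapes mirror root: {url!r}")
    return candidate
```

Stripping leading slashes before joining matters here: joining an absolute path onto `base` with `/` would silently discard `base`.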

yaoge123 · 2026-05-24T09:50:26Z

+    if url.endswith("/") or url.endswith(".html"):
+        # index.html (current) or torch_stable.html (old)
+        async with sem:
+            logging.info(f"Getting {url}")
+            contents = await get_with_progress(client, url)
+            index_resp = contents.decode("utf-8")
+            if url.endswith("/"):
+                filename = "index.html"
+            else:
+                filename = url.split("/")[-1]
+                assert filename.endswith(".html"), f"Unexpected HTML file: {filename}"


Fixed. When the URL ends in .html, the local path now drops the .html filename to obtain the parent directory, so we write index_resp to <parent>/<filename> instead of mkdir-ing on top of <full path>.
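
The distinction between directory-style and `.html` index URLs can be sketched as follows. The helper name `index_target` is hypothetical and the PR may structure this differently; the point is only that the directory to create is the parent of the `.html` file, not the full path.

```python
from urllib.parse import unquote, urlparse


def index_target(url):
    """Return (relative_dir, filename) for an index URL."""
    path = unquote(urlparse(url).path).lstrip("/")
    if url.endswith("/"):
        # Directory-style index: save as index.html inside that directory.
        return path.rstrip("/"), "index.html"
    # File-style index such as torch_stable.html: the directory is the
    # parent of the file.
    directory, _, filename = path.rpartition("/")
    return directory, filename
```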

yaoge123 · 2026-05-24T09:50:27Z

+        if not dry_run:
+            index_resp = rewrite_index(index_resp)
+            os.makedirs(base / path, exist_ok=True)
+            with overwrite(base / path / filename, "w") as f:
+                f.write(index_resp)


Same as the line-281 reply.

yaoge123 · 2026-05-24T09:50:29Z

+        pass
+
+
+async def get_with_progress(client: aiohttp.ClientSession, url: str) -> bytes:


Updated: I rolled the asyncio.to_thread(fh.write, chunk) wrap back. Looking at the rest of the repo, every other downloader (apt-sync.py, adoptium.py, github-release.py, homebrew-bottles.py, docker-ce.py, yum-sync.py, github-raw.py, pub-mirror.py, nixos-images.py, nix-channels.py) just calls f.write(chunk) synchronously. The aiohttp-side chunks are already small (iter_chunked(65536)) and disk writes here are well below the network rate, so handing each write off to a thread adds overhead without solving a real problem. Concurrency comes from the per-file JOBS semaphore on the network side. Kept stream_to_file for the .tmp -> rename atomicity. Reverted in de8d0d5.
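
The ".tmp -> rename" pattern with plain synchronous writes could be sketched like this. The name `stream_to_file` appears in the reply above, but the signature and body here are assumptions; in the script the chunks would presumably come from `resp.content.iter_chunked(65536)`.

```python
import asyncio
import os
from pathlib import Path
from typing import AsyncIterator


async def stream_to_file(chunks: AsyncIterator[bytes], dest: Path) -> int:
    """Write chunks to a sibling .tmp file, then rename it onto dest."""
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    try:
        with open(tmp, "wb") as fh:
            async for chunk in chunks:
                fh.write(chunk)  # plain synchronous write, no to_thread
                written += len(chunk)
        # os.replace is atomic when tmp and dest are on the same filesystem.
        os.replace(tmp, dest)
    except BaseException:
        # Remove the partial file so a failed download leaves nothing behind.
        tmp.unlink(missing_ok=True)
        raise
    return written
```

Readers of the mirror directory then see either the complete file or no file at all, never a truncated one.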

yaoge123 · 2026-05-24T09:50:30Z

+                chunks = []
+                try:
+                    async for chunk in resp.content.iter_chunked(65536):
+                        downloaded += len(chunk)
+                        chunks.append(chunk)


Updated: I rolled the asyncio.to_thread(fh.write, chunk) wrap back. Looking at the rest of the repo, every other downloader (apt-sync.py, adoptium.py, github-release.py, homebrew-bottles.py, docker-ce.py, yum-sync.py, github-raw.py, pub-mirror.py, nixos-images.py, nix-channels.py) just calls f.write(chunk) synchronously. The aiohttp-side chunks are already small (iter_chunked(65536)) and disk writes here are well below the network rate, so handing each write off to a thread adds overhead without solving a real problem. Concurrency comes from the per-file JOBS semaphore on the network side. Kept stream_to_file for the .tmp -> rename atomicity. Reverted in de8d0d5.

yaoge123 · 2026-05-24T09:50:37Z

+                suburl = urljoin(upstream_base, suburl)
+            else:
+                suburl = urljoin(url, suburl)
+            tasks.append(asyncio.create_task(recursive_download(client, suburl)))


Same as the line-292 reply.

yaoge123 · 2026-05-24T09:50:39Z

+        if tasks:
+            await asyncio.gather(*tasks)


Same as the line-292 reply.

yaoge123 · 2026-05-24T09:50:41Z

+        if (base / path).exists():
+            return


Same as the line-292 reply.

yaoge123 · 2026-05-24T09:50:42Z

+                return b"".join(chunks)
+        except Exception as e:
+            if attempt == 2:
+                raise e


Switched to bare 'raise' inside the except blocks (both in get_with_progress and recursive_download) so the original traceback is preserved.
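
A retry loop using a bare `raise` might look like the sketch below. The helper `fetch_with_retries` and its parameters are hypothetical; the three-attempt limit mirrors the `attempt == 2` check in the quoted diff.

```python
import asyncio


async def fetch_with_retries(fetch, attempts=3, delay=0.0):
    """Call the async callable `fetch`, retrying on any Exception."""
    for attempt in range(attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == attempts - 1:
                # Bare raise re-raises the exception currently being handled.
                raise
            await asyncio.sleep(delay)
```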

yaoge123 · 2026-05-24T09:50:44Z

+                    if e.status == 403:
+                        logging.warning(f"Forbidden: {url}, skipping.")
+                    else:
+                        raise e


Same as the line-209 reply: bare 'raise' now.

Adapted from ustclug/ustcmirror-images pytorch/sync.py (originally pytorch.py). Crawls PEP 503 / PEP 691 simple HTML indexes recursively and rewrites href attributes so saved index pages point back to this mirror. The same script is used for multiple jobs by parameterising endpoint discovery and href rewriting via environment variables.

Additions on top of the upstream script:

- Multi-host href rewrite (REWRITE_HOSTS): handles upstreams that emit absolute URLs across multiple origins (e.g. download.pytorch.org + download-r2.pytorch.org).
- EXTRA_REWRITES: redirect hrefs that point at sibling hosts (e.g. files.pythonhosted.org) to a sibling local mirror prefix (e.g. /pypi/web), so transitive deps stay on the same mirror without forcing users to set --extra-index-url.
- DEVPI_MODE: query a devpi server's channel JSON API and crawl only that channel's own projects via .../<channel>/+simple/<project>/ instead of walking the full inherited PyPI namespace.
- aiohttp instead of httpx, so it can run inside the existing shared tunathu/tunasync-scripts:latest image with no extra dependencies.

Address review feedback:

- Add safe_local_path() that resolves URL paths under base and rejects any candidate that escapes base (rejects '..', absolute paths, NUL).
- Stop treating '.../torch_stable.html' as a directory: when the URL ends with .html, write into <parent>/<filename>, not <full path>/.
- Add stream_to_file(): wheels are now written chunk-by-chunk through asyncio.to_thread(fh.write, chunk) into a sibling .tmp file, which is unlinked on failure and atomically renamed on success. The previous implementation buffered the whole response in memory before writing.
- Maintain a process-wide visited URL set so cross-linked index pages do not re-enter recursive_download() for the same URL.
- Replace 'raise e' in get_with_progress and recursive_download with a bare 'raise' so the original traceback is preserved.
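
The visited-set guard mentioned in this commit message can be illustrated with a small sketch. The names `crawl` and `links_of` are hypothetical stand-ins (the script's function is recursive_download, whose body is not shown here); the sketch only demonstrates why checking and adding with no `await` in between is enough in a single event loop.

```python
import asyncio

# Process-wide set of URLs that have already been scheduled.
visited = set()


async def crawl(url, links_of):
    """Visit `url` once, then recurse into the links it reports."""
    if url in visited:
        return
    # No await sits between the check and the add, so two coroutines in
    # the same event loop cannot both pass the check for one URL.
    visited.add(url)
    children = await links_of(url)
    await asyncio.gather(*(crawl(child, links_of) for child in children))
```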

On reflection after the Copilot review: this matches the rest of the repo (apt-sync.py, adoptium.py, github-release.py, etc.), where every downloader calls plain f.write(chunk) inside the download loop. With the default JOBS=1 there is no concurrent coroutine to starve, and asyncio.to_thread per chunk adds ~50us of thread-scheduling overhead, which outweighs any event-loop unblocking benefit.

Copilot AI review requested due to automatic review settings May 24, 2026 09:21

Copilot AI reviewed May 24, 2026


yaoge123 force-pushed the add-pip-index-py branch from c472eb1 to 39aeb9a Compare May 24, 2026 09:25

yaoge123 added 2 commits May 24, 2026 17:50

yaoge123 wants to merge 3 commits into tuna:master from yaoge123:add-pip-index-py
