Remove Wondershaper upload rate limit for Google API traffic in AoU #397

Closed
yonghaoy wants to merge 4 commits into master from remove-wondershaper-google-api-ratelimit

@yonghaoy yonghaoy commented May 7, 2026

Summary

  • Exempt restricted.googleapis.com (199.36.153.4/30) traffic from Wondershaper upload rate limiting in AoU Workbench apps
  • Exempt Dataproc master-to-worker node traffic from Wondershaper upload rate limiting (wondershaper runs on the master node)
  • All other outbound upload rate limiting remains unchanged
  • Requires security approval before implementation (Jira: PHP-148769)

Problem

Wondershaper currently rate-limits all upload traffic from the master node, regardless of destination. This includes:

  1. Traffic to restricted.googleapis.com — the VPC Service Controls restricted VIP used for all Google API access (GCS, BigQuery, etc.) within the security perimeter
  2. Master-to-worker node traffic — Dataproc cluster communication for task distribution, shuffle data, and health checks

This causes:

  • Hail is nearly unusable — jobs that should take minutes time out or take hours; intermediate GCS writes via restricted.googleapis.com are throttled, causing cascading pipeline failures. Hail is a core genomics analysis tool and a primary reason researchers use the platform.
  • Dataproc is severely degraded — master-to-worker communication is throttled, causing slow job distribution, unpredictable autoscaling, shuffle bottlenecks, and frequent job failures.
  • File uploads to GCS are extremely slow — uploading datasets in the tens/hundreds of GB range (normal for genomic data) from the app to the researcher's own bucket is impractical.
  • Overall platform trust is eroding — researchers perceive the Workbench as unreliable; features that are advertised do not function as expected.

Proposed Changes

1. Exempt restricted.googleapis.com (199.36.153.4/30)

Bypass the upload rate limit for traffic destined to the restricted Google API VIP. This is the only Google API endpoint reachable from within the VPC Service Perimeter.
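
As a quick check of what this exemption covers, the sketch below uses Python's standard `ipaddress` module to list the addresses inside 199.36.153.4/30. The helper name `is_restricted_vip` is illustrative and is not part of this PR.

```python
import ipaddress

# CIDR block for restricted.googleapis.com, as stated in this PR.
RESTRICTED_VIP = ipaddress.ip_network("199.36.153.4/30")


def is_restricted_vip(addr: str) -> bool:
    """Return True if addr falls inside the restricted VIP range."""
    return ipaddress.ip_address(addr) in RESTRICTED_VIP


if __name__ == "__main__":
    # A /30 contains exactly four addresses.
    for ip in RESTRICTED_VIP:
        print(ip)
    # An address just outside the block is not covered by the exemption.
    print(is_restricted_vip("199.36.153.8"))
```

Because the block is only four addresses wide, a destination-based exemption for it cannot accidentally cover other destinations.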

2. Exempt master-to-worker node traffic

Bypass the upload rate limit for traffic from the Dataproc master node (where wondershaper runs) to worker nodes on the same VPC subnet. This is internal cluster communication, not egress.
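
Taken together, the two proposed exemptions amount to a destination-based decision. The sketch below models that decision in Python. The worker subnet `10.128.0.0/20` and the function name `upload_policy` are assumptions made for illustration; the real subnet depends on the cluster's VPC configuration.

```python
import ipaddress

# Restricted VIP range from this PR.
RESTRICTED_VIP = ipaddress.ip_network("199.36.153.4/30")
# Hypothetical worker subnet, chosen only for illustration.
WORKER_SUBNET = ipaddress.ip_network("10.128.0.0/20")


def upload_policy(dst: str) -> str:
    """Classify an outbound destination under the proposed rules."""
    addr = ipaddress.ip_address(dst)
    if addr in RESTRICTED_VIP:
        return "exempt:restricted-vip"
    if addr in WORKER_SUBNET:
        return "exempt:worker-subnet"
    # Everything else keeps the existing Wondershaper upload limit.
    return "rate-limited"


if __name__ == "__main__":
    for dst in ("199.36.153.6", "10.128.0.17", "8.8.8.8"):
        print(dst, upload_policy(dst))
```

Any destination that matches neither range falls through to the existing limit, which mirrors the statement that all other outbound upload rate limiting remains unchanged.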

Why This Is Low Risk

restricted.googleapis.com

  • VPC Service Perimeter already blocks cross-perimeter exfiltration. All Google API calls via restricted.googleapis.com are scoped to the service perimeter. A user cannot access any Google Cloud resource outside the perimeter — the request is denied at the platform level regardless of bandwidth.
  • restricted.googleapis.com is the most locked-down VIP. Unlike private.googleapis.com, the restricted VIP only allows access to APIs that are supported by VPC Service Controls. This is the VIP specifically designed for high-security environments.
  • Rate limiting is a redundant, weaker control. It only slows exfiltration; VPC SC denies it entirely. We are removing a weak control that duplicates a stronger one.
  • IAM and OAuth scopes remain enforced. No new permissions or access are granted.
  • Cloud Audit Logs remain active. Full visibility into all Google API activity is unchanged.

Master-to-worker traffic

  • This is internal cluster communication, not internet egress. Traffic between Dataproc master and worker nodes stays within the VPC — it never leaves the network boundary.
  • Worker nodes are within the same VPC Service Perimeter. No data crosses a trust boundary.
  • Wondershaper was never intended to rate-limit intra-cluster traffic. Its purpose is to guard against data exfiltration to the public internet.

Defense-in-depth summary

| Threat | Control | Status After This Change |
| --- | --- | --- |
| Data exfiltration to external Google Cloud resources | VPC Service Perimeter (hard block) | Unchanged |
| Data exfiltration to non-Google internet | Wondershaper rate limit | Unchanged |
| Unauthorized Google API actions | IAM + OAuth scopes | Unchanged |
| Undetected API misuse | Cloud Audit Logs + monitoring | Unchanged |
| Lateral movement via Google APIs | VPC SC + IAM | Unchanged |
| Intra-cluster data movement | VPC network controls | Unchanged |

Test plan

  • Obtain security approval
  • Add tc filter rules to exempt 199.36.153.4/30 from upload rate limiting
  • Add tc filter rules to exempt worker node subnet traffic from upload rate limiting
  • Verify Hail jobs complete at expected speed
  • Verify Dataproc master-worker communication is unthrottled
  • Verify GCS uploads via restricted.googleapis.com perform at expected speed
  • Verify non-Google upload rate limiting still functions
  • Validate rollback by redeploying previous Wondershaper image
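
The `tc filter` steps in the test plan could look roughly like the commands generated below. This is a sketch under several assumptions that this PR does not confirm: that the deployed Wondershaper variant shapes upload with an HTB root qdisc using handle `1:`, that class ID `1:99` is unused, that `eth0` is the shaped interface, and that a 10 Gbit/s class is effectively unthrottled. The function name `exemption_commands` is hypothetical. The script only builds and prints the commands; it does not run them.

```python
import shlex


def exemption_commands(iface, cidrs, parent="1:", classid="1:99", rate="10gbit"):
    """Build tc commands that steer traffic for the given CIDRs into a fast class.

    All default values are assumptions about the shaping setup, not facts
    taken from the Wondershaper image used by the AoU apps.
    """
    # One high-rate HTB class attached directly to the root qdisc.
    cmds = [[
        "tc", "class", "add", "dev", iface, "parent", parent,
        "classid", classid, "htb", "rate", rate,
    ]]
    # One u32 filter per destination range, pointing at that class.
    for cidr in cidrs:
        cmds.append([
            "tc", "filter", "add", "dev", iface, "parent", parent,
            "protocol", "ip", "prio", "1",
            "u32", "match", "ip", "dst", cidr, "flowid", classid,
        ])
    return cmds


if __name__ == "__main__":
    for cmd in exemption_commands("eth0", ["199.36.153.4/30"]):
        print(shlex.join(cmd))
```

Filters with a lower `prio` value are consulted first, so the exemption filters need a priority that places them ahead of whatever filters Wondershaper installs. The correct value depends on the deployed script and should be checked against the output of `tc filter show` on a running node.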

🤖 Generated with Claude Code

github-actions and others added 2 commits May 7, 2026 10:23
…ogle APIs

Placeholder commit for tracking the security approval process to exempt
Google API traffic from the Wondershaper upload rate limit in AoU apps.
Implementation will follow once approval is granted.

Jira: PHP-148769

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds the aou-sas app with build-time setup in the Dockerfile (packages,
gcsfuse, gcloud SDK, user creation, SAS config, Apache proxy), a runtime
startup script for volume-dependent steps, Mikey Secrets integration for
SAS license delivery, and removes initializeCommand from devcontainers.

Jira: PHP-148769

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yonghaoy yonghaoy requested review from a team as code owners May 7, 2026 14:34
github-actions and others added 2 commits May 7, 2026 15:20
…shaper rate limit

Add a wrapper entrypoint script that runs alongside wondershaper and
adds tc filter exemptions for:
- restricted.googleapis.com (199.36.153.4/30): already guarded by VPC
  Service Perimeter, rate limiting is redundant
- Internal VPC traffic (10.0.0.0/8): covers Dataproc master-to-worker
  communication, which is intra-cluster and not internet egress

Applied to all 5 AoU apps via a shared script in aou-common.

Jira: PHP-148769

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yonghaoy yonghaoy closed this May 8, 2026