test: mitigate e2e simulator hang / retry flakes #9057
Summary of Changes
This pull request introduces a series of stability improvements for end-to-end testing, specifically targeting flaky simulator behavior and WebSocket connection issues. By enhancing the recovery logic for both the test runner and the underlying iOS simulator, the changes aim to reduce CI noise caused by transient infrastructure failures.
Code Review
This pull request introduces a patch to mocha-remote-server to handle transient client disconnects with a reconnect grace timer, and updates the E2E test suite to reboot the iOS simulator upon retryable launch failures. Feedback highlights a critical race condition in the server patch where a restarted client process may hang waiting for a run command that is never sent. Additionally, the synchronous reboot of the iOS simulator blocks the Node.js event loop, potentially freezing the WebSocket server; it is recommended to refactor this to be asynchronous and properly awaited.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #9057 +/- ##
============================================
+ Coverage 60.92% 62.23% +1.31%
============================================
Files 457 351 -106
Lines 33665 23396 -10269
Branches 5479 3978 -1501
============================================
- Hits 20508 14558 -5950
+ Misses 12026 8361 -3665
+ Partials 1131 477 -654
897a66b to 98c2229
Cap Gradle workers at min(physical_cpus, 6) to limit parallel heap pressure; 5GB daemon heap handles peak packageDebugAndroidTest load. Scales with hardware without overwhelming low-core CI machines.
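The cap described in the commit above can be sketched as a small pure function. This is a minimal illustration, not the PR's actual script: the function name is hypothetical, and only the `min(physical_cpus, 6)` rule comes from the commit message.

```typescript
// Hypothetical helper; only the min(physical_cpus, 6) rule is from the commit message.
const MAX_GRADLE_WORKERS = 6;

function gradleWorkerCap(physicalCpus: number): number {
  // Assumption: a failed CPU probe (0, NaN) should still allow one worker.
  const cpus = isFinite(physicalCpus) ? Math.floor(physicalCpus) : 1;
  return Math.max(1, Math.min(cpus, MAX_GRADLE_WORKERS));
}
```

The result could then be passed to Gradle, for example through its `--max-workers` option. How the PR wires the value in is not shown here.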
…ilures Run modular getSessionId probes before all other analytics tests; drop namespace getSessionId coverage to avoid cross-test session interference.
Inline the CI environment variable at Metro bundle time so Jet tests on device see the CI runner flag instead of an undefined process.env lookup.
Increase Jet reconnectGraceMs from 15s to 30s so transient WS 1006/1001 drops can recover before fatal exit during long debug+coverage runs.
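To illustrate why the longer grace window matters, here is a rough sketch of the decision a grace timer makes. All names are hypothetical. Only the 30s default and the 1006/1001 close codes come from the commit message, and treating every other close code as fatal is an assumption of this sketch.

```typescript
// Hypothetical names; the 30s default and codes 1001/1006 are from the commit message.
const DEFAULT_RECONNECT_GRACE_MS = 30000;
const TRANSIENT_CLOSE_CODES = [1001, 1006];

type DisconnectVerdict = "wait-for-reconnect" | "fatal";

function classifyDisconnect(
  closeCode: number,
  msSinceDisconnect: number,
  graceMs: number = DEFAULT_RECONNECT_GRACE_MS
): DisconnectVerdict {
  // Assumption: any close code outside the transient set ends the run immediately.
  if (TRANSIENT_CLOSE_CODES.indexOf(closeCode) === -1) return "fatal";
  // Inside the grace window the server keeps waiting for the client to come back.
  return msSinceDisconnect < graceMs ? "wait-for-reconnect" : "fatal";
}
```

Under these assumptions, a client that takes 20s to reconnect after a 1006 drop would have been fatal with the old 15s window and survives with the 30s one.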
Poll simctl boot state up to 120s before simctl install to avoid LaunchServices races when rebooting between Jet retry attempts.
Log loadavg/memory on transient disconnect, proactively pull coverage after reconnect, and default reconnect grace to 30s in the Jet patch.
Send pull-coverage when mocha-remote client reconnects mid-run and log coverage-ready receipt; align server reconnect grace default to 30s.
Ping keepalive on connect, log send readyState failures, and retry coverage-ready upload up to 3 times with backoff after reconnect.
Snapshot load, top, and e2e-related process stats every 10s into resource-monitor.log for correlating flakes with CPU/memory pressure.
Collate jet-ws, rnfb-e2e, lifecycle, and launch markers from CI logs into flake-summary.txt for faster post-run triage.
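A collation step like the one in the commit above might look roughly like this. The marker names come from the commit message. Matching by plain substring and the section header format are assumptions, and the function name is hypothetical.

```typescript
// Marker names are from the commit message; substring matching is an assumption.
const FLAKE_MARKERS = ["jet-ws", "rnfb-e2e", "lifecycle", "launch"];

function collateFlakeSummary(logText: string, markers: string[] = FLAKE_MARKERS): string {
  const lines = logText.split(/\r?\n/);
  const sections: string[] = [];
  for (const marker of markers) {
    // Keep every log line mentioning the marker, in original order.
    const hits = lines.filter((line) => line.indexOf(marker) !== -1);
    sections.push(`== ${marker} (${hits.length}) ==`);
    for (const hit of hits) sections.push(hit);
  }
  return sections.join("\n");
}
```

Grouping by marker rather than by timestamp keeps all lines for one failure class together, which suits quick triage. A chronological view would need the original log.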
Stream testing/SpringBoard logs, run resource monitor, tee Detox output, write flake summary, and upload new diagnostic artifacts on failure.
Match FrontBoard/FBSOpenApplication launch errors and treat coverage teardown WebSocket failures as retryable Jet session failures.
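The launch-error half of this commit amounts to a pattern classifier. A minimal sketch follows, assuming substring-style regular expressions on the error message. The function name is hypothetical and the exact patterns used in the PR are not shown on this page. The WebSocket teardown case is not covered.

```typescript
// Pattern names are from the commit message; the exact regexes are assumptions.
const RETRYABLE_LAUNCH_PATTERNS: RegExp[] = [/FrontBoard/, /FBSOpenApplication/];

function isRetryableLaunchError(message: string): boolean {
  // Retry when any known launch-infrastructure pattern appears in the message.
  return RETRYABLE_LAUNCH_PATTERNS.some((pattern) => pattern.test(message));
}
```

For example, a message mentioning `FBSOpenApplicationServiceErrorDomain` would be classified as retryable, while an ordinary test assertion failure would not.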
Dump get_app_container/listapps before and after each launch attempt and log the Detox failure reason when launchAppWithRetry gives up.
Time terminateApp during launch retries and reboot the simulator when terminate exceeds RNFB_SLOW_TERMINATE_MS before relaunching.
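The slow-terminate check can be sketched as three small pieces: timing the call, reading the threshold, and deciding whether to reboot. `RNFB_SLOW_TERMINATE_MS` is named in the commit message. Everything else here (function names, the injected clock, the fallback value chosen by the caller) is hypothetical.

```typescript
// Measure how long a synchronous step takes, using an injected clock for testability.
function timeCall(fn: () => void, now: () => number): number {
  const start = now();
  fn();
  return now() - start;
}

// Parse a threshold such as the RNFB_SLOW_TERMINATE_MS value.
// Assumption: unset, empty, non-numeric, or non-positive values use the fallback.
function parseThresholdMs(raw: string | undefined, fallbackMs: number): number {
  const parsed = Number(raw);
  const usable = raw !== undefined && raw !== "" && isFinite(parsed) && parsed > 0;
  return usable ? parsed : fallbackMs;
}

// Reboot only when terminate strictly exceeds the threshold, per the commit wording.
function shouldRebootSimulator(elapsedMs: number, thresholdMs: number): boolean {
  return elapsedMs > thresholdMs;
}
```

In a real runner the terminate call is asynchronous, so the timing would wrap an awaited call instead. The synchronous form just keeps the sketch short.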
Use shorter release launch timeout, skip delete on inner retry, and log liveMetro/delete flags to distinguish release stalls from Metro issues.
Mark exhausted inner launch retries as Jet-retryable so debug FrontBoard failures get a full simulator reboot instead of a terminal false.
Emit structured retry-eligibility checks on Jet attempt failure so CI logs show which sub-condition blocked or allowed the second attempt.
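One possible shape for such a structured check is sketched below. The sub-condition names, the event name, and the JSON layout are all hypothetical. The commit message only states that each sub-condition is logged so CI output shows what blocked or allowed the second attempt.

```typescript
// Hypothetical shapes; the PR's actual sub-conditions are not listed on this page.
interface RetryCheck {
  name: string;
  passed: boolean;
}

interface RetryDecision {
  eligible: boolean;
  blockedBy: string[];
  logLine: string;
}

function evaluateRetryEligibility(attempt: number, checks: RetryCheck[]): RetryDecision {
  // Every check must pass; failing checks are named so the log explains the outcome.
  const blockedBy = checks.filter((c) => !c.passed).map((c) => c.name);
  const eligible = blockedBy.length === 0;
  // One JSON object per decision keeps the line easy to grep and parse in CI logs.
  const logLine = JSON.stringify({ event: "retry-eligibility", attempt, eligible, blockedBy, checks });
  return { eligible, blockedBy, logLine };
}
```

Emitting the full check list, not just the verdict, is what makes a blocked retry diagnosable after the fact.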
Update OKF bundle with new artifacts, boot-simulator shutdown wait, Jet WS/coverage handshake mitigations, FrontBoard launch flakes, and local stress iteration guidance.
ddf48bb to b7c94a6
b7c94a6 to f9b005f
Host-orchestrated Tart VMs with detached iteration, session-scoped artifacts, virtiofs completion polling, and optional SCP harvest (--no-sync-artifacts).
f9b005f to 7aa8d80
Snapshot host and guest loadavg during Detox runs, upload the log as a CI artifact, and include it in flake-summary triage.
Drop bootanim gate (CI uses -no-boot-anim). After adb reboot, wait for boot_completed, package handler queue, and guest loadavg below 5 before starting Jet attempt 2.
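The three gates in the commit above can be expressed as one readiness predicate over already-collected probe values. This is only a sketch. The interface and function names are hypothetical, the probes are assumed to be gathered elsewhere (for instance `boot_completed` read via `adb shell getprop sys.boot_completed`, which is itself an assumption about which property is used), and only the "loadavg below 5" limit comes from the commit message.

```typescript
// The "below 5" limit is from the commit message; everything else is hypothetical.
const MAX_GUEST_LOADAVG = 5;

interface GuestProbe {
  bootCompleted: string;      // raw property output, expected to be "1" once booted
  packageQueueIdle: boolean;  // how the package handler queue is probed is not shown here
  loadavg1m: number;          // guest 1-minute load average
}

// Returns the gates still blocking Jet attempt 2; an empty array means ready.
function gatesBlockingAttempt2(probe: GuestProbe): string[] {
  const waitingOn: string[] = [];
  if (probe.bootCompleted.trim() !== "1") waitingOn.push("boot_completed");
  if (!probe.packageQueueIdle) waitingOn.push("package-handler-queue");
  // Written as a negated "<" so a NaN load reading also counts as not ready.
  if (!(probe.loadavg1m < MAX_GUEST_LOADAVG)) waitingOn.push("loadavg");
  return waitingOn;
}
```

Returning the list of unmet gates, not a bare boolean, lets a polling loop log what it is still waiting on.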
Await orchestration teardown, stop Jet, and force-stop the app before adb reboot so attempt 2 does not race with attempt 1 instrumentation.
Summary
This PR collects an ongoing series of e2e flake fixes.
Test plan