
Boyuc/observatory draft demo#19288

Draft
quic-boyuc wants to merge 65 commits into pytorch:main from CodeLinaro:boyuc/observatory_draft_demo


@quic-boyuc
Contributor

[RFC Draft] Observatory: a unified debugging framework for ExecuTorch

RFC: rfc_concise.md · reference.md

This is the reference POC accompanying the Observatory RFC — draft PR opened so reviewers can read the RFC and run the code side by side.

Summary

Observatory turns per-backend debugging scripts into one shared flow: each backend contributes lenses (Python extensions that implement debugging logic end-to-end — capture, analyze, render); the framework handles the session lifecycle, report assembly, and the outputs. fx_viewer is a standalone, dependency-free FX-graph renderer used as Observatory's graph view, also usable by anyone with a torch.fx graph.

Design rationale, architecture, and extension patterns are in the RFC. This PR is what the design looks like when you build it.

Relationship to the RFC

  • The RFC describes the ideal design. It treats APIs (json_frontend, --compare) and features (Analyzed Report JSON, cross-time regression) as first-class design elements even where the code hasn't caught up yet. RFC §8 is the canonical reference for what's implemented vs proposed.
  • This PR implements a subset. What ships here is enough to demo the full end-to-end flow on Qualcomm and XNNPACK with per-layer accuracy analysis. The not-yet-implemented pieces are tracked as follow-up work, not blockers for reviewing the design.
  • We welcome design-level feedback in this PR. If you disagree with the Lens protocol shape, the runtime/analyzed split (RFC §4.5), or the two-frontend plan, surface it here — easier to change the design before the proposed pieces land.

What's in this PR (current demo scope)

  • Core framework: devtools/observatory/
    • observatory.py — session lifecycle, nested config stack, capture store, report assembly
    • interfaces.py — Lens protocol, typed frontend block contracts
    • graph_hub.py — base graph + analyze-phase overlay merge; fx_viewer bridge
    • cli.py — generic CLI with collect and visualize modes
    • observe_pass.py (@observe_pass decorator)
  • FX viewer: devtools/fx_viewer/ — FX extraction, Sugiyama layout, extension-layer API, canvas-based JS runtime
  • Seven common lenses under devtools/observatory/lenses/: graph, metadata, accuracy, per_layer_accuracy, stack_trace, pipeline_graph_collector, graph_color
  • Backend CLIs as worked examples:
    • backends/qualcomm/debugger/observatory/
    • backends/xnnpack/debugger/observatory/
  • Invocation surfaces: generic CLI, backend CLI, Observatory.enable_context(...) context manager, @observe_pass decorator, direct Observatory.collect(name, artifact)
  • Exports:
    • HTML Report (self-contained, for reviewers)
    • Raw Capture (JSON) (export_json + visualize reload path)

What's in the RFC but NOT in this PR

See RFC §8 for the full list with rationale. Headliners:

  • json_frontend + Analyzed Report (JSON) — the second frontend hook for LLM triage, CI analytics, and dashboards.
  • --compare CLI mode — cross-time regression over archived Raw Captures.
  • Runtime / delegated-graph accuracy lens — port of qnn_intermediate_debugger.py.
  • Additional lenses — partition color layer, qparams audit, .pte diff, size analysis, ETDump-fed runtime lenses, ADB capture.
  • Non-FX graph formats (PyTorch graph, QNN graph, TOSA) as first-class fx_viewer exporters.
  • Nightly-regression CI recipe packaging the --compare flow.
  • Live debugging dashboard built on fx_viewer for streaming event use cases.

Each item has a natural landing point in the Lens protocol or CLI; no breaking changes to what ships here.

How to try it

pip3 install 'fast-sugiyama[full]'   # requires python >= 3.11

# XNNPACK — per-layer accuracy demo, zero code change
python -m executorch.backends.xnnpack.debugger.observatory \
    --output-html /tmp/mv2/obs_report.html \
    --lens_recipe=accuracy \
    examples/xnnpack/aot_compiler.py \
    --model_name=mv2 --delegate --quantize --output_dir /tmp/mv2

# Qualcomm — same pattern
python -m executorch.backends.qualcomm.debugger.observatory \
    --output-html obs_report.html \
    --lens_recipe=accuracy \
    examples/qualcomm/oss_scripts/mobilevit_v2.py \
    --backend htp --model SM8650 -d ./imagenet-mini-val/ \
    -b build-android/ --compile_only

# Reload a saved Raw Capture into a fresh HTML (uses current lens code)
python -m executorch.devtools.observatory visualize \
    --input-json run.json --output-html run.html

Pre-generated HTML reports on a matrix of models are linked from the demo index (see RFC §3).

Review guidance

Suggested reading order for reviewers:

1. RFC §1–§4 — problem, demo, and architecture in one sitting.
2. devtools/observatory/interfaces.py — the Lens protocol as actually coded.
3. devtools/observatory/observatory.py — how the Core drives the protocol.
4. A shipped lens (suggested: lenses/per_layer_accuracy.py) — the protocol exercised end-to-end.
5. One backend contribution (backends/qualcomm/debugger/observatory/) — what a backend plugs in.

On a first pass, the generated JS/CSS under devtools/fx_viewer/templates/ can be treated as an implementation detail — the Python surface is where the design lives.

Test plan

Unit tests under devtools/observatory/tests/ pass
End-to-end XNNPACK MV2 run produces expected HTML report
End-to-end Qualcomm MobileViT v2 run produces expected HTML report
visualize reload produces HTML equivalent to the original run
Lint / typecheck
Pre-generated reports on model matrix spot-checked manually

@pytorch-bot

pytorch-bot Bot commented May 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19288

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

⚠️ 11 Awaiting Approval

As of commit 1dd5421 with merge base 0a113f8 (image):

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 5, 2026
@github-actions

github-actions Bot commented May 5, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@quic-boyuc quic-boyuc force-pushed the boyuc/observatory_draft_demo branch from 53faf3c to 2f5a882 Compare May 5, 2026 04:33
quic-boyuc and others added 26 commits May 5, 2026 12:34
- Use python as clean API and function arguments
- Fix bug in html_template.py (\n --> \\n)
…otstrap

Three fixes for HTML report correctness and size:

1. Base64-encode HtmlBlock.content in the JSON payload so </script> and
   other special characters cannot corrupt the outer <script> tag.
   The JS runtime decodes with atob() before innerHTML assignment
   (03_blocks.js, renderHtmlCompare).

2. Gzip+base64 compress the full JSON payload when it exceeds 8 KB
   (observatory.py _compress_payload). The browser decompresses via
   DecompressionStream inside an async IIFE, which also moves the
   Observatory runtime execution to after the FX viewer bundle is
   injected — fixing the "FXGraphViewer unavailable" race condition
   that existed when Script 3 ran before Script 2 finished awaiting
   (html_template.py).

3. Base64-encode resources.js[] entries in generate_ui_test_harness.py
   so the test harness goes through the same pipeline as production
   reports instead of bypassing it.
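
The encoding steps in fixes 1 and 2 can be sketched on the Python side as below. This is a minimal illustration, not the code in observatory.py: the envelope keys (`encoding`, `data`) and the function names are assumptions made for the example, and only the 8 KB threshold and the gzip+base64 combination come from the commit message.

```python
import base64
import gzip
import json

COMPRESS_THRESHOLD_BYTES = 8 * 1024  # the 8 KB cutoff named in the commit message


def encode_html_block(content: str) -> str:
    """Base64-encode block HTML so a literal '</script>' never reaches the page."""
    return base64.b64encode(content.encode("utf-8")).decode("ascii")


def compress_payload(payload: dict) -> dict:
    """Wrap the payload in a hypothetical envelope; gzip+base64 only when large."""
    raw = json.dumps(payload).encode("utf-8")
    if len(raw) <= COMPRESS_THRESHOLD_BYTES:
        return {"encoding": "json", "data": payload}
    packed = base64.b64encode(gzip.compress(raw)).decode("ascii")
    return {"encoding": "gzip+base64", "data": packed}


def decompress_payload(envelope: dict) -> dict:
    """Inverse of compress_payload; the browser would do this step in JS."""
    if envelope["encoding"] == "json":
        return envelope["data"]
    raw = gzip.decompress(base64.b64decode(envelope["data"]))
    return json.loads(raw.decode("utf-8"))
```

Because base64 output uses only letters, digits, `+`, `/`, and `=`, the encoded string cannot contain `<` and therefore cannot terminate the enclosing script tag early.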

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs fixed:

1. Wrong initial camera on first load — viewer.init() was called
   synchronously before the browser had laid out the container, so
   getBoundingClientRect() returned {width:0, height:0} and the camera
   was placed far off-screen. Fixed by deferring init() to the next
   animation frame (requestAnimationFrame).

2. All viewer state lost on every record switch — destroyGraphRuntime()
   was unconditionally destroying every live viewer. Camera, selected
   node, active layers, colorBy, and zoom were all reset.

Fix: hybrid viewer cache.

Single-record graph blocks use a live DOM cache keyed by
(recordIndex, lensName, blockId). On navigate-away the wrapper is
detached from the DOM but the viewer stays alive in state.viewerCache.
On return the wrapper is re-appended and a resize rAF is queued — no
re-init, no re-layout, full state preserved. LRU eviction at 10 viewers.

Compare-mode viewers are always freshly created (keeping N side-by-side
viewers alive would multiply memory cost). Instead a lightweight state
snapshot {camera, selectedNodeId, activeExtensions, colorBy} is saved
to state.compareStateCache on every statechange event. On re-entry each
new viewer is seeded from the snapshot: selectNode+animate if a node was
selected, setState({camera}) otherwise.

README section 12 documents the memory budget and trade-off rationale.
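
The single-record cache policy described above (tuple key, reuse on return, LRU eviction at 10) can be sketched as follows. The real cache lives in the JS runtime; this Python version is only an illustration of the policy, and the class name, method names, and `on_evict` callback are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class ViewerCache:
    """LRU cache keyed by (record_index, lens_name, block_id)."""

    def __init__(self, capacity: int = 10,
                 on_evict: Optional[Callable[[Any], None]] = None):
        self._capacity = capacity
        self._on_evict = on_evict  # e.g. a hook that destroys the evicted viewer
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        """Return the cached viewer and mark it most recently used."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key: Hashable, viewer: Any) -> None:
        """Store a viewer, evicting the least recently used one when full."""
        self._entries[key] = viewer
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            _, evicted = self._entries.popitem(last=False)
            if self._on_evict is not None:
                self._on_evict(evicted)

    def __len__(self) -> int:
        return len(self._entries)
```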

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…acy lenses

Introduces a `python -m executorch.backends.qualcomm.debugger.observatory.cli`
runner that wraps any standard Qualcomm example script (e.g. swin_v2_t.py)
to automatically enable full Observatory debugging with no script modifications.

Key changes:

PipelineGraphCollectorLens (new):
- Absorbs ETRecordAutoCollector from the deleted auto_collect.py, moving all
  monkey-patching out of the Observatory core framework and into a lens.
- Patches framework-level functions (torch.export.export, prepare_pt2e,
  convert_pt2e, to_edge_transform_and_lower, ETRecord.add_*) to auto-collect
  graph snapshots at each compilation stage: Exported Float, Annotated Model,
  Calibrated Model, Quantized Model, Edge, Transformed Edge, ETRecord records.
- Forces generate_etrecord=True in to_edge_transform_and_lower to ensure
  ETRecord collection fires automatically.
- Framework-level patches work for all backends (QNN, XNNPack, CoreML, etc.).
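
The patch-and-restore pattern this lens relies on can be sketched without any framework dependency. The example below wraps a function on a stand-in object rather than torch.export.export, and `PatchingCollector` with its `install`/`uninstall` methods is a hypothetical name for illustration, not the lens's actual interface.

```python
import functools
from typing import Any, List, Tuple


class PatchingCollector:
    """Wraps functions on a target object and records what they return."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, Any]] = []          # (stage_name, result)
        self._originals: List[Tuple[Any, str, Any]] = []  # kept for uninstall

    def install(self, owner: Any, attr: str, stage_name: str) -> None:
        original = getattr(owner, attr)

        @functools.wraps(original)
        def patched(*args, **kwargs):
            result = original(*args, **kwargs)
            self.records.append((stage_name, result))  # snapshot for this stage
            return result

        self._originals.append((owner, attr, original))
        setattr(owner, attr, patched)

    def uninstall(self) -> None:
        """Restore every patched attribute, most recent patch first."""
        while self._originals:
            owner, attr, original = self._originals.pop()
            setattr(owner, attr, original)
```

Keeping the original callables in a list lets uninstall put the exact same function objects back, so code that runs after the session sees an unmodified framework.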

AccuracyLens (new):
- Ports AccuracyEvaluationLens from legacy debugging_utils to new Observatory
  interfaces (ViewList/TableBlock).
- Adds MaskedTokenAccuracy metric and MLMEvaluator for masked language model
  scripts (bert, roberta, distilbert, albert, eurobert).
- Auto-patches get_imagenet_dataset and get_masked_language_model_dataset to
  capture targets, covering 24 of ~30 standard oss_scripts.
- Auto-detects task type (classification vs MLM), post_process function, and
  default metrics from model output format.

CLI runner (new):
- cli.py + __main__.py: parse observatory flags, register lenses, wrap target
  script in Observatory.enable_context(), run via runpy.run_path(), generate
  HTML + JSON report to {artifact}/observatory_report.{html,json}.
- Flags: --no-accuracy, --no-report, --report-title.

observatory.py:
- Remove ETRecordAutoCollector.install/uninstall calls; patching is now
  delegated to PipelineGraphCollectorLens when registered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…st "Exported Float" record

AccuracyLens was not producing metrics in HTML reports because the evaluator
was configured in a build_executorch_binary POST-hook, but all
Observatory.collect() calls happen DURING that function — before the
evaluator existed.

The fix removes the build_executorch_binary patch entirely and instead
configures the evaluator lazily when AccuracyLens.observe() first sees the
"Exported Float" record:

1. Extract the float model from the ExportedProgram artifact
2. Use captured dataset (from get_imagenet_dataset patch) as primary source,
   or fall back to sample inputs captured by PipelineGraphCollectorLens
3. Auto-detect task type, post_process, and metrics
4. Compute golden outputs and build the evaluator

PipelineGraphCollectorLens now also captures the sample input tuple from
torch.export.export(mod, args, ...) as _last_export_inputs, providing a
fallback dataset for AccuracyLens when dataset loader patches don't fire
(e.g., custom datasets, non-Qualcomm backends).
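
The lazy configuration and the data source fallback can be sketched with a toy lens. Only the "Exported Float" record name and the `_configure_from_float_model` name come from this commit; the class, its constructor arguments, the scalar "models", and the max-absolute-error metric are simplifying assumptions for the example.

```python
from typing import Callable, List, Optional, Sequence


class LazyAccuracyLens:
    """Toy lens: the reference outputs are built on the first float record."""

    def __init__(self, captured_dataset: Optional[Sequence[float]] = None,
                 fallback_inputs: Optional[Sequence[float]] = None) -> None:
        self._captured_dataset = captured_dataset
        self._fallback_inputs = fallback_inputs
        self._dataset: Optional[List[float]] = None
        self._golden: Optional[List[float]] = None

    def _configure_from_float_model(self, float_model: Callable) -> None:
        # Prefer the captured dataset; fall back to the export sample inputs.
        source = self._captured_dataset or self._fallback_inputs
        if not source:
            return
        self._dataset = list(source)
        self._golden = [float_model(x) for x in self._dataset]

    def observe(self, record_name: str, model: Callable) -> Optional[float]:
        if self._golden is None and record_name == "Exported Float":
            self._configure_from_float_model(model)
        if self._golden is None:
            return None  # no reference yet, so nothing to report
        return max(abs(model(x) - g)
                   for x, g in zip(self._dataset, self._golden))
```

Records observed before the float model arrives yield no metrics, which is the behavior the fix trades for removing the post-hook ordering problem.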

Changes:
- accuracy.py: remove _install_build_binary_patch(), remove _captured_model,
  add _configure_from_float_model(), update observe() with lazy init
- pipeline_graph_collector.py: add _last_export_inputs, capture args[1] in
  patched_export, clear on uninstall/clear
- LENSES.md: new reference doc covering all lenses, observation points,
  patching strategy, accuracy lens lazy configuration, data source fallback
  strategy, and custom usage examples

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…screen

Five UI improvements to the Observatory HTML report:

1. Auto-hide sidebar (main.css, 02_layout.js)
   The left index pane is now position:fixed and hidden by default
   (translateX(-100%)). A 12px invisible trigger strip on the left edge
   reveals it on hover; the pane itself stays open while hovered. Main
   content takes the full viewport width — no left margin needed.

2. Auto-hide header (main.css, 02_layout.js)
   The header is now position:fixed and hidden by default
   (translateY(-100%)). A 10px invisible trigger strip at the top edge
   reveals it on hover. Both panels overlay content rather than consuming
   layout space, maximising the graph viewer area.

3. Theme sync Observatory → fx_viewer (04_actions.js, 03_blocks.js)
   setTheme() now propagates the theme to all live mountedViewers via
   viewer.setTheme(theme) and to all mountedCompares by iterating
   compare.viewers. New viewers receive themeName in their initial state
   so they open in the current theme without a separate setTheme call.

4. Compare snap merge-on-write (03_blocks.js)
   Root cause: all viewers in a compare group share one compareStateCache
   entry. Each viewer registers a statechange listener that overwrites the
   snap. When viewer B pans after viewer A selects a node, viewer B's
   statechange fires with selectedNodeId=null and wipes the selection.
   On the next restore both viewers see null → both zoomToFit().

   Fix: the statechange write is now a merge. selectedNodeId is only
   updated when the incoming value is non-null, preserving any selection
   set by any viewer until another viewer explicitly selects a different
   node. camera/activeExtensions/colorBy are always overwritten with the
   latest value.

   Restore priority order (unchanged logic, now actually reachable):
   - node exists in this viewer's graph → selectNode + animate
   - node not found (different record) or no selection → zoomToFit()
   - no snapshot at all → init() default positioning
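
   The merge rule and the restore priority can be sketched as two small functions. The runtime code is JavaScript in 03_blocks.js; this Python version only mirrors the logic described here, and the function names and returned action strings are hypothetical.

```python
from typing import Any, Dict, Optional, Set


def merge_snapshot(cached: Optional[Dict[str, Any]],
                   incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a statechange event into the shared compare snapshot."""
    merged = dict(cached or {})
    # These fields always take the latest value.
    for key in ("camera", "activeExtensions", "colorBy"):
        if key in incoming:
            merged[key] = incoming[key]
    # A null selection from another viewer must not wipe an existing one.
    if incoming.get("selectedNodeId") is not None:
        merged["selectedNodeId"] = incoming["selectedNodeId"]
    merged.setdefault("selectedNodeId", None)
    return merged


def restore_action(snapshot: Optional[Dict[str, Any]],
                   node_ids: Set[str]) -> str:
    """Pick the restore step for one viewer, in the priority order above."""
    if snapshot is None:
        return "init_default"
    selected = snapshot.get("selectedNodeId")
    if selected is not None and selected in node_ids:
        return "select_and_animate"
    return "zoom_to_fit"
```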

5. Fullscreen button — correct API key (03_blocks.js)
   Previous implementation passed ui.controls.fullscreenButton which is
   not a recognised fx_viewer config key. Correct key per
   RFC_FX_VIEWER_API_INTERFACE.md and fx_graph_viewer.js default
   (fullscreen: { enabled: true, button: false }) is
   layout.fullscreen.button. Changed ViewerCtor.create() call to
   layout: { preset, fullscreen: { button: true } }.

README section 12.2 updated to document the merge-on-write behaviour,
the corrected restore priority order (no longer uses setState({camera})
as fallback), and the per-viewer node-existence check semantics.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ple stats, worst-index sharing

## What changed

### 1. Table view cleanup
Remove (diff) entries from the per-record accuracy table.  Diffs already appear
in the left panel index via check_index_diffs() — duplicating them in the table
added noise without value.

### 2. PSNR cap at 100.0 dB
Raw PSNR above 100 dB (e.g. 128 dB for near-zero observer error in the
Annotated Model stage) is not meaningfully different from perfect match and
produced confusing display.  PSNR.MAX_PSNR = 100.0 gives a uniform ceiling:
perfect match → 100.0, real quantization degradation → actual dB below 100.

### 3. Redesigned Metric base class
Replace the Protocol stub with a proper base class:

  class Metric:
      higher_is_better: bool = True   # controls worst-case direction
      def calculate_per_sample(self, predictions) -> List[float]: ...
      def calculate(self, predictions) -> float: ...      # mean of per-sample
      def worst_index(self, per_sample) -> int: ...       # argmin or argmax

higher_is_better encodes each metric's direction knowledge:
  True  → worst = argmin  (PSNR, cosine_sim, TopK — lower is worse)
  False → worst = argmax  (MSE, AbsErr — higher is worse)

All existing metrics (PSNR, CosineSimilarity, TopKAccuracy, MaskedTokenAccuracy)
are refactored to implement calculate_per_sample() instead of calculate().
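
One way the base class could be filled in is sketched below. The method names and `higher_is_better` follow the outline above; the shape of `predictions` (a sequence of (output, golden) pairs) and the `MSE` body are assumptions for the example, not the code in accuracy.py.

```python
from typing import List, Sequence, Tuple

# Hypothetical prediction shape: one (output, golden) pair per sample.
Prediction = Tuple[Sequence[float], Sequence[float]]


class Metric:
    higher_is_better: bool = True  # controls worst-case direction

    def calculate_per_sample(self, predictions: Sequence[Prediction]) -> List[float]:
        raise NotImplementedError

    def calculate(self, predictions: Sequence[Prediction]) -> float:
        """Mean of the per-sample values."""
        per_sample = self.calculate_per_sample(predictions)
        return sum(per_sample) / len(per_sample)

    def worst_index(self, per_sample: Sequence[float]) -> int:
        """argmin when higher is better, argmax otherwise."""
        pick = min if self.higher_is_better else max
        return pick(range(len(per_sample)), key=per_sample.__getitem__)


class MSE(Metric):
    higher_is_better = False  # larger error is worse

    def calculate_per_sample(self, predictions: Sequence[Prediction]) -> List[float]:
        return [
            sum((o - g) ** 2 for o, g in zip(out, gold)) / len(gold)
            for out, gold in predictions
        ]
```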

### 4. New metrics: MSE and AbsErr
Both demonstrate higher_is_better=False and are added to the default evaluator
alongside PSNR and CosineSimilarity whenever golden outputs are available.

### 5. Per-sample statistics in Evaluator.evaluate()
When dataset has >1 sample, each metric emits three additional keys:
  {name}_min        — best sample value
  {name}_max        — worst sample value (in the metric's own direction)
  {name}_worst_idx  — dataset index of the worst-performing sample

Single-sample datasets emit only the primary mean value (no _min/_max/_worst_idx)
to keep the digest clean for the common fallback case.

### 6. Cross-lens worst-index sharing via AccuracyLens._worst_indices
AccuracyLens now maintains a class-level dict:
  _worst_indices: Dict[str, int]  # {metric_name: dataset_index}

Updated after every evaluate() call, cleared on session end / Observatory.clear().
Future lenses (e.g. per-layer accuracy analysis) read it during their own
observe() without re-running inference:

  from .accuracy import AccuracyLens
  worst = AccuracyLens._worst_indices.get("psnr")  # int or None

This follows the same pattern as PipelineGraphCollectorLens._last_export_inputs.
AccuracyLens must be registered before any lens that reads _worst_indices.

### 7. Frontend: second table for worst indices
_AccuracyFrontend.record() now emits two TableBlocks:
  - "Accuracy" (order=20): all metric values + min/max, no worst_idx keys
  - "Worst Input Index (per metric)" (order=21): stripped worst_idx values,
    only shown when dataset has >1 sample

check_index_diffs() extended with "mse" and "abs_err" keys.

### 8. LENSES.md updated
Documents: PSNR cap rationale, higher_is_better table, per-sample statistics
contract, _worst_indices cross-lens sharing pattern with code example, updated
expected metric behavior table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Compare view redesign (FXGraphCompare + FXCompareTaskbar)

The previous implementation had three interacting layout bugs:
1. FXCompareTaskbar injected its div as a child of the CSS grid container,
   making it a grid item and pushing viewer columns into wrong positions.
2. sharedTaskbar.enabled defaulted to true, activating the taskbar on every
   existing FXGraphCompare.create() call including JS08 and JS99.
3. layout.tiled defaulted to true, restructuring viewer DOM unconditionally.

The compare view is now redesigned around a single owned DOM shell:

  layout.container
    .fx-compare-root          (flex column — created by FXGraphCompare)
      .fx-compare-taskbar     (optional — only when sharedTaskbar.enabled)
      .fx-compare-grid        (CSS grid, repeat(N, 1fr) columns)
        .fx-compare-col       (one per viewer, flex column)
          .fx-compare-col-header
          .fx-compare-minimap-row   (fixed height — uniform across all cols)
            viewer.minimapRenderer.container  (moved here)
          .fx-compare-canvas-row    (flex:1 — uniform across all cols)
            viewer.mainArea         (moved here)
      .fx-compare-info-bar    (single shared merged info panel)

Key design decisions:
- FXGraphCompare owns the compare DOM entirely. It moves viewer.mainArea and
  viewer.minimapRenderer.container into compare columns and hides viewer.wrapper.
  DOM snapshots (parent + nextSibling) are recorded before any move; destroy()
  calls _teardownCompareDOM() to restore every element to its original position.
- Uniform row heights are guaranteed structurally: all minimap rows are the same
  fixed height (layout.minimapHeight, default 180px); all canvas rows share
  flex:1 in the same flex column, so they expand to identical remaining space.
  No per-column height negotiation needed.
- sharedTaskbar.enabled defaults to false (opt-in). Existing callers are
  unaffected. FXCompareTaskbar prepends to .fx-compare-root, not to the user's
  container, so it is never a grid item.
- Tiled layout is the only compare layout. setTiled() and setCompact() are
  no-ops kept for backward compatibility. applyTiledLayout() removed from
  FXGraphViewer. All .fx-tiled CSS removed.
- A ResizeObserver on each .fx-compare-canvas-row keeps canvases sized
  correctly on window resize and column count changes.
- Merged info panel (_updateMergedInfo) renders a diff table into
  .fx-compare-info-bar after selection sync. Rows where values differ across
  graphs are highlighted amber (.fx-diff).
- Selection sync supports three modes: 'none', 'id' (default), 'layer'
  (match by extensions[layer].nodes[id].info[field] value; topologically last
  on ties).

## UI fixes (Issues 1–5)

Issue 1 — Button icon conflicts:
- Zoom to Fit icon changed from ⛶ (U+26F6) to ⤢ (U+2922, NORTH EAST AND
  SOUTH WEST ARROW) to distinguish it from the fullscreen button.
- Clear Selection button removed entirely — redundant (search clears on empty
  input, canvas click deselects) and its ✖ icon conflicted with fullscreen
  exit ✕.

Issue 2 — Split layout proportions + canvas min-width:
- Minimap default height changed from 500px to 240px (was 1:1 with info panel;
  now ~1:2 ratio, info panel takes remaining flex:1 space).
- .fx-main-area gets min-width: 60% so sidebar cannot consume most of the
  wrapper.
- Sidebar width cap in setupResizer() changed from containerRect.width - 200
  to containerRect.width * 0.4, enforcing canvas always gets ≥60% of width.

Issue 3 — Minimap fails to render in custom div / observatory:
- Added ResizeObserver on MinimapRenderer.container. When the container
  transitions from 0 to non-zero size (collapsed section expanded, custom div
  appended), calls resize() + generateThumbnail() + render(). Handles all
  deferred-visibility cases without polling.

Issue 4 — Tiled compare mode (see redesign above).

Issue 5 — ADV02 info panel height instability in slot mode:
- When slots.info is set, FXGraphViewer now applies overflow:hidden and
  min-height:0 to the slot element, and height:100%/overflow-y:auto to the
  info panel, preventing grid row expansion on content growth.
- ADV02 HTML updated: min-height:0 added to right-column grid container and
  slot divs.

## Test case updates (harness_testcases.py)

- JS08: removed compact/sync-theme controls (now handled by compare API);
  updated to new FXGraphCompare API (sync.mode:'id'); uses hidden mount divs
  so FXGraphCompare can build its own DOM shell.
- ADV04: new test case demonstrating compare with shared taskbar, sync by ID,
  and merged info panel.
- JS99: updated mount IDs, removed stale c99_sync_theme/c99_compact handlers,
  theme change now applies to both viewers.

## Documentation

- README.md: replaced stale Compare section with full architecture reference
  including DOM structure, ownership rules, lifecycle, interaction control
  table, sync modes, and merged info panel description.
- templates/README.md: added Compare View DOM and Ownership section with DOM
  tree, ownership rules, resize handling, and interaction ownership table.
- RFC_FX_VIEWER_API_INTERFACE.md: updated Compare API section with new config
  shape and semantics.
- examples/FX_VIEWER_API_TESTCASES.md: added ADV04 test case description.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reduce roberta/vit data sizes

Authored with Claude
… dedup, compare.js split, themes→runtime

5 independently scoped changes:

1. Move verbose JSDoc narrative blocks (USE CASES / ALGORITHM / UX) from 6 JS
   files to templates/README.md under a new Runtime Internals section. Each class
   now has a single-line description. ~390 lines removed from JS.

2. Add fxEsc() HTML-escaping helper to runtime.js. Apply it to all innerHTML
   construction sites in ui_manager.js (rebuildLayersMenu, renderLegend,
   updateInfoPanel, updateEdgeInfoPanel) and fx_graph_viewer.js
   (_rebuildLayersMenu, _rebuildSyncSelect, _updateMergedInfo). Search highlight
   spans are intentionally left as raw HTML with a comment.

3. Add GraphDataStore.computeBoundsForNodes(nodeIds) and two private helpers
   _collect2HopNeighbors / _collectEdgeNeighbors on ViewerController. Refactor
   zoomToFit() to use them, eliminating two identical 15-line bounds loops.

4. Split fx_graph_viewer.js: move FXCompareTaskbar + FXGraphCompare to new
   templates/compare.js (loads after fx_graph_viewer.js). Update ordered_files
   in exporter.py and generate_api_test_harness.py.

5. Rename themes.js → runtime.js (better reflects content: fxOn/fxOffAll/fxEsc
   + THEMES). Update ordered_files in both Python files and README.

Authored with Claude
quic-boyuc and others added 25 commits May 5, 2026 12:34
Surfaces the inherent 2-step design (runtime collection → JSON, JSON → HTML)
explicitly in the CLI. Adds a `visualize` subcommand that converts an existing
JSON to HTML without re-running the export script, and a `--json-only` flag for
CI/storage use cases. Fixes export order so JSON is always written before HTML.
Adds USAGE.md covering zero-config e2e, two-step workflow, lens config, manual
collection points, and demo script modes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes highlighted nodes (path, selected edge, search candidates, target)
to use scale-relative sizes instead of fixed pixel values, so highlights
remain visible at all zoom levels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the 983-line technical-reference README with a progressive-disclosure
document (concept -> workflow -> usage -> extension -> reference links). Moves
all contract tables, API references, JS callbacks, and performance notes into
a new REFERENCE.md. New README introduces Observatory to new users before
diving into architecture details.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move the Observatory debugging framework and fx_viewer graph visualization
from backends/qualcomm/ to devtools/ so all ExecuTorch backends can use
them. Backend-specific patches (QNN ptq_calibrate, XNNPACK quantize, QNN
dataset capture) are extracted into their respective backend directories
with a register_backend_patches() hook mechanism.

Each backend maintains its own CLI runner (--accuracy opt-in flag):
- python -m executorch.devtools.observatory (generic)
- python -m executorch.backends.qualcomm.debugger.observatory (QNN)
- python -m executorch.backends.xnnpack.debugger.observatory (XNNPACK)

Includes RFC document at devtools/observatory/RFC_OBSERVATORY_SHARED_FRAMEWORK.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add observe_pass, a decorator that wraps any PassBase subclass or callable
pass to automatically collect input/output graphs via Observatory. Names
are derived from the class/function name and deduplicated by collect()
itself (#2, #3, ...) so repeated calls never silently overwrite records.
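
The wrapping and name deduplication can be sketched with a toy store. This is not the shipped decorator: `ToyStore`, `observe_pass_sketch`, the explicit `store` argument, the "/input" and "/output" suffixes, and the exact " #2" name format are all assumptions, and only function passes are handled here.

```python
import functools
from typing import Any, Callable, Dict


class ToyStore:
    """Stand-in for the capture store; collect() never overwrites a record."""

    def __init__(self) -> None:
        self.records: Dict[str, Any] = {}

    def collect(self, name: str, artifact: Any) -> str:
        unique, n = name, 1
        while unique in self.records:
            n += 1
            unique = f"{name} #{n}"  # assumed suffix format
        self.records[unique] = artifact
        return unique


def observe_pass_sketch(store: ToyStore) -> Callable:
    """Decorator factory: record a pass's input and output graphs."""
    def decorator(pass_fn: Callable) -> Callable:
        @functools.wraps(pass_fn)
        def wrapper(graph):
            store.collect(f"{pass_fn.__name__}/input", graph)
            result = pass_fn(graph)
            store.collect(f"{pass_fn.__name__}/output", result)
            return result
        return wrapper
    return decorator


store = ToyStore()


@observe_pass_sketch(store)
def fold_constants(graph):
    return graph + ["folded"]


fold_constants(["x"])
fold_constants(["y"])  # second call gets " #2" names instead of overwriting
```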

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ayer accuracy

Add `default_sync` field to `GraphCompareSpec` so lenses can declare
their preferred compare sync strategy. The per-layer accuracy lens now
defaults to `mode: "layer"` on `sparse_match_key` instead of the
generic auto mode, giving immediate node correspondence when comparing
records side-by-side.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the vendored grandalf fork with the Rust-backed fast-sugiyama
package and rewrite the post-layout compaction in exporter.py to
produce a natural top-down edge flow.

Layout changes in _compute_layout_with_ext_lines / _compact_components:
- Drive layout via fast-sugiyama's from_edges + rect_pack_layouts.
- Run per-layer spine cohesion (chain detection across real and dummy
  nodes, iterative mean-pull, pure-A overlap repair) so linear op
  chains render as straight spines.
- Flip Phase 6's y so graph inputs / placeholders land at the top of
  the canvas and outputs / sinks at the bottom.
- Anchor edge endpoints to the bottom-midpoint of the source and
  top-midpoint of the target, giving each node a predictable dock.

Also refresh the README's dependency / layout section to point at
fast-sugiyama[all] and drop the grandalf install instructions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Use chain detection and iterative spine refinement to avoid overlap
and restore compactness and alignment.

2. Chain detection leverages node depth (rank) to detect the longest
possible chains.

3. Chains are attracted to each other by sharing start and end nodes; in
spine refinement, the mean attraction center is shifted based on the end nodes.
…st-aware node text

The per-layer accuracy lens was coloring nodes and table cells with a color
range recomputed independently per record, so the same cosine/PSNR/MSE/abs_err
value mapped to different colors in different records and broke side-by-side
comparison. Separately, fx_viewer was drawing node labels in theme.text
regardless of node fill, so extension layers (e.g. this lens) that assigned
dark or near-white fills produced unreadable text.

Changes:

devtools/observatory/lenses/per_layer_accuracy.py — `_MetricNumericColorRule`
now accepts an optional `fixed_range`; range resolution is factored into
`_resolve_range()`. `analyze()` aggregates per-metric (vmin, vmax) across every
record's rows once via `_aggregate_metric_ranges()`, stashes them in
`AnalysisResult.global_data["metric_ranges"]`, and threads the cosine_sim range
into each per-record `_build_metric_extension(...)`. The HTML metrics table is
now parameterized by `metric_ranges` and the frontend `record()` reads them
from `analysis["global"]`, so table and graph extension share one scale and
the 5-point legend auto-reflects the unified range. Per-record fallback is
preserved for callers without analysis context.
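
The effect of a shared range can be sketched as follows. The function names and the flat `{record: [row, ...]}` input shape are hypothetical simplifications of the lens's digests; the point is only that a fixed, cross-record (vmin, vmax) gives the same value the same position on the color scale in every record.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, float]  # one node's metrics, e.g. {"cosine_sim": 0.98}


def aggregate_metric_ranges(
    records: Dict[str, List[Row]]
) -> Dict[str, Tuple[float, float]]:
    """Union of (vmin, vmax) per metric across every record's rows."""
    ranges: Dict[str, Tuple[float, float]] = {}
    for rows in records.values():
        for row in rows:
            for metric, value in row.items():
                lo, hi = ranges.get(metric, (value, value))
                ranges[metric] = (min(lo, value), max(hi, value))
    return ranges


def color_position(value: float, local_values: Sequence[float],
                   fixed_range: Optional[Tuple[float, float]] = None) -> float:
    """Normalize to [0, 1]; falls back to the per-record range if none given."""
    if fixed_range is not None:
        vmin, vmax = fixed_range
    else:
        vmin, vmax = min(local_values), max(local_values)
    if vmax == vmin:
        return 0.0
    return (value - vmin) / (vmax - vmin)
```

With per-record ranges, a cosine similarity of 0.5 lands at opposite ends of the scale in a record spanning [0.5, 1.0] and one spanning [0.0, 0.5]; with the aggregated range it lands at the same point in both.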

devtools/fx_viewer/templates/runtime.js — new `fxReadableTextColor(hex)`
helper using WCAG 2.x relative luminance (0.2126·R + 0.7152·G + 0.0722·B on
linearized sRGB). Returns `#111111` or `#f8f8f8` by higher-contrast test, or
`null` on malformed input so callers can fall back to theme defaults. Palette
matches the lens HTML table's `_text_color_for_bg` for visual consistency.
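
The contrast test can be sketched in Python using the WCAG 2.x relative luminance and contrast ratio formulas. The shipped helper is JavaScript; the function names below are illustrative, and accepting only six-digit hex strings is an assumption about what counts as well-formed input.

```python
import re
from typing import Optional

_HEX_RE = re.compile(r"^#?([0-9a-fA-F]{6})$")
DARK_INK, LIGHT_INK = "#111111", "#f8f8f8"  # the two inks named above


def _linearize(channel: int) -> float:
    """sRGB channel (0-255) to linear light, per WCAG 2.x."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> Optional[float]:
    """WCAG relative luminance, or None if the string is not #rrggbb."""
    if not isinstance(hex_color, str):
        return None
    match = _HEX_RE.match(hex_color.strip())
    if match is None:
        return None
    digits = match.group(1)
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(lum_a: float, lum_b: float) -> float:
    hi, lo = max(lum_a, lum_b), min(lum_a, lum_b)
    return (hi + 0.05) / (lo + 0.05)


def readable_text_color(fill_hex: str) -> Optional[str]:
    """Pick whichever ink has the higher contrast against the fill."""
    fill = relative_luminance(fill_hex)
    if fill is None:
        return None  # caller falls back to the theme default
    dark = contrast_ratio(fill, relative_luminance(DARK_INK))
    light = contrast_ratio(fill, relative_luminance(LIGHT_INK))
    return DARK_INK if dark >= light else LIGHT_INK
```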

devtools/fx_viewer/templates/canvas_renderer.js — the node draw loop now
tracks `renderedFill` through the selected/preview/hovered/input/output
shading branches and picks the label ink from `fxReadableTextColor(renderedFill)`
whenever the node carries a custom `fill_color` from any extension; nodes
without an extension color keep the theme default. No payload/schema changes,
so the contrast rule applies retroactively to every existing extension layer.

devtools/observatory/tests/test_per_layer_accuracy_lens.py — added
`test_analyze_produces_unified_metric_ranges_across_records`: builds two
synthetic digests, asserts `global_data["metric_ranges"]` spans the union
across records, and asserts that a node with the same metric value in both
records receives the same `fill_color` (proving the color no longer depends
on which record the node happens to appear in).

Review order: the lens module is the load-bearing change — start there, then
the matching test, then the two fx_viewer templates which are local and
additive.

Authored with Claude Code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@quic-boyuc quic-boyuc force-pushed the boyuc/observatory_draft_demo branch from 2f5a882 to 6a3afad Compare May 5, 2026 04:34