🐛 Fix RL Training and Improve Structure by flowerthrower · Pull Request #573 · munich-quantum-toolkit/predictor

flowerthrower · 2026-01-22T13:35:29Z

Description

This PR addresses critical bugs in the RL training process with the following key changes:

Structure Improvements:

Redesigned action validation logic (predictorenv.py): Rewrote determine_valid_actions_for_state() with a more structured (but equivalent) state machine that explicitly tracks three circuit states (synthesized, laid_out, routed) and handles 6 different state combinations.
- Added helper methods is_circuit_laid_out() and is_circuit_routed() to replace the buggy CheckMap pass with more reliable state checking. The new logic supports both the original restricted MDP and a flexible general MDP mode.
Fixed type annotation (actions.py): Corrected do_while parameter type from dict[str, Circuit] to PropertySet and added missing import for Qiskit's PropertySet.
Added reproducibility (predictor.py): Set random seed for non-test training runs to ensure reproducible results.
Improved VF2Layout error handling (predictorenv.py): Replaced assertion failures with warning logs when VF2Layout doesn't find a solution, preventing crashes during training.

Test Updates:

Suppressed deprecation warnings in tket routing test

denialhaag · 2026-01-25T10:15:55Z

It looks like #572 has created some conflicts here. Since the ty errors were quite ugly to fix, let me know if you want me to take care of the rebase so you don't have to come up with the same workarounds. 🙂

flowerthrower · 2026-02-23T20:39:05Z

@coderabbitai full review

coderabbitai · 2026-02-23T20:39:11Z

✅ Actions performed

Full review triggered.

coderabbitai · 2026-02-23T20:47:14Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Enhanced circuit state awareness with explicit validation for layout and routing phases.
Improvements
- Increased training reproducibility through deterministic initialization.
- Improved robustness in post-layout handling with graceful error management.

Walkthrough

This pull request updates action validation in the reinforcement learning module by changing the do_while predicate to use PropertySet instead of circuit dictionaries, adds deterministic seeding to the training path for reproducibility, introduces three new circuit state validation methods to explicitly check layout and routing status, relaxes a strict assertion in layout postprocessing to a warning, and updates a test to use an additional parameter in the pytket conversion function.

Changes

Cohort / File(s)	Summary
Type System Updates `src/mqt/predictor/rl/actions.py`	Updated `do_while` callable signature from `Callable[[dict[str, Circuit]], bool]` to `Callable[[PropertySet], bool]` with added PropertySet import under TYPE_CHECKING.
Environment State Validation `src/mqt/predictor/rl/predictorenv.py`	Added three new methods: `is_circuit_laid_out` (checks if all logical qubits assigned to physical qubits), `is_circuit_routed` (validates directed two-qubit gates on device coupling map), and `determine_valid_actions_for_state` (returns valid action list based on circuit state). Replaced hard assertion in VF2Layout postprocessing with warning. Updated imports to include Layout and removed CheckMap.
Training Configuration `src/mqt/predictor/rl/predictor.py`	Set random seed to 0 in non-test training path to ensure deterministic behavior alongside existing test-mode seeding.
Test Updates `tests/compilation/test_integration_further_SDKs.py`	Updated `test_tket_routing` to call `tk_to_qiskit` with `perm_warning=False` parameter.

Sequence Diagram(s)

sequenceDiagram
    participant Agent as RL Agent
    participant Env as PredictorEnv
    participant Circuit as QuantumCircuit
    participant Layout as Layout Object
    participant CouplingMap as Device CouplingMap
    
    Agent->>Env: Request valid actions for current state
    Env->>Circuit: Check circuit properties
    Env->>Layout: Retrieve layout information
    Env->>Env: is_circuit_laid_out(circuit, layout)
    rect rgba(100, 150, 200, 0.5)
        Note over Env: Verify all logical qubits<br/>mapped to physical qubits
    end
    Env->>CouplingMap: Get device coupling map
    Env->>Env: is_circuit_routed(circuit, coupling_map)
    rect rgba(100, 150, 200, 0.5)
        Note over Env: Verify directed 2-qubit gates<br/>respect device topology
    end
    Env->>Env: determine_valid_actions_for_state()
    rect rgba(100, 150, 200, 0.5)
        Note over Env: Synthesize valid actions based on<br/>synthesis, layout, routing, mapping states
    end
    Env->>Agent: Return list of valid action indices

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 The circuit takes its rightful place,
With validation at each trace,
State checks bloom where assertions fell,
PropertySet's tale to tell,
Our layout dances, routed true! 🎭

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title uses emoji and is too vague; 'Fix RL Training and Improve Structure' is generic and doesn't clearly convey the main changes or scope of the PR.	Replace with a more descriptive, emoji-free title that specifically mentions the key fixes, such as: 'Fix action validation state machine and type annotations in RL training'.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description comprehensively covers the key changes across all modified files, explains the motivation for fixes, and includes details about bug resolutions and improvements. However, the checklist items are not explicitly marked as completed.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch bugfix

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 435-447: The current is_circuit_laid_out only checks qubits that
appear in instructions and misses idle logical qubits; update
is_circuit_laid_out (and its TranspileLayout handling) to validate every logical
qubit from the circuit (e.g., iterate circuit.qubits or
range(circuit.num_qubits)) against layout.get_virtual_bits() instead of
iterating only instr.qubits, and return False if any circuit qubit is not
present in v2p; keep the existing fallback to layout.final_layout or
layout.initial_layout and handle missing/None layout gracefully.
- Around line 360-363: Replace the eager string concatenation in the VF2Layout
check with logger-style formatting: instead of adding
pm.property_set["VF2Layout_stop_reason"] to the message, call logger.warning
with a format string and pass pm.property_set["VF2Layout_stop_reason"] as a
separate argument (referencing VF2Layout_stop_reason, VF2LayoutStopReason,
pm.property_set and logger.warning in predictorenv.py) so the message is
constructed lazily and satisfies Ruff G003.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26e3b01 and 825773f.

📒 Files selected for processing (4)

src/mqt/predictor/rl/actions.py
src/mqt/predictor/rl/predictor.py
src/mqt/predictor/rl/predictorenv.py
tests/compilation/test_integration_further_SDKs.py

coderabbitai · 2026-02-23T20:47:17Z

+    def is_circuit_laid_out(self, circuit: QuantumCircuit, layout: TranspileLayout | Layout) -> bool:
+        """True if every logical qubit in the circuit has a physical assignment."""
+        if isinstance(layout, TranspileLayout):
+            # Use final_layout if available; otherwise fallback to initial_layout
+            layout = layout.final_layout or layout.initial_layout

-        if not only_nat_gates:
-            actions = self.actions_synthesis_indices + self.actions_opt_indices
-            if self.layout is not None:
-                actions += self.actions_routing_indices
-            return actions
+        v2p = layout.get_virtual_bits()
+        for instr in circuit.data:
+            for q in instr.qubits:
+                if q not in v2p:
+                    # Logical qubit not assigned
+                    return False
+        return True


⚠️ Potential issue | 🟡 Minor

Validate all logical qubits, not just those appearing in instructions.
The docstring says “every logical qubit”; idle qubits are currently skipped.

🔧 Suggested fix

- for instr in circuit.data: - for q in instr.qubits: - if q not in v2p: - # Logical qubit not assigned - return False + for q in circuit.qubits: + if q not in v2p: + # Logical qubit not assigned + return False

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mqt/predictor/rl/predictorenv.py` around lines 435 - 447, The current is_circuit_laid_out only checks qubits that appear in instructions and misses idle logical qubits; update is_circuit_laid_out (and its TranspileLayout handling) to validate every logical qubit from the circuit (e.g., iterate circuit.qubits or range(circuit.num_qubits)) against layout.get_virtual_bits() instead of iterating only instr.qubits, and return False if any circuit qubit is not present in v2p; keep the existing fallback to layout.final_layout or layout.initial_layout and handle missing/None layout gracefully.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Patrick Hopf <81010725+flowerthrower@users.noreply.github.com>

flowerthrower · 2026-03-04T19:10:49Z

Problem

After a Qiskit layout pass (e.g., DenseLayout), the MDP state became:

synthesized = True
laid_out = True
routed = False

determine_valid_actions_for_state() correctly allowed only routing actions.

However, action_masks() filtered out all TKET actions because self.layout was not None.
Since the only registered routing action was RoutingPass (from TKET), this removed the last available routing option and produced an all-False action mask.

Consequence in MaskablePPO

sb3-contrib's MaskablePPO handles an all-False mask by setting all logits to a large negative number instead of negative infinity.

Applying softmax to these equal values results in a uniform distribution, meaning every action — including terminate — had a probability of about 1/22 (~4.5%).

During training with 1000 timesteps, the probability that terminate would never be sampled is extremely small (<0.003%), making a crash almost inevitable.

This never came up in tests that only executed 100 steps.

Crash Mechanism

When the agent sampled terminate in this deadlocked state:

step() re-evaluated determine_valid_actions_for_state()
It correctly detected routed = False
reward calculation failed

mergify · 2026-03-11T08:39:52Z

🧪 CI Insights

Here's what we observed from your CI run for 07e7338.

🟢 All jobs passed!

But CI Insights is watching 👀

flowerthrower · 2026-03-11T09:27:54Z

This PR will be merged into the qce-experiments branch with failing windows tests. The rational here is that all other platforms are succeeding and we want to make fast progress on the experiments branch. Furthermore, we have a strong suspicion that the problem will be resolved by subsequent experiment PRs anyway.

Copilot

Pull request overview

This PR updates the reinforcement-learning compilation pipeline to better track layout/routing state, improves Qiskit/TKET interop behavior, and refines training/reward calculation logic to support more robust RL-based compilation.

Changes:

Extend PredictorEnv with an mdp mode and new synthesized/layouted/routed state checks to drive valid-action selection.
Update TKET→Qiskit conversions to explicitly replace implicit swaps (env + integration test).
Adjust RL training defaults (esp. for tests) and simplify reward qubit indexing via QuantumCircuit.find_bit().

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`src/mqt/predictor/rl/predictorenv.py`	Adds `mdp` mode + new action-validity logic; updates layout/routing handling and unitary decomposition behavior.
`src/mqt/predictor/rl/predictor.py`	Tweaks test training hyperparameters and makes seeding deterministic in non-test training.
`src/mqt/predictor/rl/actions.py`	Updates `DeviceDependentAction.do_while` typing to match Qiskit `PropertySet`.
`src/mqt/predictor/reward.py`	Uses `find_bit()` for qubit indices and adds a clearer error message for missing 2Q gate calibration entries.
`tests/compilation/test_predictor_rl.py`	Increases RL training timesteps used in the integration-style test that generates models.
`tests/compilation/test_integration_further_SDKs.py`	Updates TKET routing test conversion to replace implicit swaps.

Comments suppressed due to low confidence (1)

src/mqt/predictor/rl/predictor.py:112

set_random_seed(0) is now called unconditionally in the non-test training path, which forces deterministic training for all normal usage. This is a behavioral change that may be surprising for users expecting stochastic training runs. Consider making the seed configurable (parameter/env var) or limiting the hard-coded seed to test=True only.

        else:
            set_random_seed(0)
            # default PPO values
            n_steps = 2048
            n_epochs = 10
            batch_size = 64
            progress_bar = True

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-11T09:29:37Z

+        logger.info("MDP: " + mdp)
+        self.mdp = mdp
+


mdp is accepted as an arbitrary string, but later logic assumes only "paper" or "flexible". If any other value is passed, determine_valid_actions_for_state() can return an empty list and step() will raise RuntimeError("No valid actions left."). Consider validating mdp in __init__ (e.g., Literal/Enum + ValueError on unknown values) to fail fast with a clear error.

Copilot · 2026-03-11T09:29:38Z

+        for instr in circuit.data:
+            for q in instr.qubits:
+                if q not in v2p:
+                    # Logical qubit not assigned
+                    return False


is_circuit_laid_out() claims to verify that every logical qubit has a physical assignment, but it only checks qubits that appear in instructions (for instr in circuit.data). This can return True even if some idle logical qubits in circuit.qubits are not present in the layout. If you really want “every logical qubit”, iterate over circuit.qubits (or otherwise ensure the layout covers all qubits), not just those used by operations.

Suggested change

for instr in circuit.data:

for q in instr.qubits:

if q not in v2p:

# Logical qubit not assigned

return False

for q in circuit.qubits:

if q not in v2p:

# Logical qubit not assigned

return False

Copilot · 2026-03-11T09:29:38Z

+        # Initial state
+        if not synthesized and not laid_out and not routed:
+            if self.mdp == "flexible":
+                actions.extend(self.actions_synthesis_indices)
+                actions.extend(self.actions_mapping_indices)
+                actions.extend(self.actions_layout_indices)
+                actions.extend(self.actions_opt_indices)
+            if self.mdp == "paper":


With the new mdp modes, the initial-state policy in determine_valid_actions_for_state() can include mapping/layout actions (for mdp == "flexible"), but reset() still initializes self.valid_actions to opt + synthesis only. This means the action mask at the first decision step ignores the mdp setting. Consider setting self.valid_actions = self.determine_valid_actions_for_state() during reset() so the first action is consistent with the selected MDP.

Copilot · 2026-03-11T09:29:38Z


    predictor.train_model(
-        timesteps=100,
+        timesteps=1000,


This test now trains with timesteps=1000 (previously 100). Since train_model() runs PPO rollouts and is executed during the test suite, this can significantly increase CI runtime and make the suite flaky/time out. Consider keeping timesteps minimal for CI (and/or gating longer training behind a pytest.mark.slow or an env flag) while still asserting the intended behavior.

Suggested change

timesteps=1000,

timesteps=100,

flowerthrower added 9 commits January 22, 2026 10:17

🐛 gracefully handle VF2Layout no-solution-found

8a86438

disable progress bar

f34a9ae

🐛 only consider valid circuits for terminal action

7a89375

🚧 debug predictor

af50584

🐛handle layout fail more gracefully

1e3d3d4

🚧 restructure state machine

25c8a4a

🚧 update state machine

7eb5138

🎨 add strict policy

37653fb

🚧 add og paper strategy

8802fbd

flowerthrower and others added 7 commits February 9, 2026 13:14

🎨 fix og policy

6db8ee7

⏪ remove thesis changes

dcc3810

Merge remote-tracking branch 'origin/main' into bugfix

440d54c

⏪ revert thesis updates

450d2dc

⏪ use og strategy

e69cf51

🐛 fix no-layout found bug

934e46c

Merge branch 'main' into bugfix

825773f

flowerthrower changed the title ~~Fix RL training bugs~~ 🐛 Fix RL Training and Improve Structure Feb 23, 2026

coderabbitai Bot requested changes Feb 23, 2026

View reviewed changes

flowerthrower and others added 7 commits February 23, 2026 22:06

Update src/mqt/predictor/rl/predictorenv.py

8b27982

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Patrick Hopf <81010725+flowerthrower@users.noreply.github.com>

🎨 docstring

729ade7

Merge branch 'main' into bugfix

983a4f6

🐛 fix routing check

de36b57

Merge commit '983a4f61b72d10af05bfe762bb72f27aa9304fac' into bugfix

bc667bd

🐛 use find qubit

7c745fe

🐛 fix no valid action bug

0d46fa9

flowerthrower and others added 2 commits March 10, 2026 17:50

🎨 add mdp attribute

7ecea77

Merge branch 'main' into bugfix

879d9cc

flowerthrower changed the base branch from main to qce-experiments March 11, 2026 08:12

flowerthrower added bug Something isn't working refactor PR or issues that refactor code labels Mar 11, 2026

flowerthrower marked this pull request as ready for review March 11, 2026 09:24

Copilot AI review requested due to automatic review settings March 11, 2026 09:24

Copilot started reviewing on behalf of flowerthrower March 11, 2026 09:24 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

flowerthrower added 2 commits March 11, 2026 10:56

🐛 update reset with flexible policy

1d26bc7

🐛 check cirucit qubits layout not instructions

07e7338

flowerthrower merged commit 90ec2cf into qce-experiments Mar 11, 2026
10 of 11 checks passed

flowerthrower deleted the bugfix branch March 11, 2026 11:24

coderabbitai Bot mentioned this pull request Apr 13, 2026

Implementation of the Predictor Tracer #641

Merged

10 tasks

Uh oh!

Conversation

flowerthrower commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

denialhaag commented Jan 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

flowerthrower commented Feb 23, 2026

Uh oh!

coderabbitai Bot commented Feb 23, 2026

Uh oh!

coderabbitai Bot commented Feb 23, 2026

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

flowerthrower commented Mar 4, 2026

Problem

Consequence in MaskablePPO

Crash Mechanism

Uh oh!

mergify Bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🧪 CI Insights

🟢 All jobs passed!

Uh oh!

flowerthrower commented Mar 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

flowerthrower commented Jan 22, 2026 •

edited

Loading

denialhaag commented Jan 25, 2026 •

edited

Loading

mergify Bot commented Mar 11, 2026 •

edited

Loading