Skip to content

🐛 Fix RL Training and Improve Structure#573

Merged
flowerthrower merged 27 commits intoqce-experimentsfrom
bugfix
Mar 11, 2026
Merged

🐛 Fix RL Training and Improve Structure#573
flowerthrower merged 27 commits intoqce-experimentsfrom
bugfix

Conversation

@flowerthrower
Copy link
Copy Markdown
Collaborator

@flowerthrower flowerthrower commented Jan 22, 2026

Description

This PR addresses critical bugs in the RL training process with the following key changes:

Structure Improvements:

  • Redesigned action validation logic (predictorenv.py): Rewrote determine_valid_actions_for_state() with a more structured (but equivalent) state machine that explicitly tracks three circuit states (synthesized, laid_out, routed) and handles 6 different state combinations.
    - Added helper methods is_circuit_laid_out() and is_circuit_routed() to replace the buggy CheckMap pass with more reliable state checking. The new logic supports both the original restricted MDP and a flexible general MDP mode.

  • Fixed type annotation (actions.py): Corrected do_while parameter type from dict[str, Circuit] to PropertySet and added missing import for Qiskit's PropertySet.

  • Added reproducibility (predictor.py): Set random seed for non-test training runs to ensure reproducible results.

  • Improved VF2Layout error handling (predictorenv.py): Replaced assertion failures with warning logs when VF2Layout doesn't find a solution, preventing crashes during training.

Test Updates:

  • Suppressed deprecation warnings in tket routing test

@denialhaag
Copy link
Copy Markdown
Member

denialhaag commented Jan 25, 2026

It looks like #572 has created some conflicts here. Since the ty errors were quite ugly to fix, let me know if you want me to take care of the rebase so you don't have to come up with the same workarounds. 🙂

@flowerthrower flowerthrower changed the title Fix RL training bugs 🐛 Fix RL Training and Improve Structure Feb 23, 2026
@flowerthrower
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 23, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 23, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Enhanced circuit state awareness with explicit validation for layout and routing phases.
  • Improvements

    • Increased training reproducibility through deterministic initialization.
    • Improved robustness in post-layout handling with graceful error management.

Walkthrough

This pull request updates action validation in the reinforcement learning module by changing the do_while predicate to use PropertySet instead of circuit dictionaries, adds deterministic seeding to the training path for reproducibility, introduces three new circuit state validation methods to explicitly check layout and routing status, relaxes a strict assertion in layout postprocessing to a warning, and updates a test to use an additional parameter in the pytket conversion function.

Changes

Cohort / File(s) Summary
Type System Updates
src/mqt/predictor/rl/actions.py
Updated do_while callable signature from Callable[[dict[str, Circuit]], bool] to Callable[[PropertySet], bool] with added PropertySet import under TYPE_CHECKING.
Environment State Validation
src/mqt/predictor/rl/predictorenv.py
Added three new methods: is_circuit_laid_out (checks if all logical qubits assigned to physical qubits), is_circuit_routed (validates directed two-qubit gates on device coupling map), and determine_valid_actions_for_state (returns valid action list based on circuit state). Replaced hard assertion in VF2Layout postprocessing with warning. Updated imports to include Layout and removed CheckMap.
Training Configuration
src/mqt/predictor/rl/predictor.py
Set random seed to 0 in non-test training path to ensure deterministic behavior alongside existing test-mode seeding.
Test Updates
tests/compilation/test_integration_further_SDKs.py
Updated test_tket_routing to call tk_to_qiskit with perm_warning=False parameter.

Sequence Diagram(s)

sequenceDiagram
    participant Agent as RL Agent
    participant Env as PredictorEnv
    participant Circuit as QuantumCircuit
    participant Layout as Layout Object
    participant CouplingMap as Device CouplingMap
    
    Agent->>Env: Request valid actions for current state
    Env->>Circuit: Check circuit properties
    Env->>Layout: Retrieve layout information
    Env->>Env: is_circuit_laid_out(circuit, layout)
    rect rgba(100, 150, 200, 0.5)
        Note over Env: Verify all logical qubits<br/>mapped to physical qubits
    end
    Env->>CouplingMap: Get device coupling map
    Env->>Env: is_circuit_routed(circuit, coupling_map)
    rect rgba(100, 150, 200, 0.5)
        Note over Env: Verify directed 2-qubit gates<br/>respect device topology
    end
    Env->>Env: determine_valid_actions_for_state()
    rect rgba(100, 150, 200, 0.5)
        Note over Env: Synthesize valid actions based on<br/>synthesis, layout, routing, mapping states
    end
    Env->>Agent: Return list of valid action indices
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 The circuit takes its rightful place,
With validation at each trace,
State checks bloom where assertions fell,
PropertySet's tale to tell,
Our layout dances, routed true! 🎭

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title uses emoji and is too vague; 'Fix RL Training and Improve Structure' is generic and doesn't clearly convey the main changes or scope of the PR. Replace with a more descriptive, emoji-free title that specifically mentions the key fixes, such as: 'Fix action validation state machine and type annotations in RL training'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description comprehensively covers the key changes across all modified files, explains the motivation for fixes, and includes details about bug resolutions and improvements. However, the checklist items are not explicitly marked as completed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bugfix

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 435-447: The current is_circuit_laid_out only checks qubits that
appear in instructions and misses idle logical qubits; update
is_circuit_laid_out (and its TranspileLayout handling) to validate every logical
qubit from the circuit (e.g., iterate circuit.qubits or
range(circuit.num_qubits)) against layout.get_virtual_bits() instead of
iterating only instr.qubits, and return False if any circuit qubit is not
present in v2p; keep the existing fallback to layout.final_layout or
layout.initial_layout and handle missing/None layout gracefully.
- Around line 360-363: Replace the eager string concatenation in the VF2Layout
check with logger-style formatting: instead of adding
pm.property_set["VF2Layout_stop_reason"] to the message, call logger.warning
with a format string and pass pm.property_set["VF2Layout_stop_reason"] as a
separate argument (referencing VF2Layout_stop_reason, VF2LayoutStopReason,
pm.property_set and logger.warning in predictorenv.py) so the message is
constructed lazily and satisfies Ruff G003.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26e3b01 and 825773f.

📒 Files selected for processing (4)
  • src/mqt/predictor/rl/actions.py
  • src/mqt/predictor/rl/predictor.py
  • src/mqt/predictor/rl/predictorenv.py
  • tests/compilation/test_integration_further_SDKs.py

Comment thread src/mqt/predictor/rl/predictorenv.py
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment on lines +435 to +447
def is_circuit_laid_out(self, circuit: QuantumCircuit, layout: TranspileLayout | Layout) -> bool:
"""True if every logical qubit in the circuit has a physical assignment."""
if isinstance(layout, TranspileLayout):
# Use final_layout if available; otherwise fallback to initial_layout
layout = layout.final_layout or layout.initial_layout

if not only_nat_gates:
actions = self.actions_synthesis_indices + self.actions_opt_indices
if self.layout is not None:
actions += self.actions_routing_indices
return actions
v2p = layout.get_virtual_bits()
for instr in circuit.data:
for q in instr.qubits:
if q not in v2p:
# Logical qubit not assigned
return False
return True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Validate all logical qubits, not just those appearing in instructions.
The docstring says “every logical qubit”; idle qubits are currently skipped.

🔧 Suggested fix
-        for instr in circuit.data:
-            for q in instr.qubits:
-                if q not in v2p:
-                    # Logical qubit not assigned
-                    return False
+        for q in circuit.qubits:
+            if q not in v2p:
+                # Logical qubit not assigned
+                return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mqt/predictor/rl/predictorenv.py` around lines 435 - 447, The current
is_circuit_laid_out only checks qubits that appear in instructions and misses
idle logical qubits; update is_circuit_laid_out (and its TranspileLayout
handling) to validate every logical qubit from the circuit (e.g., iterate
circuit.qubits or range(circuit.num_qubits)) against layout.get_virtual_bits()
instead of iterating only instr.qubits, and return False if any circuit qubit is
not present in v2p; keep the existing fallback to layout.final_layout or
layout.initial_layout and handle missing/None layout gracefully.

@flowerthrower
Copy link
Copy Markdown
Collaborator Author

Problem

After a Qiskit layout pass (e.g., DenseLayout), the MDP state became:

  • synthesized = True
  • laid_out = True
  • routed = False

determine_valid_actions_for_state() correctly allowed only routing actions.

However, action_masks() filtered out all TKET actions because self.layout was not None.
Since the only registered routing action was RoutingPass (from TKET), this removed the last available routing option and produced an all-False action mask.

Consequence in MaskablePPO

sb3-contrib's MaskablePPO handles an all-False mask by setting all logits to a large negative number instead of negative infinity.

Applying softmax to these equal values results in a uniform distribution, meaning every action — including terminate — had a probability of about 1/22 (~4.5%).

During training with 1000 timesteps, the probability that terminate would never be sampled is extremely small (<0.003%), making a crash almost inevitable.

This never came up in tests that only executed 100 steps.

Crash Mechanism

When the agent sampled terminate in this deadlocked state:

  1. step() re-evaluated determine_valid_actions_for_state()
  2. It correctly detected routed = False
  3. reward calculation failed

@flowerthrower flowerthrower changed the base branch from main to qce-experiments March 11, 2026 08:12
@flowerthrower flowerthrower added bug Something isn't working refactor PR or issues that refactor code labels Mar 11, 2026
@mergify
Copy link
Copy Markdown

mergify Bot commented Mar 11, 2026

🧪 CI Insights

Here's what we observed from your CI run for 07e7338.

🟢 All jobs passed!

But CI Insights is watching 👀

@flowerthrower flowerthrower marked this pull request as ready for review March 11, 2026 09:24
Copilot AI review requested due to automatic review settings March 11, 2026 09:24
@flowerthrower
Copy link
Copy Markdown
Collaborator Author

This PR will be merged into the qce-experiments branch with failing windows tests. The rational here is that all other platforms are succeeding and we want to make fast progress on the experiments branch. Furthermore, we have a strong suspicion that the problem will be resolved by subsequent experiment PRs anyway.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the reinforcement-learning compilation pipeline to better track layout/routing state, improves Qiskit/TKET interop behavior, and refines training/reward calculation logic to support more robust RL-based compilation.

Changes:

  • Extend PredictorEnv with an mdp mode and new synthesized/layouted/routed state checks to drive valid-action selection.
  • Update TKET→Qiskit conversions to explicitly replace implicit swaps (env + integration test).
  • Adjust RL training defaults (esp. for tests) and simplify reward qubit indexing via QuantumCircuit.find_bit().

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/mqt/predictor/rl/predictorenv.py Adds mdp mode + new action-validity logic; updates layout/routing handling and unitary decomposition behavior.
src/mqt/predictor/rl/predictor.py Tweaks test training hyperparameters and makes seeding deterministic in non-test training.
src/mqt/predictor/rl/actions.py Updates DeviceDependentAction.do_while typing to match Qiskit PropertySet.
src/mqt/predictor/reward.py Uses find_bit() for qubit indices and adds a clearer error message for missing 2Q gate calibration entries.
tests/compilation/test_predictor_rl.py Increases RL training timesteps used in the integration-style test that generates models.
tests/compilation/test_integration_further_SDKs.py Updates TKET routing test conversion to replace implicit swaps.
Comments suppressed due to low confidence (1)

src/mqt/predictor/rl/predictor.py:112

  • set_random_seed(0) is now called unconditionally in the non-test training path, which forces deterministic training for all normal usage. This is a behavioral change that may be surprising for users expecting stochastic training runs. Consider making the seed configurable (parameter/env var) or limiting the hard-coded seed to test=True only.
        else:
            set_random_seed(0)
            # default PPO values
            n_steps = 2048
            n_epochs = 10
            batch_size = 64
            progress_bar = True

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +100 to +102
logger.info("MDP: " + mdp)
self.mdp = mdp

Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mdp is accepted as an arbitrary string, but later logic assumes only "paper" or "flexible". If any other value is passed, determine_valid_actions_for_state() can return an empty list and step() will raise RuntimeError("No valid actions left."). Consider validating mdp in __init__ (e.g., Literal/Enum + ValueError on unknown values) to fail fast with a clear error.

Copilot uses AI. Check for mistakes.
Comment thread src/mqt/predictor/rl/predictorenv.py Outdated
Comment on lines +456 to +460
for instr in circuit.data:
for q in instr.qubits:
if q not in v2p:
# Logical qubit not assigned
return False
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_circuit_laid_out() claims to verify that every logical qubit has a physical assignment, but it only checks qubits that appear in instructions (for instr in circuit.data). This can return True even if some idle logical qubits in circuit.qubits are not present in the layout. If you really want “every logical qubit”, iterate over circuit.qubits (or otherwise ensure the layout covers all qubits), not just those used by operations.

Suggested change
for instr in circuit.data:
for q in instr.qubits:
if q not in v2p:
# Logical qubit not assigned
return False
for q in circuit.qubits:
if q not in v2p:
# Logical qubit not assigned
return False

Copilot uses AI. Check for mistakes.
Comment on lines +518 to +525
# Initial state
if not synthesized and not laid_out and not routed:
if self.mdp == "flexible":
actions.extend(self.actions_synthesis_indices)
actions.extend(self.actions_mapping_indices)
actions.extend(self.actions_layout_indices)
actions.extend(self.actions_opt_indices)
if self.mdp == "paper":
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new mdp modes, the initial-state policy in determine_valid_actions_for_state() can include mapping/layout actions (for mdp == "flexible"), but reset() still initializes self.valid_actions to opt + synthesis only. This means the action mask at the first decision step ignores the mdp setting. Consider setting self.valid_actions = self.determine_valid_actions_for_state() during reset() so the first action is consistent with the selected MDP.

Copilot uses AI. Check for mistakes.

predictor.train_model(
timesteps=100,
timesteps=1000,
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now trains with timesteps=1000 (previously 100). Since train_model() runs PPO rollouts and is executed during the test suite, this can significantly increase CI runtime and make the suite flaky/time out. Consider keeping timesteps minimal for CI (and/or gating longer training behind a pytest.mark.slow or an env flag) while still asserting the intended behavior.

Suggested change
timesteps=1000,
timesteps=100,

Copilot uses AI. Check for mistakes.
@flowerthrower flowerthrower merged commit 90ec2cf into qce-experiments Mar 11, 2026
10 of 11 checks passed
@flowerthrower flowerthrower deleted the bugfix branch March 11, 2026 11:24
@coderabbitai coderabbitai Bot mentioned this pull request Apr 13, 2026
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working refactor PR or issues that refactor code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants