OMPE-95818: Fix bimodal Dexsuite-Lift success rate#6155
Conversation
Kuka-Allegro-Lift converged to either ~0.0 or ~0.8 success depending on seed. The reward landscape had gradient gaps: a narrow reach kernel and farmable constant grip reward that made holding the cube in place a plateau. Reshaped the reward so progress toward the goal is the dominant, unfarmable signal: - Add delivery_progress (weight 6): contact-gated reward for net progress of the object toward the commanded goal; unfarmable (clamped, latched). - Fix the approach desert on the fingers-to-object reach reward: widen its tanh kernel (std 0.4→0.8) and raise its pre-grasp reach credit via a new no_contact_scale param on object_ee_distance (default 0.1, behaviour unchanged elsewhere; 0.3 here). - Widen the object-to-goal position-tracking kernel (std 0.2→0.5) to broaden the gradient near the goal. - Decay grip rewards after first grasp (floor 0.3 over 200 steps) so holding without delivering is no longer farmable. - Add pure-logic CPU unit tests for the decay and delivery_progress terms (latch, decay-to-floor, no-regrip-reset, contact gating, d0 capture/refresh).
Greptile SummaryThis PR reshapes the Dexsuite Kuka-Allegro-Lift reward to eliminate bimodal convergence (seeds collapsing to ~0 or ~0.8 success). The core insight is that constant grip rewards created an unfarmable plateau; the fix introduces a dominant contact-gated delivery-progress signal and time-since-grasp decay on all grip rewards.
Confidence Score: 4/5Safe to merge; the new reward terms are mathematically correct and well-tested, with changes confined to the Dexsuite Kuka-Allegro reward stack. The decay latch, floor clamp, no-regrip-reset, and d0 capture/refresh logic are all implemented correctly and covered by unit tests. The one open question is that d0 is fixed at episode start before any contact, so if the object drifts away from the goal during approach the dominant delivery_progress reward stays dark until the object re-crosses the original threshold. Configuration wiring follows established patterns and the init.pyi stub is kept in sync. rewards.py — specifically the d0 capture logic in delivery_progress.call and the double contacts() call in object_ee_distance_decay, both of which are correct but merit a second read. Important Files Changed
|
| # (Re)capture d0 on reset or goal resample. Updated unconditionally (vectorised) to avoid a | ||
| # per-step host-device sync; unflagged envs keep their stored d0/goal via the masked where. | ||
| capture = self._need_capture | (torch.linalg.norm(des_pos_w - self._goal_w, dim=1) > 1e-4) | ||
| self._d0 = torch.where(capture, d.clamp(min=1e-3), self._d0) | ||
| self._goal_w = torch.where(capture.unsqueeze(-1), des_pos_w, self._goal_w) | ||
| self._need_capture &= ~capture | ||
| progress = ((self._d0 - d) / self._d0).clamp(0.0, 1.0) | ||
| return progress * contacts(env, contact_threshold, thumb_name, finger_names).float() |
There was a problem hiding this comment.
d0 is captured contact-free, so a pre-grasp object bump can permanently raise the progress baseline
d0 is (re)captured unconditionally on reset or goal resample before any contact occurs. If during the approach phase the robot accidentally nudges the object away from the goal, d exceeds d0, causing progress = clamp((d0-d)/d0, 0, 1) = 0 until the object is returned past the original d0 threshold. The dominant delivery_progress (weight 6) stays dark until then, potentially creating a reward dead zone. Consider capturing d0 only at first contact, or refreshing it each step while contacts == False. Was the decision to fix d0 at episode start (rather than at first-contact) intentional, knowing that pre-grasp object displacement could create a zero-reward zone before d < d0?
| term = rewards.good_finger_contact_decay.__new__(rewards.good_finger_contact_decay) | ||
| term._t_grasp = torch.full((num_envs,), -1, dtype=torch.long) | ||
| return term |
There was a problem hiding this comment.
__new__ bypass skips ManagerTermBase.__init__ and future subclass init logic
All decay-term tests construct instances via ClassName.__new__(ClassName) and manually populate _t_grasp / _d0 / etc., sidestepping ManagerTermBase.__init__. If initialization logic is ever added to _GraspAgeDecay.__init__ or delivery_progress.__init__, the tests will silently not exercise it. Adding a smoke-test that calls the real __init__ with a minimal stub config and env would make the suite more future-proof.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Description
Kuka-Allegro-Lift converged to either ~0.0 or ~0.8 success depending on seed. The reward landscape had gradient gaps: a narrow reach kernel and farmable constant grip reward that made holding the cube in place a plateau. Reshaped the reward so progress toward the goal is the dominant, unfarmable signal:
delivery_progress(weight 6): contact-gated reward for net progress of the object toward the commanded goal; unfarmable (clamped, latched).fingers-to-objectreach reward: widen its tanh kernel (std 0.4 to 0.8) and raise its pre-grasp reach credit via a newno_contact_scaleparam onobject_ee_distance(default 0.1, behavior unchanged elsewhere; 0.3 here).delivery_progressterms (latch, decay-to-floor, no-regrip-reset, contact gating, d0 capture/refresh).Type of change
Screenshots
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there