architecture/appendix/rationale.md at dev · OpenVoiceOS/architecture

← APPENDIX.md · Non-normative

⚠️ AI-generated draft — not yet fully reviewed. This content was produced by a large language model (Claude Code) and has not yet been fully reviewed for accuracy, completeness, or consistency with the specifications. The normative specifications themselves are human-reviewed; this appendix is supplementary context. Readers should verify claims before relying on them.

4. Design rationale, per specification

Short notes on why the specifications make the choices they do — the reasoning, not the requirement. Cross-reference into the normative sections.

4.1 Intent grammar and resources (INTENT-1, -2, -3)

ASR-normalized input, no escaping (INTENT-1 §2). The grammar targets voice input. By contract, text reaching an engine is already lowercased, punctuation-stripped, and single-spaced. Bracket metacharacters therefore cannot occur as literal input, so no escape mechanism is needed. This simplification is bought by scoping the grammar to voice.
Templates are training data (INTENT-1 §4). Enumerating every phrasing is futile for natural speech. A template describes the shape of the training data; the engine generalizes. This is why expansion is defined precisely but matching is not.
An intent is not an event (INTENT-3 §1). Necessary for an open skill ecosystem — see §2.2.
Two non-interoperable methods (INTENT-3 §2). Keyword and template intents describe a command in fundamentally different shapes. Rather than forcing one model, the spec keeps both and makes engines declare which they accept. The cost is that a developer must choose per intent and know which engines an installation runs.
Slot typing is deferred (INTENT-1 §5.3). Interpreting a slot value as a number or date is inseparable from how ASR output is normalized — and normalization is not yet specified. Specifying typing first would be incoherent, so a value is, for now, an opaque sequence of words.
.blacklist vs excluded (INTENT-3 §4.2, §5.4). The template grammar is purely generative — it cannot express "not this". Template intents therefore need a separate .blacklist artifact for suppression. Keyword intents express the same idea natively with the excluded constraint role. The asymmetry follows from the grammar, not from inconsistency.
No regular expressions (INTENT-3 §4.4). Free-form structured text is a slot — use a template intent and the slot extractor. Regexes are also notoriously hard to localize, which conflicts with the per-language model.
Inline vocabulary references reuse .voc (INTENT-1 §3.7). A reusable template fragment and a keyword vocabulary are the same thing — a named, slot-free phrase set — so <name> resolves to a .voc rather than introducing a new file role. The change is one grammar token plus an expander step.
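
The no-escaping argument above can be illustrated with a minimal sketch. It assumes that "punctuation" means ASCII punctuation and that the grammar metacharacters are the bracket and pipe characters; both are assumptions for illustration, since the normalization contract is not specified in this appendix. The function names are hypothetical.

```python
import string

# Assumption: "punctuation" here means ASCII punctuation. The actual
# normalization contract is not specified in this appendix.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)

# Characters assumed to act as grammar metacharacters (illustrative set).
GRAMMAR_METACHARS = set("()[]{}<>|")


def normalize_asr(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace to single spaces."""
    stripped = text.lower().translate(_STRIP_PUNCTUATION)
    return " ".join(stripped.split())


def contains_metachar(text: str) -> bool:
    """Return True if any assumed grammar metacharacter survives in text."""
    return any(ch in GRAMMAR_METACHARS for ch in text)


utterance = normalize_asr("  Turn ON the (kitchen)  light! ")
print(utterance)                     # normalized text handed to an engine
print(contains_metachar(utterance))  # metacharacters cannot survive normalization
```

Because every assumed metacharacter is also punctuation, normalized input can never collide with template syntax, which is why the grammar needs no escape mechanism.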

4.2 Bus message envelope (MSG-1)

One spec, not two. Envelope + routing + derivations are tightly coupled — every routing key lives in context, every derivation manipulates routing, and all of them formalize existing OVOS code. Splitting them was tried; the split did not survive the derivations (which can only meaningfully be defined where the routing keys are), so they were merged into a single bus-message spec. The session carrier, by contrast, did split out cleanly into OVOS-SESSION-1.
context is extensible by design. Only the keys other systems already key behaviour off (source, destination, session) are given normative meaning. Everything else — GUI routing, tracing, security — is layered by other specs without touching the envelope.
source/destination are informational, not authorization (MSG-1 §3.3). The bus is not a security boundary. Layer-2 systems (HiveMind) build authentication and routing enforcement on top of the pair without OVOS itself learning about peers.
The boundary is user ↔ assistant, not core ↔ handler. The (source, destination) pair marks who is currently talking to whom across one boundary only: the external participant on one side, the assistant — core and every skill handler — on the other. The flip happens once per conversational turn (§3.1.1), not on every internal hop.
No central correlation, no central state (MSG-1 §5.4, §3.1.2 above). The bus is fully asynchronous. Components that need correlation or state own it themselves, keyed on session.session_id. Multi-turn conversation, intent context, cross-skill state, and similar concerns are deferred to other specifications.
Topic naming conventions (MSG-1 v2 §2.1.2). The conventions other specs in the family already follow are now codified as SHOULD-rules: dot-separated hierarchy with : reserved for component-pair shapes; stable ecosystem-identifying root; verb-tense pattern for the trailing segment; request/terminal pairs sharing a root verb (handle ↔ handled); .response suffix for response derivations; per-instance <root>.<domain>.<id>.<verb> form.
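
The user ↔ assistant boundary described above can be sketched as follows. This is a hypothetical envelope class written for illustration, not the bus client's actual API: the `forward` and `reply` names follow the derivations mentioned in this appendix, while the field names, topic strings, and copy semantics are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Message:
    """Hypothetical envelope: a topic, a payload, and an extensible context."""
    msg_type: str
    data: dict[str, Any] = field(default_factory=dict)
    context: dict[str, Any] = field(default_factory=dict)

    def forward(self, msg_type: str, data: Optional[dict] = None) -> "Message":
        """Internal hop: the context, and so the routing pair, is carried unchanged."""
        return Message(msg_type, dict(data or {}), dict(self.context))

    def reply(self, msg_type: str, data: Optional[dict] = None) -> "Message":
        """Turn boundary: swap source and destination, keep everything else."""
        ctx = dict(self.context)
        source = ctx.pop("source", None)
        destination = ctx.pop("destination", None)
        if destination is not None:
            ctx["source"] = destination
        if source is not None:
            ctx["destination"] = source
        return Message(msg_type, dict(data or {}), ctx)


# Topic names below are placeholders, not names defined by MSG-1.
utterance = Message(
    "example.utterance.handle",
    {"utterances": ["hello"]},
    {"source": "client-a", "destination": "core", "session": {"session_id": "s1"}},
)
internal_hop = utterance.forward("example.intent.matched")
answer = internal_hop.reply("example.speak", {"utterance": "hi there"})
print(answer.context["source"], "->", answer.context["destination"])
```

In this sketch the pair flips only in `reply`; any number of `forward` hops inside the assistant leave it untouched, and the session rides along in both cases.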

4.3 Session carrier (SESSION-1)

Why a separate session spec. Message.context.session is a load-bearing carrier claimed by multiple specs (PIPELINE-1, CONTEXT-1, TRANSFORM-1) — without a single owner, its wire contract drifts. SESSION-1 consolidates the wire shape and fixes a registry mechanism so future specs claim fields without amending SESSION-1 itself.
Prescriptive, not descriptive. Only the fields normatively claimed by other specs are recognized. Implementations carrying extra per-session state (current OVOS Session has persona_id, system_unit, time_format, date_format, location, is_speaking, is_recording, …) are non-normative under v1 — they ride through as opaque pass-through and can be claimed by future per-domain specs.
Omission means "let the orchestrator decide". Omission is the single deferral mechanism: an omitted field, an empty session: {}, an absent session, and an explicit session_id: "default" are treated as equivalent on the wire, and each resolves at consumption to deployment defaults filled in by the consumer. No null, no sentinels.
Language signals. Six BCP-47 fields with normative meanings but stage-dependent consolidation: lang (user preference, base), secondary_langs (additional understood languages, constrains lang-detect predictions and fallback selection), output_lang (renderer's preferred output language; simplifies the bidirectional-translation transformer to a fallback role), stt_lang / request_lang / detected_lang (per-utterance signals from STT, emitter, and lang-detect respectively). request_lang is an emitter-reported hint (per-wakeword language assignment in multi-wakeword setups), not an override.
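
Consumer-side resolution of an omitted session can be sketched as below. The default values and the `resolve_session` helper are hypothetical; a real deployment chooses its own defaults, and SESSION-1 remains the authority on field semantics.

```python
import copy
from typing import Optional

# Hypothetical deployment defaults, chosen only for illustration.
DEPLOYMENT_DEFAULTS = {
    "session_id": "default",
    "lang": "en-US",
    "secondary_langs": [],
    "output_lang": "en-US",
}


def resolve_session(session: Optional[dict], defaults: Optional[dict] = None) -> dict:
    """Fill every omitted field from deployment defaults at consumption time."""
    base = DEPLOYMENT_DEFAULTS if defaults is None else defaults
    resolved = copy.deepcopy(base)
    # Carried fields win; anything omitted falls through to the defaults.
    resolved.update(copy.deepcopy(session or {}))
    return resolved


absent = resolve_session(None)
empty = resolve_session({})
explicit_default = resolve_session({"session_id": "default"})
partial = resolve_session({"session_id": "kitchen", "lang": "pt-PT"})
print(absent == empty == explicit_default)
print(partial["lang"], partial["output_lang"])
```

The three omission shapes resolve identically, and a partially filled session only overrides the fields it actually carries.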

4.4 Intent registration broadcast (INTENT-4)

Registrations are broadcast — already how OVOS works. Skills emit registration messages on the bus; plugins that care about a particular registration kind subscribe to the corresponding topic. There has never been a central routing party in OVOS; INTENT-4 just gives this existing model normative topic names. The legacy bus topics (padatious:register_intent, register_vocab, etc.) are renamed into the ovos.intent.* namespace — see §5.7 for the mapping. Migration is mostly a string replacement.
No "no plugin claimed" error. Following from the broadcast model: a registration that no plugin consumes is silently dropped. The producer gets no signal — the introspection topics (ovos.intent.list / ovos.intent.describe) are the supported way to verify what the orchestrator's passive index recorded.
The orchestrator passively indexes; it does not gate. The introspection topics serve from a passive registration index built by listening to broadcasts (this is new — current OVOS has no central index). The index reflects what skills declared, not what plugins actually match against — observability-only.
Skill self-identification on every emission (INTENT-4 §3.1). Every Message a skill emits or modifies in place carries Message.context["skill_id"]. Enforcement is structural on the dispatch path: the orchestrator stamps context.skill_id from the <skill_id>:<intent_name> dispatch topic prefix (PIPELINE-1 §7.1), and skill emissions via forward/reply inherit automatically.
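
A passive registration index of the kind described above might look like the following sketch. The payload shape (an `intent_name` key in `data`, `skill_id` in `context`) and the registration topic strings are assumptions made for illustration; INTENT-4 defines the real ones.

```python
from collections import defaultdict
from typing import Optional


class PassiveIntentIndex:
    """Records what skills declared; never accepts, rejects, or routes anything."""

    def __init__(self) -> None:
        self._by_skill: dict = defaultdict(dict)

    def on_registration(self, topic: str, message: dict) -> None:
        """Bus listener: remember the declaration, whatever plugins do with it."""
        skill_id = message["context"]["skill_id"]
        intent_name = message["data"]["intent_name"]  # hypothetical payload key
        self._by_skill[skill_id][intent_name] = {"kind": topic, "data": message["data"]}

    def list_intents(self) -> list:
        """Serve a list query from the recorded declarations."""
        return sorted(
            f"{skill_id}:{name}"
            for skill_id, intents in self._by_skill.items()
            for name in intents
        )

    def describe(self, skill_id: str, intent_name: str) -> Optional[dict]:
        """Serve a describe query; None when nothing was declared."""
        return self._by_skill.get(skill_id, {}).get(intent_name)


def skill_id_from_dispatch_topic(topic: str) -> str:
    """Derive the skill_id stamped on dispatch from '<skill_id>:<intent_name>'."""
    skill_id, _, _ = topic.partition(":")
    return skill_id


index = PassiveIntentIndex()
index.on_registration(
    "example.intent.register.template",  # placeholder topic name
    {"context": {"skill_id": "skill-weather"}, "data": {"intent_name": "forecast"}},
)
print(index.list_intents())
print(skill_id_from_dispatch_topic("skill-weather:forecast"))
```

The index reflects declarations only. Whether any plugin consumed the registration is invisible to it, which matches the observability-only role described above.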

4.5 Pipeline and lifecycle (PIPELINE-1)

The plugin model is already in place; PIPELINE-1 refines it (§3.2). The current orchestrator already loads plugins by id through OVOSPipelineFactory and iterates Session.pipeline. PIPELINE-1 tightens the contract rather than introducing the abstraction.
Orchestrator and plugin contracts live in one spec, since the orchestrator's job is iterating plugins and translating their matches into bus events. Splitting them would leave neither coherent.
The plugin contract is minimal: match(utterances, lang, session) → Match | None. It is side-effect-free during match; everything else (state, registrations, language-model calls, response generation) is a plugin-internal black box. The smaller the contract, the wider the set of plugins it accommodates.
lang parameter is propagation-only. The orchestrator passes lang through from Message.data.lang; it MUST NOT synthesize a value from session.lang or any per-utterance signal field when data.lang is absent. Absence is a faithful "unknown" signal; any fallback policy belongs to the consumer.
Tier conventions are out of scope. The current high / medium / low suffix is implementation strategy: from the bus, each tier is already a distinct pipeline_id in Session.pipeline. The current convention is compatible with PIPELINE-1 unchanged.
Skills and plugins are equivalent handler owners. The dispatch topic <skill_id>:<intent_name> is uniform: for a pure-matcher plugin the skill_id is the matched skill's id; for a plugin that bundles its own handler (e.g. a language-model persona) skill_id == pipeline_id. Both are addressed the same way.
Universal ovos.utterance.handled end-marker on every terminal path. One reserved invariant lets observers count turns, route fallbacks, and know "the assistant is idle now" without per-stage knowledge.
Three-stage composition (PIPELINE-1 §5.5) — preference (from session.pipeline or default-session pipeline) → availability (drop unloaded plugins) → policy (drop denylisted). Mirrors TRANSFORM-1 §5.3 exactly. The same shape supports the client-requests/layer-2-enforces split (§3.1).
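
The three-stage composition and the minimal plugin contract can be sketched together. The `Match` fields, the `KeywordPlugin` toy matcher, and the helper names are hypothetical; only the `match(utterances, lang, session)` shape, the `pipeline` / `blacklisted_pipelines` session fields, and the preference → availability → policy order come from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Match:
    skill_id: str
    intent_name: str
    slots: dict = field(default_factory=dict)


class KeywordPlugin:
    """Toy matcher: claims an utterance when its keyword appears as a word."""

    def __init__(self, skill_id: str, intent_name: str, keyword: str) -> None:
        self.skill_id, self.intent_name, self.keyword = skill_id, intent_name, keyword

    def match(self, utterances: list, lang: Optional[str], session: dict) -> Optional[Match]:
        # Side-effect-free: reads its inputs and returns a Match or None.
        if any(self.keyword in u.split() for u in utterances):
            return Match(self.skill_id, self.intent_name)
        return None


def compose_pipeline(session: dict, default_pipeline: list, loaded, deployment_denylist=()) -> list:
    preferred = session.get("pipeline") or default_pipeline  # 1. preference
    available = [p for p in preferred if p in loaded]         # 2. availability
    denied = set(deployment_denylist) | set(session.get("blacklisted_pipelines", []))
    return [p for p in available if p not in denied]          # 3. policy


def first_match(pipeline: list, plugins: dict, utterances: list, lang: Optional[str], session: dict):
    """Iterate in declared order; the first plugin to return a Match wins."""
    for pipeline_id in pipeline:
        match = plugins[pipeline_id].match(utterances, lang, session)
        if match is not None:
            return pipeline_id, match
    return None


plugins = {
    "stop_high": KeywordPlugin("skill-music", "stop", "stop"),
    "keyword_high": KeywordPlugin("skill-lights", "on", "light"),
}
session = {"pipeline": ["stop_high", "not_loaded", "keyword_high"]}
message_data = {"utterances": ["stop the light"]}  # no "lang" key present
pipeline = compose_pipeline(session, ["keyword_high"], plugins)
# lang is propagated as-is: None here, never synthesized from the session.
result = first_match(pipeline, plugins, message_data["utterances"], message_data.get("lang"), session)
print(pipeline, result)
```

The dispatch topic for the winning match would then be `f"{match.skill_id}:{match.intent_name}"`, the same shape whether the handler belongs to a skill or to the plugin itself.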

4.6 Intent context (CONTEXT-1)

Lifts intent context out of Adapt. The Adapt-specific add_context / remove_context mechanism, and the legacy mycroft.skill.set_cross_context / remove_cross_context fan-out for cross-skill use, are Adapt-only at the matcher level — Padatious and other engines ignore them. CONTEXT-1 generalizes the mechanism into a session-bound, decaying flat key/value store consumed by every intent engine uniformly via requires_context and excludes_context declarations.
Two explicit scopes encoded in the key shape. private (orchestrator auto-prefixes with <skill_id>:) and shared (flat, cross-skill). The current OVOS code models the same distinction informally (MycroftSkill.set_context auto-prefixes with alphanumeric_skill_id; set_cross_skill_context fans out via a bus event); CONTEXT-1 names the scopes explicitly and routes both through one bus surface.
Why private is the default. A skill that calls ovos.context.set without specifying scope gets a private entry. This optimises for the safer case: a cross-skill leak from an accidentally-shared entry is harder to debug than a cross-skill miss from an accidentally-private entry. The current Adapt set_context pattern is effectively skill-private; the default preserves migration fidelity. Cross-skill coordination is a conscious decision that deserves an explicit scope: "shared".
Prior art for the negative gate. Three in-tree intent engines under /plugins-pipeline/ — jurebes, nebulento, and palavreado — independently implement exclude_context as a first-class negative gate. CONTEXT-1's excludes_context adopts the same primitive at the spec level, addressing patterns ("fire once", "modal suppression") that positive gating alone cannot express.
Engine-side mutation as a sanctioned non-bus pathway. The Adapt pipeline plugin auto-injects matched entities into context inside match(), which conflicts with PIPELINE-1 §4.2's side-effect-free match rule. CONTEXT-1 §5.3 carves an explicit window between match-accept and dispatch-emit for engine-side session mutation, with the orchestrator (not the bus) carrying the write. This both legitimizes the established practice and resolves the PIPELINE-1 contradiction.
Eight-level lifecycle-position owner precedence (CONTEXT-1 §5.2). When a Message carries multiple component-identity keys (skill_id, pipeline_id, the six <type>_transformer_ids) from a derivation chain that crossed component boundaries, the orchestrator picks the owner by lifecycle position: the latest stage to run is the most specific.
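
The two scopes and the positive and negative gates can be sketched as a small store. This is an illustration under stated assumptions: declarations name unprefixed keys, a skill sees its own private entries plus all shared ones, shared keys contain no colon, and decay is omitted entirely. The class and method names are hypothetical.

```python
class SessionContext:
    """Flat key/value store; the scope is encoded in the key shape."""

    def __init__(self) -> None:
        self._entries: dict = {}

    def set(self, skill_id: str, key: str, value, scope: str = "private") -> str:
        if scope not in ("private", "shared"):
            raise ValueError(f"unknown scope: {scope}")
        # Private entries are auto-prefixed with '<skill_id>:'; shared stay flat.
        full_key = f"{skill_id}:{key}" if scope == "private" else key
        self._entries[full_key] = value
        return full_key

    def visible_keys(self, skill_id: str) -> set:
        """Keys a skill's intents can gate on: own private entries plus shared ones."""
        prefix = f"{skill_id}:"
        visible = set()
        for full_key in self._entries:
            if full_key.startswith(prefix):
                visible.add(full_key[len(prefix):])
            elif ":" not in full_key:
                visible.add(full_key)
        return visible

    def gate(self, skill_id: str, requires_context=(), excludes_context=()) -> bool:
        """An intent is eligible when all required keys are set and no excluded key is."""
        visible = self.visible_keys(skill_id)
        return all(k in visible for k in requires_context) and not any(
            k in visible for k in excludes_context
        )


ctx = SessionContext()
ctx.set("skill-timer", "awaiting_confirmation", True)      # private by default
ctx.set("skill-music", "media_playing", True, scope="shared")
print(ctx.gate("skill-timer", requires_context=["awaiting_confirmation"]))
print(ctx.gate("skill-news", requires_context=["awaiting_confirmation"]))
print(ctx.gate("skill-news", excludes_context=["media_playing"]))
```

The private default means the second gate fails: another skill cannot see the timer's entry unless the timer skill opts into `scope="shared"`.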

4.7 Transformer plugins (TRANSFORM-1)

Spec'd as an architectural pattern, not a feature list. An orchestrator MAY implement chains at any subset of six injection points (audio, utterance, metadata, intent, dialog, TTS); a null-implementation is conformant. For each chain it does implement, the per-type contract binds. Each injection point's existence is justified by what the lifecycle holds at that exact moment — what's possible there that isn't possible elsewhere.
Intent transformers as the system-typing home. INTENT-1 §5.3 defers slot value typing pending a text normalization specification. TRANSFORM-1 §3.4 is the spec'd injection home for typing: a deployer ships date / number / duration parsing once, and every skill receives typed values in Match.slots regardless of which engine matched. The OVOS analogue of ASK's AMAZON.DATE and Dialogflow's @sys.date-time, but as an injected enrichment rather than a built-in engine feature.
Concrete in-tree plugins as prior art. The plugins listed here live under /plugins-transformer/ today and cover four of the six injection points: utterance transformers (ovos-utterance-normalizer, ovos-utterance-corrections-plugin, ovos-transcription-validator-plugin, ovos-utterance-plugin-cancel, ovos-bidirectional-translation-plugin); dialog transformers (ovos-dialog-normalizer-plugin, ovos-bidirectional-translation-plugin, ovos-dialog-transformer-openai-plugin); audio transformers (ovos-audio-transformer-plugin-speechbrain-langdetect, ovos-audio-transformer-plugin-ggwave, ovos-audio-transformer-redis-publish); intent transformers (ovos-keyword-template-matcher, ovos-ahocorasick-ner-plugin). The bidirectional-translation plugin exercises the cross-chain coordination via Message.context that TRANSFORM-1 §7 formalizes.
Ascending priority. TRANSFORM-1 §4 specifies ascending priority (lower = earlier, default 50). Current OVOS sorts transformer chains descending (ovos_core/transformers.py:53,117,205, reverse=True); the spec aligns with the ascending convention already used by fallback skills (fallback_service.py:49, default 101 = run last) and the natural "stages count up" reading. Bringing current plugins into conformance only requires flipping relative priorities, not rewriting.
Cancellation aligned with prior plugin convention. Two existing utterance transformers (ovos-utterance-plugin-cancel, ovos-transcription-validator-plugin) already signal the lifecycle should abort by returning empty utterance lists with {canceled: true, cancel_word: <reason>} context keys. TRANSFORM-1 §8 keeps the convention, renaming cancel_word to cancel_reason (the structured concept the field encodes) and adding orchestrator-stamped cancel_by: <transformer_id>. The spec's ovos.utterance.cancelled terminal event sits alongside ovos.intent.unmatched, keeping cancellation and failure observably distinct on the bus.
lang parameter is bidirectional (TRANSFORM-1 §3.0). Four of the six per-type contracts (audio, utterance, dialog, TTS) take lang as input and return it as output. A bidirectional-translation transformer that takes Spanish in and produces English out returns the destination language; the orchestrator writes the chain's final lang back into Message.data.lang for downstream stages. Language-detector and clearing cases fall out of the same channel.
Per-type self-identification keys, list-valued. TRANSFORM-1 §1.3 claims six Message.context keys (one per transformer type) rather than a single generic key. Role matters: a Message may have been touched by multiple types in sequence, and a multi-type plugin (e.g., both utterance and dialog) would be ambiguous in a single-key model. Keys are lists because transformers chain — the full per-type chain is preserved in order.
Per-type denylists complete the policy surface. TRANSFORM-1 §5.2 claims six blacklisted_<type>_transformers session fields, paralleling the six <type>_transformers chain-ordering fields of §5.1 and the pipeline / blacklisted_pipelines pair of PIPELINE-1 §5. Three-stage composition (preference → availability → policy) in §5.3 mirrors PIPELINE-1 §5.5 exactly.
The per-type "explosion" is deliberate. Twelve flat session fields (six chain-orderings + six denylists) plus six Message.context attribution keys. A prefix-encoded single namespace would require prefix parsing at every lookup; the per-type partition matches the existing registry and chain-ordering structure. Under SESSION-1 §3.4's SHOULD-omit rule the common case carries zero of these on the wire.
Language signals live in SESSION-1. Language signals (stt_lang, request_lang, detected_lang, alongside lang, secondary_langs, output_lang) are session-scoped fields with normative meanings but a non-binding consolidation order — the right priority is stage-dependent. TRANSFORM-1 §7.1 names which transformer types are natural producers of which signals; consolidation is the consumer's decision per SESSION-1 §3.2.7.
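
Ascending priority, the bidirectional lang channel, and cancellation can be sketched for one chain type. The transformer call shape used here, `(utterances, lang, context)` in and `(utterances, lang, context_updates)` out, is an assumption for illustration, as are the class and function names; the context keys follow the names given above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

DEFAULT_PRIORITY = 50  # lower runs earlier


@dataclass
class UtteranceTransformer:
    transformer_id: str
    transform: Callable  # (utterances, lang, context) -> (utterances, lang, updates)
    priority: int = DEFAULT_PRIORITY


def run_utterance_chain(transformers: list, utterances: list, lang: Optional[str], context: dict):
    """Run the chain in ascending priority; stop at the first cancellation."""
    context = dict(context)
    ran = []
    for t in sorted(transformers, key=lambda t: t.priority):
        utterances, lang, updates = t.transform(utterances, lang, context)
        context.update(updates)
        ran.append(t.transformer_id)
        if context.get("canceled"):
            context["cancel_by"] = t.transformer_id  # stamped by the orchestrator
            break
    context["utterance_transformer_ids"] = ran  # list-valued, in execution order
    return utterances, lang, context


def lowercase(utterances, lang, context):
    return [u.lower() for u in utterances], lang, {}


def translate_to_english(utterances, lang, context):
    # Stand-in: a real plugin would translate the text as well.
    return utterances, "en-US", {}


def cancel_on_nevermind(utterances, lang, context):
    if any("nevermind" in u for u in utterances):
        return [], lang, {"canceled": True, "cancel_reason": "nevermind"}
    return utterances, lang, {}


chain = [
    UtteranceTransformer("translator", translate_to_english),
    UtteranceTransformer("normalizer", lowercase, priority=10),
    UtteranceTransformer("cancel", cancel_on_nevermind, priority=20),
]
normal = run_utterance_chain(chain, ["Hola Mundo"], "es-ES", {})
cancelled = run_utterance_chain(chain, ["Nevermind"], "es-ES", {})
print(normal)
print(cancelled)
```

In the first run the chain's final lang is the translator's output, which the orchestrator would write back to `Message.data.lang`. In the second, the chain stops at the cancelling transformer and later stages never run.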

4.8 Stop pipeline plugin (STOP-1)

The most common reader question on first encountering STOP-1 is why stop is a pipeline plugin and not a skill. Stop sounds like an ordinary intent: a user utterance ("stop", "cancel") matched and handled. A skill that registers a stop intent and implements a stop handler looks like the obvious shape. STOP-1 deliberately lifts stop into the pipeline layer instead, and the reasons are load-bearing: a skill cannot implement the cascade defined in STOP-1 §4 even in principle.

Pre-emption requires evaluation-layer ordering control, not handler-layer dispatch. Stop's defining property is that it pre-empts every other matching stage — active converse polls, response-mode delivery, ordinary intent matching. Pipeline plugins are evaluated in declared order with first-match-wins; STOP-1 §7 positions the stop plugin first so it gets the first opportunity to claim every utterance. A skill's intent handler runs only after intent matching has already selected it, by which point converse and intent matchers have already had their say. The escape-hatch property lives at the pipeline-iteration layer, not the handler layer; a skill is at the wrong layer to own it.

The cascade target is decided before dispatch. STOP-1 §4.1 consults session.active_handlers, performs the ping-pong filter, picks the most recently activated responder by activated_at, and emits a Match whose skill_id is the chosen target. The orchestrator then dispatches <target>:stop directly using its ordinary routing rules. A skill matching stop utterances would itself become the dispatch target, and would then have to re-emit synthetic dispatches at other skills — bypassing the orchestrator's routing model and losing the standard handler-lifecycle trio for the actual stop. Match-phase target selection is what reduces the cascade to a single clean PIPELINE-1 dispatch instead of two-step orchestration.

Match.updated_session carries the post-stop session state. STOP-1 §6.2 requires the stopped handler to be removed from active_handlers via Match.updated_session so the cleared state propagates through the rest of the utterance lifecycle. Skills have no Match to mutate; their handlers receive the dispatch session and may mutate it from within the handler boundary, but cannot communicate session changes that apply to the dispatch itself.
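
Match-phase target selection and the post-stop session can be sketched in one function. The shape of `session.active_handlers` (a list of entries with `skill_id` and `activated_at`) and the `can_stop` callback standing in for the ping-pong filter are assumptions for illustration; STOP-1 §4.1 and §6.2 define the real behaviour.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Match:
    skill_id: str
    intent_name: str
    updated_session: dict


def match_stop(session: dict, can_stop: Callable[[str], bool]) -> Optional[Match]:
    """Pick the most recently activated responder and return the post-stop session."""
    handlers = session.get("active_handlers", [])
    # Stand-in for the ping-pong filter: keep handlers that report they can stop.
    responders = [h for h in handlers if can_stop(h["skill_id"])]
    if not responders:
        return None  # what happens next is outside the scope of this sketch
    target = max(responders, key=lambda h: h["activated_at"])
    # Side-effect-free: the dispatch session is copied, never mutated in place.
    updated = copy.deepcopy(session)
    updated["active_handlers"] = [
        h for h in updated["active_handlers"] if h["skill_id"] != target["skill_id"]
    ]
    return Match(target["skill_id"], "stop", updated)


session = {
    "session_id": "s1",
    "active_handlers": [
        {"skill_id": "skill-timer", "activated_at": 10.0},
        {"skill_id": "skill-music", "activated_at": 25.0},
        {"skill_id": "skill-news", "activated_at": 40.0},
    ],
}
match = match_stop(session, can_stop=lambda skill_id: skill_id != "skill-news")
print(f"{match.skill_id}:{match.intent_name}")  # the single dispatch topic
print([h["skill_id"] for h in match.updated_session["active_handlers"]])
```

Because the target is already in `Match.skill_id`, the orchestrator needs only its ordinary `<skill_id>:<intent_name>` dispatch, and the cleared handler list travels with the match rather than through a second round of messages.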

The reserved-name authority lives at the spec / pipeline layer. STOP-1 §2 reserves stop across every OVOS-INTENT-4 registration in the deployment, enforced by the orchestrator's malformed-payload treatment of competing registrations. The authority to define what stop means globally — and to police skill-level attempts to claim the name — cannot live inside any single skill that itself uses the name.

Confidence-tier interleaving is a pipeline-ordering concern. STOP-1 §7 describes stop_high / stop_medium / stop_low interleaved with other pipeline plugins of comparable confidence. A skill has no analogous handle on inter-stage ordering; intent confidence is consumed by intent matchers, not by the outer pipeline that decides which matcher runs first.

The two layers cooperate by design. A skill MAY — and per STOP-1 §9 SHOULD — provide its own stop handler: every skill that participates in the cascade implements a stop intent handler subscribed to <own_skill_id>:stop. The pipeline plugin matches and selects; the skill stops. Stop is one of the few cases in the spec set where the pipeline / skill split is not substitutable.
