
rfcs: add RFC-18 link classification flex-algo #3288

Open
ben-malbeclabs wants to merge 2 commits into main from bc/link-classification-flex-algo

ben-malbeclabs commented Mar 16, 2026

Summary

RFC-18 introduces a link classification model for DoubleZero using IS-IS
Flexible Algorithm (flex-algo). DZF assigns named color labels to links
onchain; the controller translates these into IS-IS TE admin-groups and
flex-algo topology definitions on Arista EOS devices. BGP color extended
communities steer VPN unicast traffic onto constrained topologies, while
multicast continues to use all links via IS-IS algo 0.

What this RFC specifies:

  • LinkColorInfo onchain account — defines a color with auto-assigned
    admin-group bit (from a new AdminGroupBits ResourceExtension),
    flex-algo number, EOS color value, and include/exclude constraint
  • link_colors: Vec<Pubkey> on the Link account — assigns one or more
    colors to a link; controller renders all assigned colors as a single
    overwrite traffic-engineering administrative-group command
  • include_topology_colors: Vec<Pubkey> on the Tenant account — assigns
    specific topology colors to a tenant; defaults to color 1
    (UNICAST-DEFAULT) if unset
  • Controller features.yaml — gates flex-algo topology config, link
    tagging, and BGP color community stamping independently for staged rollout
  • Full revert: enabled: false removes all flex-algo config from all devices
  • Migration command for existing Vpn4v loopbacks to allocate
    flex_algo_node_segment_idx; controller blocks enablement until complete
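
As a rough illustration of the last bullet, the enablement gate could look something like the sketch below. The `Loopback` struct, the `LoopbackType` enum, the `u16` index type, and both function names are hypothetical stand-ins, not the controller's actual types. The sketch only shows the rule that `enabled: true` is refused while any Vpn4v loopback lacks a `flex_algo_node_segment_idx`.

```rust
/// Hypothetical, simplified loopback kinds; only Vpn4v matters for this check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoopbackType {
    Vpn4v,
    Other,
}

/// Hypothetical view of a loopback as the controller might see it.
/// The `u16` index type is an assumption made for this sketch.
#[derive(Debug, Clone)]
pub struct Loopback {
    pub name: String,
    pub loopback_type: LoopbackType,
    pub flex_algo_node_segment_idx: Option<u16>,
}

/// Names of Vpn4v loopbacks that have not been migrated yet.
pub fn unmigrated_loopbacks(loopbacks: &[Loopback]) -> Vec<&str> {
    loopbacks
        .iter()
        .filter(|l| {
            l.loopback_type == LoopbackType::Vpn4v && l.flex_algo_node_segment_idx.is_none()
        })
        .map(|l| l.name.as_str())
        .collect()
}

/// Refuse `enabled: true` until every Vpn4v loopback has an index.
/// With `enabled: false` the check passes unconditionally.
pub fn check_enablement(enabled: bool, loopbacks: &[Loopback]) -> Result<(), String> {
    if !enabled {
        return Ok(());
    }
    let missing = unmigrated_loopbacks(loopbacks);
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "flex-algo enablement blocked; loopbacks missing flex_algo_node_segment_idx: {}",
            missing.join(", ")
        ))
    }
}
```

Returning the list of offending loopbacks in the error is a design choice for this sketch: it tells the operator which devices the migration command still needs to cover.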

Introduces onchain link color model using IS-IS Flexible Algorithm
(RFC 9350) to separate VPN unicast and multicast forwarding topologies.
Defines LinkColorInfo PDA, link_color field on Link, FlexAlgo feature
flag, and controller changes for admin-group tagging, flex-algo
definitions, system-colored-tunnel-rib BGP resolution, and per-tunnel
color extended community stamping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DZF creates a `LinkColorInfo` PDA per color. It stores the color's name and auto-assigned routing parameters. The program MUST auto-assign the next available admin-group bit (starting at 0) and the corresponding flex-algo number and EOS color value using the formula:

```
admin_group_bit = next available bit in 0–127
```

Review comment:

We would need to define a tracking mechanism for admin-group bits, such as a persistent bitmap or counter on GlobalState. Otherwise we'd have to scan all existing LinkColorInfo accounts at instruction time.


The program MUST validate `admin_group_bit <= 127` on `create` and MUST return an explicit error if all 128 slots are exhausted. This is a hard constraint: EOS supports bits 0–127 only, and `128 + 127 = 255` is the maximum representable value in `flex_algo_number: u8`.
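
A minimal sketch of this validation follows. It assumes the flex-algo number is derived as `128 + admin_group_bit`, which is inferred from the `128 + 127 = 255` bound stated above. The constant names, the `ColorError` type, and the function name are hypothetical.

```rust
/// Highest admin-group bit EOS supports, per the constraint above.
pub const MAX_ADMIN_GROUP_BIT: u8 = 127;
/// Assumed base for the derived flex-algo number (128 + 127 = 255).
pub const FLEX_ALGO_BASE: u8 = 128;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ColorError {
    AdminGroupBitOutOfRange(u8),
}

/// Derive the flex-algo number for an admin-group bit,
/// rejecting any bit above 127 with an explicit error.
pub fn flex_algo_number(admin_group_bit: u8) -> Result<u8, ColorError> {
    if admin_group_bit > MAX_ADMIN_GROUP_BIT {
        return Err(ColorError::AdminGroupBitOutOfRange(admin_group_bit));
    }
    // Cannot overflow: the largest accepted input gives 128 + 127 = 255 = u8::MAX.
    Ok(FLEX_ALGO_BASE + admin_group_bit)
}
```

Checking the range before the addition means the `u8` arithmetic never needs wrapping or saturating behavior.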

Admin-group bits from deleted colors MUST NOT be reused. Color deletion is not supported in this RFC, so this constraint applies to any future deletion implementation: reusing a bit before all devices have had their config updated would cause those devices to apply the new color's constraints to interfaces still carrying the old bit's admin-group. At current scale (128 available slots), exhaustion is not a practical concern.

Review comment:

Once delete removes the PDA (as in line #160), we cannot enforce the no-reuse requirement without a persistent record of previously allocated bits that survives PDA deletion.
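
One possible shape for such a persistent record is sketched below, assuming a 128-bit bitmap stored outside the per-color PDA. `AdminGroupBitmap`, `AllocError`, and the method names are hypothetical and are not the RFC's actual `AdminGroupBits` ResourceExtension layout.

```rust
/// Hypothetical record of every admin-group bit ever handed out.
/// Bit `i` is set once bit `i` has been assigned and is never cleared,
/// so the record survives deletion of the color's PDA.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct AdminGroupBitmap {
    ever_allocated: u128,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum AllocError {
    Exhausted,
}

impl AdminGroupBitmap {
    /// Return the lowest bit that has never been allocated and mark it used.
    pub fn allocate(&mut self) -> Result<u8, AllocError> {
        // trailing_ones() counts consecutive set bits starting at bit 0,
        // which is the index of the lowest clear bit (128 if all are set).
        let bit = self.ever_allocated.trailing_ones();
        if bit >= 128 {
            return Err(AllocError::Exhausted);
        }
        self.ever_allocated |= 1u128 << bit;
        Ok(bit as u8)
    }

    /// Called when a color is deleted. Intentionally a no-op:
    /// the bit stays marked so it can never be handed out again.
    pub fn release(&mut self, _bit: u8) {}
}
```

Because bits are never cleared, the bitmap behaves like a monotonic counter and allocation needs no scan of existing `LinkColorInfo` accounts. A plain `next_bit: u8` counter would be an equally valid choice under the same no-reuse rule.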

**Scope:**
- Delivers traffic-class-level segregation: multicast vs. VPN unicast at the network level
- All unicast tenants share a single constrained topology today — the architecture is forward-compatible with per-tenant path differentiation without rework
- Per-tenant steering (directing one tenant to a different constrained topology) requires adding a `topology_color` field to the `Tenant` account — deferred to a future RFC that builds on the link color model defined here

Review comment:

If in the future we want to allow a tenant to have their traffic avoid a link color, then topology_color should maybe be called include_topology_colors, and then in the future we could add exclude_topology_colors. Note the plural since we should make these vectors in case we want to allow multiple colors in the future.

#[derive(BorshSerialize, BorshDeserialize, Debug)]
pub struct LinkColorInfo {
    pub name: String,        // e.g. "unicast-default"
    pub admin_group_bit: u8, // auto-assigned, 0–127

Review comment:

auto-assigned from global ResourceExtension "AdminGroupBits"

…ulti-color, cleanup

- Replace onchain feature flag with controller features.yaml config file
- Add LinkColorInfo account with AdminGroupBits ResourceExtension for
  persistent bit allocation; bits never reused after deletion
- Change link_color: Pubkey to link_colors: Vec<Pubkey> (cap 8)
- Add include_topology_colors: Vec<Pubkey> on Tenant for per-tenant
  color assignment; defaults to UNICAST-DEFAULT (color 1)
- Redesign interface admin-group cleanup: overwrite remaining colors
  on deletion rather than targeted named no command
- Add full revert: enabled: false removes all flex-algo config
- Pin UNICAST-DEFAULT as protocol invariant (bit 0, first color created)
- Add controller startup check blocking enabled: true if any Vpn4v
  loopback has unset flex_algo_node_segment_idx
- Clarify clear sweep atomicity and idempotency
- Address all PR review comments (nikw9944, vihu, elitegreg)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines +448 to +451
{{- range $.LinkColors }}
{{- if .FlexAlgoNodeSegmentIdx }}
node-segment ipv4 index {{ .FlexAlgoNodeSegmentIdx }} flex-algo {{ .Name }}
{{- end }}

Review comment:

This doesn't make sense to me. Why are we creating a node segment index per link color? Shouldn't this be based on the topology color of the tenant?

no neighbor {{ . }}
{{- end }}
{{- if and $.Config.FlexAlgo.Enabled .LinkColors }}
next-hop resolution ribs tunnel-rib colored system-colored-tunnel-rib system-connected

Review comment:

I think this is wrong. If resolution via system-colored-tunnel-rib fails (i.e. color communities are disabled on a device), routes will be resolved via system-connected, which covers only directly connected interfaces, not the default IS-IS SR tunnels.


`include_topology_colors` MUST only be set by foundation keys. This is a routing policy decision — contributors MUST NOT be able to steer their own traffic onto a different topology by modifying this field.

When a tenant has one entry in `include_topology_colors`, the controller resolves the `LinkColorInfo` PDA and stamps its EOS color value on inbound routes for that tenant. When a tenant has multiple entries, the controller stamps all corresponding color values — EOS then selects the best available colored tunnel by IGP metric (lowest metric wins; highest color number breaks ties). This enables a fallback chain: if the preferred topology's tunnel becomes unavailable, EOS automatically falls back to the next-best color on the same prefix without the route going unresolved. This behavior has been verified in lab testing.
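
The selection rule described in this paragraph can be modelled as follows. This is only a model of the behavior as stated in the text (lowest IGP metric wins, highest color number breaks ties), not of EOS internals. `ColoredTunnel` and `select_tunnel` are hypothetical names.

```rust
use std::cmp::Reverse;

/// Hypothetical view of one colored tunnel towards a next hop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ColoredTunnel {
    pub color: u32,
    pub igp_metric: u32,
}

/// Pick a tunnel for a route stamped with `stamped_colors`.
/// Only tunnels whose color is stamped on the route are candidates;
/// the lowest IGP metric wins and the highest color number breaks ties.
pub fn select_tunnel(
    stamped_colors: &[u32],
    available: &[ColoredTunnel],
) -> Option<ColoredTunnel> {
    available
        .iter()
        .copied()
        .filter(|t| stamped_colors.contains(&t.color))
        // Ascending metric first, then descending color via Reverse.
        .min_by_key(|t| (t.igp_metric, Reverse(t.color)))
}
```

If the preferred color's tunnel disappears from `available`, the same call returns the next-best stamped color, which is the fallback chain the paragraph describes. It returns `None` only when no stamped color has a tunnel.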

Review comment:

I don't understand the use case for this. Isn't fallback handled via the resolution ribs?

- `--link-color default` sets `link_colors` to an empty vector, removing any color assignment.
- `doublezero link get` and `doublezero link list` MUST include `link_colors` in their output, showing the resolved color names (or "default").
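
A small sketch of the two CLI rules above follows. It works on color names rather than the `Pubkey` values stored in `link_colors`. The comma-separated input and output format and both function names are assumptions for illustration, not the actual `doublezero` CLI implementation.

```rust
/// Parse a `--link-color` argument into a list of color names.
/// The literal "default" maps to an empty list, i.e. no color assignment.
/// Accepting several comma-separated names is an assumption of this sketch.
pub fn parse_link_color_arg(arg: &str) -> Vec<String> {
    if arg == "default" {
        return Vec::new();
    }
    arg.split(',').map(|s| s.trim().to_string()).collect()
}

/// Render resolved color names for `link get` / `link list` output.
/// An empty list is shown as "default".
pub fn display_link_colors(resolved_names: &[&str]) -> String {
    if resolved_names.is_empty() {
        "default".to_string()
    } else {
        resolved_names.join(", ")
    }
}
```

Treating "default" as the empty vector on input and rendering the empty vector as "default" on output keeps the two commands round-trip consistent.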

#### Tenant topology color assignment

Review comment:

nit: not sure we should call this color when it's really a topology type. It makes it hard to understand what the color is used for as opposed to calling it what it is used for.

administrative-group alias UNICAST-DEFAULT group 0
flex-algo
flex-algo 128 unicast-default
administrative-group include any 0

Review comment:

I would suggest having a common excluded color here for link draining. For example, if red is excluded here, we would be able to drain topology-specific traffic off a link by applying a red affinity value to it.
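
To make the draining idea concrete, here is a sketch of how include-any and exclude-any constraints decide whether a link joins a flex-algo topology, following the general rule in RFC 9350 that exclusion is checked first. `FlexAlgoDefinition`, `admits`, and the bit assigned to the drain color in the usage note are hypothetical.

```rust
/// Hypothetical flex-algo constraint set, with admin-groups as bitmasks
/// (bit i set means admin-group bit i is referenced).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FlexAlgoDefinition {
    pub include_any: u128,
    pub exclude_any: u128,
}

impl FlexAlgoDefinition {
    /// A link joins the topology if it carries none of the excluded groups
    /// and, when an include-any set is configured, at least one of its groups.
    pub fn admits(&self, link_admin_groups: u128) -> bool {
        let excluded = (link_admin_groups & self.exclude_any) != 0;
        let included =
            self.include_any == 0 || (link_admin_groups & self.include_any) != 0;
        !excluded && included
    }
}
```

With UNICAST-DEFAULT on bit 0 and a drain color on, say, bit 1 in `exclude_any`, tagging a link with both bits removes it from the unicast topology without touching its UNICAST-DEFAULT tag.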

5 participants