82 changes: 82 additions & 0 deletions community/04-Proposals/MEP20/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
---
slug: /MEP-20-full-layer-3-dataplane
title: MEP-20
sidebar_position: 20
---

# Full Layer 3

:::info
This document is work in progress.
:::

When we started with metal-stack, we decided to go full layer-3 for the dataplane for workloads. But the inventarization and installation process is done in a layer-2 segment with a traditional DHCP/TFTP/PXE approach.

Suggested change
When we started with metal-stack, we decided to go full layer-3 for the dataplane for workloads. But the inventarization and installation process is done in a layer-2 segment with a traditional DHCP/TFTP/PXE approach.
When we started with metal-stack, we decided to go full layer-3 for the dataplane for workloads. But the inventorization and installation process is done in a layer-2 segment with a traditional DHCP/TFTP/PXE approach.


This works well, does not require manual configuration steps on any of the components in the datacenter. New servers just need to be turned on and get the metal-hammer bootet via DHCP/TFTP/PXE and get registered and are ready to use.

Suggested change
This works well, does not require manual configuration steps on any of the components in the datacenter. New servers just need to be turned on and get the metal-hammer bootet via DHCP/TFTP/PXE and get registered and are ready to use.
This works well, does not require manual configuration steps on any of the components in the datacenter. New servers just need to be turned on and get the metal-hammer booted via DHCP/TFTP/PXE and get registered and are ready to use.


But this approach has downsides, most notably:

- There are 2 different network topologies (L2 and L3) in the dataplane.
- The switch port of a machine must be reconfigured between these two modes whenever a machine changes from registered to installed and back.
- The DHCP and TFTP servers are deployed in the management network of a partition. Connecting these services to an L2 segment on the leaf switches mixes control-plane (management) and dataplane traffic, which is not ideal from a security perspective.

We were searching for a solution that offers the same convenient and fast experience, but entirely within layer-3.

## Requirements

The following requirements must be fulfilled by an L3 replacement solution:

- Clear separation of control-plane (management) and dataplane traffic
- Same "no-touch" experience for new servers
- Configurability of metal-hammer version per partition in real time
- Token-based authentication of the metal-hammer against the metal-apiserver
- Cache of metal-images accessible from metal-hammer inside a partition
- Preserve all existing metal-hammer discovery, hardware detection, and provisioning logic
- Keep the network secure when a machine reclaim goes wrong, using ACLs on the switch that allow communication only to the control-plane and `metal-boot`
- Optional: make `metal-boot` a proxy to metal-apiserver to support IPv4-only control-plane deployments.
- TODO more
Comment thread
majst01 marked this conversation as resolved.

## High level Architecture

The main idea is based on three concepts.

- The boot-from-ISO feature of the server BMC firmware, which can be configured remotely via Redfish.
- Enable automated IPv6 address acquisition via SLAAC (RFC 4862) driven by Router Advertisements (RFC 4861) instead of DHCP
- IPv6 in a dedicated Boot VRF instead of a Boot VLAN.
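
To illustrate the second concept: with SLAAC, a host combines the /64 prefix announced in the Router Advertisement with an interface identifier it derives itself. The following is a minimal sketch of the classic modified EUI-64 derivation from a MAC address. The helper name, the sample MAC and the documentation prefix are assumptions for illustration only, and a given host may use a different identifier scheme (for example stable opaque identifiers as in RFC 7217).

```shell
#!/bin/sh
# Sketch: derive the modified EUI-64 interface identifier from a MAC address.
# eui64_iid is a hypothetical helper, not part of metal-stack.
eui64_iid() {
    # Normalize: strip colons, lower-case hex digits.
    mac=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    b1=$(printf '%s' "$mac" | cut -c1-2)
    # Flip the universal/local bit (0x02) of the first octet.
    b1=$(printf '%02x' $(( 0x$b1 ^ 0x02 )))
    # Insert ff:fe between the third and fourth octet of the MAC.
    printf '%s%s:%sff:fe%s:%s\n' \
        "$b1" \
        "$(printf '%s' "$mac" | cut -c3-4)" \
        "$(printf '%s' "$mac" | cut -c5-6)" \
        "$(printf '%s' "$mac" | cut -c7-8)" \
        "$(printf '%s' "$mac" | cut -c9-12)"
}

# Example with the documentation prefix 2001:db8:0:1::/64 (assumed):
# echo "2001:db8:0:1:$(eui64_iid 00:1a:2b:3c:4d:5e)"
```

Knowing how the address is derived could help when ACLs or address spaces per switch port need to be planned, but whether predictable addresses are desired at all is an open design question.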

This approach requires that metal-apiserver, metal-hammer, iPXE and a new component (`metal-boot` for now), which runs in the partition and is connected to the boot-vrf, are IPv6-ready.

The L3 only boot and registration process can be described as follows:

- The metal-bmc scans every server on a regular basis to check whether iPXE is configured as the boot ISO payload. This is an additional task for the metal-bmc, which already scans all servers regularly to gather power metrics etc.
- If the boot iso is set to ipxe, the boot order must be set to CDROM instead of PXE from network and a reboot must be triggered (migration to this approach, not when a machine is allocated).

Suggested change
- If the boot iso is set to ipxe, the boot order must be set to CDROM instead of PXE from network and a reboot must be triggered (migration to this approach, not when a machine is allocated).
- If the boot iso is set to ipxe, the boot source override must be set to CDROM instead of PXE from network and a reboot must be triggered (migration to this approach, not when a machine is allocated).

We already use overrides for PXE, so just a technicality.

- Once the server is powered on, iPXE is booted from the CDROM presented by the firmware.
- The production interfaces will then get a routable IPv6 address from the switch, which is configured to enable SLAAC and router advertisements. The configured routes must enable the machine to reach the metal-apiserver in the control plane and `metal-boot` in the partition.
- The iPXE ISO must contain a boot configuration which chain-loads a secondary boot configuration from a known location. To speed up the iPXE startup, the boot.ipxe should disable IPv4 completely, as otherwise iPXE will try DHCP first.
Sample:
```ipxe
#!ipxe
chain https://v2.metal-stack.dev/<partition>/boot.ipxe || shell
```
The secondary boot.ipxe will then contain the same payload as currently delivered by pixiecore. In particular, this contains the configured Linux kernel, metal-hammer version, command line and the URL of the boot-helper in the boot vrf.
- With this, iPXE will boot into metal-hammer, which will first contact the boot-helper on the given URL to get a token for accessing the metal-apiserver.
- The metal-image-cache-sync address is also reachable in the boot vrf and works as before.
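
The Redfish part of the steps above (set the boot source override to CD, then reboot) can be sketched as follows. This is only an illustration under assumptions: the helper names, the system id `1`, and the `BMC_HOST`, `BMC_USER` and `BMC_PASSWORD` variables are placeholders, and the exact resource paths differ between BMC vendors. The property names `BootSourceOverrideTarget`, `BootSourceOverrideEnabled` and the `ComputerSystem.Reset` action come from the Redfish ComputerSystem schema.

```shell
#!/bin/sh
# Sketch only: Redfish request bodies for booting a server from the (virtual) CD.
# The helpers below are hypothetical and just print JSON.

# Body for PATCH /redfish/v1/Systems/<id>; $1 is "Once" or "Continuous".
boot_override_body() {
    printf '{"Boot":{"BootSourceOverrideEnabled":"%s","BootSourceOverrideTarget":"Cd"}}\n' "$1"
}

# Body for POST /redfish/v1/Systems/<id>/Actions/ComputerSystem.Reset;
# $1 is a ResetType such as "ForceRestart".
reset_body() {
    printf '{"ResetType":"%s"}\n' "$1"
}

# Usage against a real BMC (placeholders, not executed here):
# curl -sk -u "$BMC_USER:$BMC_PASSWORD" -X PATCH \
#     -H 'Content-Type: application/json' \
#     -d "$(boot_override_body Once)" \
#     "https://$BMC_HOST/redfish/v1/Systems/1"
# curl -sk -u "$BMC_USER:$BMC_PASSWORD" -X POST \
#     -H 'Content-Type: application/json' \
#     -d "$(reset_body ForceRestart)" \
#     "https://$BMC_HOST/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"
```

Whether the override should be set once per boot or continuously is not decided in this proposal, which is why the sketch takes it as a parameter.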

![Logical View](./layer-3-logical.drawio.svg)

![Sequence Diagram](./layer-3-sequence.drawio.svg)

## Implementation

Before we start with the implementation, or decide whether this is the right approach, we should verify that the current draft at least works as expected.

This must be done in several steps:

- [ ] Ensure iPXE can be packed as an ISO image stored in the firmware, booted with DHCP disabled, and get an IP address with routes from a SLAAC-enabled switch.
- [ ] The initial boot.ipxe contains an instruction to pull a secondary boot.ipxe which contains kernel, image and cmdline, and iPXE chain-boots this.
- [ ] Can iPXE resolve hostnames to IPv6 addresses?
- [ ] Specify how the boot vrf must be configured on the SONiC side.
- [ ] Specify how the metal-hammer kernel must be configured to accept router advertisements.
- [ ] How do we configure the boot vrf on the switch, e.g. which address space will be set per port, and is it stored in the metal-apiserver and configured by metal-core?
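
As a starting point for the router advertisement item in the list above, the relevant Linux runtime settings are the per-interface `accept_ra` and `autoconf` sysctls. The sketch below only prints the settings; the helper name and the interface name `eth0` are assumptions, and whether the metal-hammer kernel needs further build-time options is still to be specified. Note that on an interface with IPv6 forwarding enabled, `accept_ra` has to be `2` instead of `1` for advertisements to be accepted.

```shell
#!/bin/sh
# Sketch: print the sysctl settings that let a Linux interface accept
# Router Advertisements and autoconfigure an address via SLAAC.
# ra_sysctls is a hypothetical helper, not part of metal-hammer.
ra_sysctls() {
    iface=$1
    printf 'net.ipv6.conf.%s.accept_ra=1\n' "$iface"
    printf 'net.ipv6.conf.%s.autoconf=1\n' "$iface"
}

# Usage on a running system (requires root, not executed here):
# ra_sysctls eth0 | while read -r kv; do sysctl -w "$kv"; done
```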

After all these tasks are done, we can proceed and write a more detailed implementation roadmap and requirements with changes in the api and apiserver or other microservices.
33 changes: 33 additions & 0 deletions community/04-Proposals/MEP20/ai/ipxe-as-boot-iso.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
Approach: Convert ipxe.efi to ISO (simpler)

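# Download prebuilt ipxe.efi (x86_64 build assumed)
wget https://boot.ipxe.org/ipxe.efi

# Create a small FAT image holding ipxe.efi at the default UEFI boot path
# (requires dosfstools and mtools)
mkdir -p iso_root
dd if=/dev/zero of=iso_root/efiboot.img bs=1M count=4
mkfs.vfat iso_root/efiboot.img
mmd -i iso_root/efiboot.img ::/EFI ::/EFI/BOOT
mcopy -i iso_root/efiboot.img ipxe.efi ::/EFI/BOOT/BOOTX64.EFI

# Create ISO with the FAT image as the EFI El Torito boot image
xorriso -as mkisofs -o ipxe.iso \
-V "iPXE" \
-e efiboot.img -no-emul-boot \
iso_root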

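# Or ship a boot.ipxe next to ipxe.efi inside the EFI boot image
# (a prebuilt ipxe.efi is not expected to run this file by itself; it still has to be chained)
cat > boot.ipxe << 'EOF'
#!ipxe
kernel http://<server>/vmlinuz ip=dhcp root=/dev/ram0 ramdisk_size=...
initrd http://<server>/initrd.img
boot
EOF

mkdir -p iso_root
dd if=/dev/zero of=iso_root/efiboot.img bs=1M count=4
mkfs.vfat iso_root/efiboot.img
mmd -i iso_root/efiboot.img ::/EFI ::/EFI/BOOT
mcopy -i iso_root/efiboot.img ipxe.efi ::/EFI/BOOT/BOOTX64.EFI
mcopy -i iso_root/efiboot.img boot.ipxe ::/boot.ipxe

xorriso -as mkisofs -o ipxe.iso -V "iPXE" \
-e efiboot.img -no-emul-boot iso_root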

Key point: The boot.ipxe file can be included in the ISO or hosted on an HTTP server alongside ipxe.efi. It is iPXE itself, not the EFI firmware, that loads the script, for example from an embedded script or from the iPXE shell:

#!ipxe
chain http://<server>/boot.ipxe || shell
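
A first-stage script like the one above could be generated per partition. The sketch below is an assumption-laden illustration: the helper name and its parameters are made up, and the `ifconf --configurator ipv6` line is one possible way to configure the interface via IPv6 only instead of trying DHCP first, which should be verified against the iPXE build in use.

```shell
#!/bin/sh
# Sketch: render a first-stage iPXE script for a given base URL and partition.
# render_first_stage is a hypothetical helper; base URL and partition are inputs.
render_first_stage() {
    base_url=$1
    partition=$2
    cat <<EOF
#!ipxe
ifconf --configurator ipv6 || shell
chain ${base_url}/${partition}/boot.ipxe || shell
EOF
}

# Usage (partition name is a placeholder):
# render_first_stage https://v2.metal-stack.dev my-partition > boot.ipxe
```

For the script to run without manual interaction, it would have to be embedded into ipxe.efi at build time or otherwise be picked up by iPXE on startup; how this is done is left open here.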