diff --git a/pip/pip-485.md b/pip/pip-485.md new file mode 100644 index 0000000000000..85752d919b31a --- /dev/null +++ b/pip/pip-485.md @@ -0,0 +1,220 @@ +# PIP-485: Configurable mTLS principal mapping (SAN sources and DN mapping rules) + +# Background knowledge + +When a Pulsar broker authenticates a client, it asks an `AuthenticationProvider` (in `pulsar-broker-common`) a simple question: "who is this client?". The provider looks at the connection and returns one `String` - the client's **role** (also called the principal). Pulsar then uses that role for two things: the `AuthorizationProvider` decides what the role is allowed to do, and the role is compared against the `superUserRoles` and `proxyRoles` lists in the config. So the role string is the link between "who you are" and "what you can do." + +With mTLS (mutual TLS), the client proves who it is by presenting an X.509 certificate during the TLS handshake. A certificate carries identity in a couple of places: + +- The **Subject**, written as a **Distinguished Name (DN)**, for example `CN=alice,OU=payments,O=Acme,C=US`. The DN is a list of small pieces called **RDNs**. Each RDN is one attribute: **CN** (Common Name), **OU** (Organizational Unit), **O** (Organization), **C** (Country), `emailAddress`, and so on. +- The **Subject Alternative Names (SANs)**, a separate part of the certificate that holds *typed* identities: a DNS name, an email address, a URI, or an IP address. + +These days the real identity often lives in a SAN, not in the CN. In service meshes and workload-identity systems like **SPIFFE/SPIRE**, the identity is a URI SAN such as `spiffe://acme.com/ns/payments/sa/checkout`, and the CN is usually empty. [RFC 9525] (the current standard, which replaced [RFC 6125] in 2023) actually says you should stop using the CN for identity and use SANs instead. 
(For the curious: the DN text format that Java's `X500Principal.getName()` emits is [RFC 2253] - note RFC 2253 was later obsoleted by [RFC 4514] as the general standard, but the Java API still defines this method's output in RFC 2253 terms. The SAN types are listed in [RFC 5280] §4.2.1.6 - email is type 1, DNS is type 2, URI is type 6.) + +For comparison, Apache Kafka lets operators control this with a config option called `ssl.principal.mapping.rules` ([KIP-371]). It's an ordered list of rules, each written as `RULE:pattern/replacement/[LU]`, applied to the certificate's DN. A special `DEFAULT` rule means "just use the whole DN." This lets people map their existing certificates to Kafka principals with a few lines of config instead of re-issuing certificates. Kafka also has a programmatic `KafkaPrincipalBuilder` SPI ([KIP-189]) for people who want to write code instead. This PIP copies the *config* idea from KIP-371, not the SPI from KIP-189 - Pulsar already has its own SPI (see *Alternatives*). 
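To make the rule semantics described above concrete, here is a minimal sketch of how an ordered list of `RULE` / `DEFAULT` entries could be evaluated. It is an illustration only: the class and method names (`MappingRuleSketch`, `Rule`, `map`) are hypothetical, the rules are passed in already parsed rather than read from the textual `RULE:pattern/replacement/[LU]` form, and the sketch assumes a rule applies only when its regex matches the whole value.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical names, not Kafka's or Pulsar's actual classes. */
final class MappingRuleSketch {

    /** One rule in already-parsed form; a null pattern stands for DEFAULT. */
    static final class Rule {
        final Pattern pattern;     // null means DEFAULT
        final String replacement;  // may reference captured groups as $1, $2, ...
        final char caseFlag;       // 'L', 'U', or ' ' to leave the case alone

        Rule(String regex, String replacement, char caseFlag) {
            this.pattern = regex == null ? null : Pattern.compile(regex);
            this.replacement = replacement;
            this.caseFlag = caseFlag;
        }

        static Rule defaultRule() {
            return new Rule(null, null, ' ');
        }

        Optional<String> apply(String identity) {
            if (pattern == null) {
                return Optional.of(identity);          // DEFAULT: use the value as-is
            }
            Matcher m = pattern.matcher(identity);
            if (!m.matches()) {                        // assumption: whole value must match
                return Optional.empty();
            }
            StringBuilder out = new StringBuilder();
            m.appendReplacement(out, replacement);     // expands $1, $2, ... from the match
            String role = out.toString();
            if (caseFlag == 'L') {
                role = role.toLowerCase(Locale.ROOT);
            } else if (caseFlag == 'U') {
                role = role.toUpperCase(Locale.ROOT);
            }
            return Optional.of(role);
        }
    }

    /** Rules run in order; the first one that matches decides the result. */
    static Optional<String> map(List<Rule> rules, String identity) {
        for (Rule rule : rules) {
            Optional<String> role = rule.apply(identity);
            if (role.isPresent()) {
                return role;
            }
        }
        return Optional.empty();                       // no rule matched and no DEFAULT
    }
}
```

With a rule list of `("^CN=([^,]+),OU=([^,]+),.*$", "$1@$2", 'U')` followed by `DEFAULT`, the DN `CN=alice,OU=payments,O=Acme,C=US` would map to `ALICE@PAYMENTS`, while a DN that does not match the first rule would fall through to `DEFAULT` and be returned unchanged.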
+ +# Motivation + +Pulsar's built-in TLS provider, `AuthenticationProviderTls`, only ever looks at the **Common Name**, and the way it reads it is hardcoded (`pulsar-broker-common/.../authentication/AuthenticationProviderTls.java`, +`authenticate()`): + +```java +public String authenticate(AuthenticationDataSource authData) throws AuthenticationException { + String commonName = null; + ErrorCode errorCode = ErrorCode.UNKNOWN; + try { + if (authData.hasDataFromTls()) { + // javadoc + Certificate[] certs = authData.getTlsCertificates(); + if (null == certs) { + errorCode = ErrorCode.INVALID_CERTS; + throw new AuthenticationException("Failed to get TLS certificates from client"); + } + String distinguishedName = ((X509Certificate) certs[0]).getSubjectX500Principal().getName(); + for (String keyValueStr : distinguishedName.split(",")) { + String[] keyValue = keyValueStr.split("=", 2); + if (keyValue.length == 2 && "CN".equals(keyValue[0]) && !keyValue[1].isEmpty()) { + commonName = keyValue[1]; + break; + } + } + } + + if (commonName == null) { + errorCode = ErrorCode.INVALID_CN; + throw new AuthenticationException("Client unable to authenticate with TLS certificate"); + } + authenticationMetrics.recordSuccess(); + } catch (AuthenticationException exception) { + incrementFailureMetric(errorCode); + throw exception; + } + return commonName; +} +``` + +That "first CN wins, otherwise fail" logic is a problem for two kinds of users that are becoming more common: + +1. **Companies with an existing PKI**, where the identity they care about is in some other field - an `OU`, an `emailAddress`, or the full DN. They often can't just re-issue thousands of certificates to move that value into the CN. +2. **SPIFFE/SPIRE and service meshes**, where the identity is a URI SAN (`spiffe://...`) and the CN is empty on purpose. Right now those certificates are rejected outright by the `commonName == null` check, even though they carry a perfectly valid, cryptographically-signed identity. 
+ +You *can* work around this today by writing your own `AuthenticationProvider`, since the SPI lets you return any role string you want. But writing and maintaining a custom provider is a lot of work for something Kafka gives you in a few lines of config. The built-in provider - which is what almost everyone actually uses - should be able to read identity from a SAN and apply simple mapping rules. + +# Goals + +## In Scope + +- Let operators pick **which part of the certificate** holds the identity: the CN (today's behavior), the full DN, or a specific SAN type (URI / DNS / email). +- Let operators apply an ordered list of **mapping rules** to turn that raw value into the final Pulsar role, using the same rule format as Kafka so existing knowledge carries over. +- **Keep today's behavior as the default.** If you don't set anything new, nothing changes. +- Make it work the same way everywhere the provider is used: the binary TLS path, the HTTPS client-cert path, and the **Pulsar Proxy**. + +## Out of Scope + +- A general `PrincipalBuilder` SPI like Kafka's. Pulsar already has the `AuthenticationProvider` SPI for full customization; this PIP just improves the built-in provider. +- Anything about TLS trust, certificate chains, revocation (CRL/OCSP), or ciphers. By the time `authenticate()` runs, the TLS layer has already validated the certificate. We only change how the identity is *read* from it. +- Returning more than one role / groups from a certificate. The provider still returns a single role. + +# High Level Design + +We add two config options that `AuthenticationProviderTls` reads: + +1. **`tlsCertIdentitySource`** - picks the raw identity from the certificate. One of `CN` (default), `DN`, or `SAN:<type>` where `<type>` is `URI`, `DNS`, or `EMAIL`. +2. **`tlsAuthPrincipalMappingRules`** - an ordered list of rules that transform that raw value into the final role. 
The format matches Kafka's ([KIP-371]): each rule is either `DEFAULT` (use the value as-is) or `RULE:<pattern>/<replacement>/[LU]` (if the regex matches, build the replacement from the captured groups, optionally lower-cased with `L` or upper-cased with `U`). Rules run top to bottom and the first match wins. If nothing matches and there's no `DEFAULT`, authentication fails with a clear message. + +If you set neither option, the provider does exactly what it does today (read the first CN, fail if it's missing). So this is purely additive and safe to upgrade into. The final role string flows into authorization just like it does now, so nothing downstream needs to change. + +# Detailed Design + +## Design & Implementation Details + +### Reading the identity + +Add a small helper, `TlsIdentityExtractor`, called from `AuthenticationProviderTls.authenticate()` right after the existing null-check on the certificate: + +- `CN` - the current logic (first `CN=` part of the subject). +- `DN` - the full subject string from `((X509Certificate) certs[0]).getSubjectX500Principal().getName()` ([RFC 2253] format), used as-is. +- `SAN:URI` / `SAN:DNS` / `SAN:EMAIL` - read `X509Certificate.getSubjectAlternativeNames()` and take the first SAN of the matching type ([RFC 5280]: URI = 6, DNS = 2, email = 1). If there are several of the same type, the first one in the certificate is used, and a mapping rule can sort out which one you want. (Pulsar already reads SANs this way in `org.apache.pulsar.common.tls.TlsHostnameVerifier`, so this isn't new ground.) + +If the chosen source has no value (for example, you asked for a URI SAN but the certificate doesn't have one), authentication fails with a message that says which source was missing, and a failure metric is recorded with a specific error code. + +### Applying the rules + +A second helper, `PrincipalMappingRules`, parses `tlsAuthPrincipalMappingRules` once at `initialize()` and compiles the regexes into an ordered list. 
`apply(identity)` returns the result of the first rule that matches. The format and behavior match Kafka's on purpose, so the docs and the operator's existing knowledge carry over. Example: + +``` +tlsCertIdentitySource = SAN:URI +tlsAuthPrincipalMappingRules = RULE:^spiffe://acme\.com/ns/([^/]+)/sa/([^/]+)$/$1__$2/L,DEFAULT +``` + +Here the SPIFFE id `spiffe://acme.com/ns/payments/sa/checkout` becomes the role `payments__checkout`. + +### Where it hooks in + +Only `AuthenticationProviderTls.authenticate(AuthenticationDataSource)` changes. The `AuthenticationProvider` interface, the `AuthenticationDataSource`, and every caller stay the same. Because HTTPS client-cert requests go through this same provider, both transports are covered by the one change. + +**The proxy needs the config too - this is real work, not free.** The Pulsar Proxy builds its `AuthenticationService` from a `ServiceConfiguration` that it gets by calling `PulsarConfigurationLoader.convertFrom(ProxyConfiguration)`. That conversion copies fields **by matching name**. So the two new keys have to be added to *both* `ServiceConfiguration` and `ProxyConfiguration`, with the same names. If they're only in `ServiceConfiguration`, the proxy quietly ignores them and falls back to CN - and then the proxy and the broker can map the same certificate to different roles. A test that goes through the proxy should confirm the config actually takes effect. + +## Public-facing Changes + +### Public API + +None. No new REST endpoint, no client API change, no change to the SPI signatures. + +### Binary protocol + +None. The TLS handshake and the auth command don't change. The only difference is how +the broker reads the certificate it already received. + +### Configuration + +Two new properties in `broker.conf` / `ServiceConfiguration`, also added to `proxy.conf` / `ProxyConfiguration` (see the proxy note above): + +- `tlsCertIdentitySource` - one of `CN` (default), `DN`, `SAN:URI`, `SAN:DNS`, `SAN:EMAIL`. 
Picks the raw identity from the certificate. +- `tlsAuthPrincipalMappingRules` - comma-separated, ordered list of `DEFAULT` / `RULE:<pattern>/<replacement>/[LU]`. Empty by default, which means "use the value as-is" and keeps today's behavior for `CN`. + +### CLI + +None. + +### Metrics + +Reuse the provider's existing `AuthenticationMetrics` failure path. The provider already has an `ErrorCode` enum (`UNKNOWN`, `INVALID_CERTS`, `INVALID_CN`); add two codes so operators can tell failures apart: + +- `INVALID_CN` (existing) - CN source chosen but no CN present. +- `NO_SAN_OF_TYPE` (new) - SAN source chosen but no SAN of that type present. +- `NO_MAPPING_RULE_MATCHED` (new) - identity read, but no rule (and no `DEFAULT`) matched. + +These show up on the existing per-provider failure counter, labeled by error code. + +# Monitoring + +After rolling out a mapping config, operators should watch the new failure error codes. A jump in `NO_MAPPING_RULE_MATCHED` or `NO_SAN_OF_TYPE` usually means a rule or source is misconfigured and is now rejecting clients that should be let in. When migrating from CN to SAN/DN identity, keep an eye on that per-code counter to confirm the new rules cover all your certificates before you remove any `DEFAULT` fallback. + +# Security Considerations + +This PIP changes how a certificate turns into a role, so it deserves a careful security review: + +- **A bad rule can hand out the wrong role.** A regex that's too loose (like `RULE:.*/admin/`) could map lots of certificates to a powerful role. The docs need to push people toward anchored patterns (`^...$`) and remind them that order matters. Rules are compiled at startup, so a broken rule stops the broker from booting instead of failing silently later. +- **Changing the role string can change who's a super-user or proxy.** `superUserRoles` and `proxyRoles` are matched by exact string. 
If you change the identity source, a certificate that used to map to a super-user role under CN might not anymore (or, with a sloppy rule, something might newly become one). Re-check those lists whenever you change the source. +- **This never weakens certificate validation.** The identity is only read *after* the TLS layer has already validated and trusted the certificate. Picking a SAN doesn't skip any of that. +- **The empty-CN change only happens if you opt in.** Today an empty CN fails. With a SAN or DN source, those certificates can now authenticate. But that only happens when an operator deliberately sets a non-CN source, so no existing setup changes on its own. +- **No change to multi-tenancy.** The role string is the same one authorization already uses, and tenant isolation is enforced downstream exactly as before. + +If anything here is unclear, we should confirm the rule format and SAN handling on the mailing list before merging. + +# Drawbacks and Pitfalls + +It's worth being honest about the downsides and the easy ways to get this wrong. + +## Drawbacks (cons) + +- **More config to get right.** Two new options on a security-sensitive provider means two more things an operator can misconfigure. The default keeps today's behavior, but anyone who turns this on takes on that responsibility. +- **Regex-on-a-DN is powerful but fiddly.** Mapping rules give a lot of freedom, and freedom means foot-guns. Kafka has lived with the same trade-off, so it's well-understood, but it's still a sharp tool. +- **Still one role per certificate.** If you're a SPIFFE shop that wants groups or multiple roles from one certificate, this doesn't get you there. That's a separate problem. +- **Doesn't touch revocation.** This is only about reading identity. CRL/OCSP and certificate trust are out of scope, so don't expect this to improve them. 
+ +## Pitfalls (easy mistakes) + +- **Forgetting the proxy.** If you add the keys to the broker but not the proxy, the proxy silently falls back to CN and you get a split-brain where the proxy and broker disagree on a client's role. Always configure both, and test through the proxy. +- **The DN isn't the string you think it is.** `X500Principal.getName()` returns the [RFC 2253] form, which (a) lists the RDNs most-specific first (`CN=...,OU=...,O=...,C=...`), the *reverse* of the order in which they are encoded in the certificate and in which many tools print them, (b) escapes special characters, and (c) renders attributes with no short name as `OID=#hexvalue`. A rule written against a "nice-looking" DN may never match the real one. Write your rules against the actual canonical string, and the docs should show realistic examples. +- **"First SAN of that type" when there are several.** A certificate can carry more than one URI (or DNS, or email) SAN. We take the first one in certificate order, so if order matters to you, use a mapping rule to pin down the one you want. +- **Too-broad rules leak privilege.** This is the security point above, repeated here because it's the most common mistake: anchor your patterns and order your rules so a catch-all doesn't accidentally grant `admin`. +- **Rollback can lock people out.** If certificates authenticated *only* because of a SAN/DN source plus rules, they'll stop working after a downgrade to a version without this feature, or after you remove the config. If you need rollback safety, keep a working CN in those certificates as a fallback. + +# Backward & Forward Compatibility + +## Upgrade + +Nothing to do. If you leave the new options unset, `AuthenticationProviderTls` behaves exactly like before (first CN, fail on empty CN). You opt in by setting `tlsCertIdentitySource` / `tlsAuthPrincipalMappingRules`. + +## Downgrade / Rollback + +Remove the two properties, or downgrade the broker/proxy, and behavior goes back to CN-only. 
Be careful: any certificate that authenticated only because of a SAN/DN source plus rules will stop authenticating after rollback. If you need to be able to roll back, keep a usable CN in those certificates (see *Pitfalls*). + +## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations + +This is a per-broker (and per-proxy) setting. In a geo-replicated cluster, each broker reads roles from the certificates presented to *it* - brokers replicate data, not TLS sessions, so there's no cross-cluster coupling. Just make sure brokers (and proxies) that might serve the same clients use the same mapping config, so a client doesn't get a different role depending on which broker it happens to hit. + +# Alternatives + +- **Add a full `PrincipalBuilder` SPI like Kafka's [KIP-189].** More flexible, but Pulsar already has `AuthenticationProvider` as its extension point. Adding a second SPI for the same job would be redundant. Improving the built-in provider covers the common cases with no code from operators. +- **Ship a separate `AuthenticationProviderTlsSan` provider.** Rejected. It forks the TLS provider and forces operators to choose between providers instead of just configuring one identity policy. One configurable provider is simpler. +- **Hardcode "try SAN first, then CN."** Rejected. It's too opinionated - different organizations want different sources - and silently changing the default would be a breaking change. + +# General Notes + +The change is deliberately small and contained: one helper to read the identity, one helper to compile the rules, two new config keys (mirrored in `ServiceConfiguration` and `ProxyConfiguration`), and two new failure error codes, all behind `AuthenticationProviderTls`. Reusing Kafka's rule format is a conscious choice to make life easier for people who run both systems. 
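As a rough illustration of the identity-reading helper summarized above, the following sketch picks the first SAN of a given type from the collection that `X509Certificate.getSubjectAlternativeNames()` returns (each entry is a list whose first element is the integer type and whose second element is the value; the method returns `null` when the certificate has no SAN extension). The class name `SanLookupSketch` and the method `firstSanOfType` are hypothetical and not part of the proposed implementation.

```java
import java.util.Collection;
import java.util.List;
import java.util.Optional;

/** Illustrative only: hypothetical helper, not the proposed TlsIdentityExtractor. */
final class SanLookupSketch {

    // GeneralName type tags from RFC 5280 section 4.2.1.6.
    static final int SAN_EMAIL = 1;
    static final int SAN_DNS = 2;
    static final int SAN_URI = 6;

    /**
     * Returns the first SAN value of the wanted type, in certificate order.
     *
     * @param sans the result of X509Certificate.getSubjectAlternativeNames(); may be null
     */
    static Optional<String> firstSanOfType(Collection<List<?>> sans, int wantedType) {
        if (sans == null) {
            return Optional.empty();                   // no SAN extension at all
        }
        for (List<?> entry : sans) {
            if (entry.size() >= 2
                    && entry.get(0) instanceof Integer
                    && (Integer) entry.get(0) == wantedType
                    && entry.get(1) instanceof String) {
                return Optional.of((String) entry.get(1));
            }
        }
        return Optional.empty();                       // caller would fail with NO_SAN_OF_TYPE
    }
}
```

An empty result corresponds to the "chosen source has no value" case, which the design maps to an authentication failure rather than a silent fallback to CN.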
+ +# Links + +* Apache Kafka KIP-371 - *Add a configuration to build custom SSL principal name* (`ssl.principal.mapping.rules`): https://cwiki.apache.org/confluence/display/KAFKA/KIP-371%3A+Add+a+configuration+to+build+custom+SSL+principal+name +* Apache Kafka KIP-189 - *Improve principal builder interface and add support for SASL* (`KafkaPrincipalBuilder` SPI): https://cwiki.apache.org/confluence/display/KAFKA/KIP-189%3A+Improve+principal+builder+interface+and+add+support+for+SASL +* RFC 9525 - *Service Identity in TLS* (current standard; obsoletes RFC 6125 - use SAN, not CN, for identity): https://www.rfc-editor.org/rfc/rfc9525 +* RFC 6125 - *Representation and Verification of Domain-Based Application Service Identity ...* (obsoleted by RFC 9525; kept here for historical reference): https://www.rfc-editor.org/rfc/rfc6125 +* RFC 2253 - *LDAPv3 UTF-8 String Representation of Distinguished Names* (the DN string format that `X500Principal.getName()` is defined to return; obsoleted as a general standard by RFC 4514, but still the format the Java API emits): https://www.rfc-editor.org/rfc/rfc2253 +* RFC 4514 - *LDAP: String Representation of Distinguished Names* (current standard; obsoletes RFC 2253): https://www.rfc-editor.org/rfc/rfc4514 +* RFC 5280 - *Internet X.509 PKI Certificate and CRL Profile*, §4.2.1.6 Subject Alternative Name (SAN types): https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6 +* SPIFFE ID specification (URI-SAN workload identity): https://github.com/spiffe/spiffe/blob/main/standards/SPIFFE-ID.md +* Mailing List discussion thread: https://lists.apache.org/thread/18clc2l50nrkoyhgo0pddw80y9zyd7sp +* Mailing List voting thread: TBD