custom domains: middleware and verification#2897
custom domains: middleware and verification#2897Soxasora wants to merge 160 commits intostackernews:masterfrom
Conversation
- ACM support - custom domains crud, resolvers, fragments - custom domains form, guidelines - custom domains context - domain verification every 5 minutes via pgboss - domain validation schema - basic custom domains middleware, to be completed - TODOs tracings
- CustomDomain -> Domain - DomainVerification table - CNAME, TXT, SSL verification types - WIP DomainVerification upsert
…ange status of a Record from its Attempt, multi-purpose dns verification
…tch as they can fail
- use DomainVerificationStatus enum for domains and records - adapt Territory Form UI to new schema - return 'records' as an object with its types - wip: prepare for attempts and certificate usage for prisma
fix: - fix setDomain mutation transaction - fix schema typedefs enhance: - DNS records guidelines with flex-wrap for longer records cleanup: - add comments to worker - remove console.log on validation values
… HOLD handle territory changes via triggers - on territory stop, HOLD the domain - on territory takeover from another user, delete the domain and its associated records handle ACM certificates via trigger - on domain/domainCertificate deletion, ask ACM to delete the certificate via a pgboss job; removes the need to ask ACM in multiple places clear domains that have been on HOLD for more than 30 days, check every midnight via pgboss schedule use 'domains' profile for worker jobs
…not valid or we're in production
… same certificate if requested multiple times in 1 hour
…sed for personalization
…y're already unique with their own indexes
| {append} | ||
| </span> | ||
| ) | ||
| } |
There was a problem hiding this comment.
CopyButton append mode never shows copied visual feedback
Low Severity
The new append rendering path in CopyButton renders a static {append} element and never reflects the copied state visually. The default button mode swaps its content to a Thumb icon on success, but the append mode always renders the same ClipboardLine icon regardless of whether the copy succeeded. Users relying on the clipboard icon (used in DomainGuidelines) get no inline feedback beyond the toast notification.
Reviewed by Cursor Bugbot for commit 616ce1f. Configure here.
There was a problem hiding this comment.
This all looks sane and well organized to me. Hard to tell exactly how it'll interact with all the reverse proxies, but that's what the admin gating is for.
Not blocking, but I had a bit of trouble navigating the state machine.
IMO there a few areas where my confidence is not super high:
- state machine related issues
- interactions with aws (out of our control mostly until we're in prod)
- header handling in the proxy (you made it dead simple but I need to walk through it another time or two)
I'll continue reviewing tonight to gain more confidence in those areas, but with admin gating, this is good to go afaict.
You mentioned you had some minor fixes you wanted to make before merging so I'll wait for your green light.
| # reachable by containers on 172.30.0.2(:53), outside of docker with 0.0.0.0:5353 | ||
| DOMAINS_DNS_SERVER=172.30.0.2 | ||
| # custom domains debugging | ||
| NEXT_PUBLIC_CUSTOM_DOMAINS_DEBUG=1 |
There was a problem hiding this comment.
Do we need to add this to the service worker?
huumn
left a comment
There was a problem hiding this comment.
found a few non-blocking things in the proxy
| // NextJS recommends to not add the CSP header to prefetches and static assets | ||
| // See https://nextjs.org/docs/app/building-your-application/configuring/content-security-policy | ||
| { | ||
| source: '/((?!api|_next/static|_error|404|500|offline|_next/image|_next/webpack-hmr|favicon.ico).*)', |
There was a problem hiding this comment.
Because we exclude /api, calls to /api/graphql do not strip x-stackernews-subname. This isn't a problem now because we don't read the header outside of getserversideprops, but if that changes we'll be reading a spoofable x-stackernews-subname. I'm not sure why that might change though.
It's at least worth commenting on. We could also go further running a little header stripping middleware when source: '/((?api|_next/static|_error|404|500|offline|_next/image|_next/webpack-hmr|favicon.ico).*)'.
There was a problem hiding this comment.
I've decided to use the cached domain mappings where we have an asynchronous context, such as getGetServerSideProps. It feels more appropriate.
| // TODO: handle auth sync | ||
|
|
||
| // clean up the pathname from any subname | ||
| if (pathname.startsWith('/~')) { |
There was a problem hiding this comment.
https://www.pizza.com/~bitcoin is redirected to https://www.pizza.com/bitcoin. It's kind of low severity given it'd have to be manually input, but it might be worth redirecting to https://stacker.news/~bitcoin or 404ing.
If we redirect https://www.pizza.com/~bitcoin to https://stacker.news/~bitcoin in the proxy, then we can probably avoid doing badge rewrites in components.
Not blocking either. Just a suggestion.
There was a problem hiding this comment.
https://www.pizza.com/~bitcoin redirects to https://www.pizza.com since that part removes the territory name.
It would be much easier to redirect to stacker.news via proxy indeed. I'll implement this!
There was a problem hiding this comment.
I've been trying to implement this but in dev this can't work due to a Next.js bug: vercel/next.js#44482
It seems like redirecting from www.pizza.com/~bitcoin to localhost:3000/~bitcoin won't work because localhost will be ignored and it will be treated as a relative redirect (www.pizza.com/~bitcoin) causing a redirect loop.
If I instead redirect to, idk, stacker.news it will work.
In production it should work, but the fact that it's untestable and will actually break the middleware in dev... makes this a bit weird to work with.
I think we can postpone this to the UI/UX PR, I like the idea of being able to redirect to stacker news in one place, probably we need to be creative.
|
I found these mermaid diagrams helpful for understanding how things progress through domain verification, aws calls, and the proxy. stateDiagram-v2
[*] --> PENDING: setDomain creates row
PENDING --> ACTIVE: verifyDomain returns VERIFIED
PENDING --> HOLD: 48h elapsed OR 3 AWS retries exhausted OR Sub STOPPED
ACTIVE --> HOLD: Sub STOPPED trigger
HOLD --> PENDING: setDomain re-verify on same domain
HOLD --> [*]: clearLongHeldDomains after 30 days
ACTIVE --> [*]: Sub owner changes
HOLD --> [*]: Sub owner changes
sequenceDiagram
participant W as worker
participant A as ACM
participant E as ELBv2
W->>A: "requestCertificate(domain, IdempotencyToken=domainId)"
A-->>W: certificateArn
W->>A: describeCertificate(arn)
A-->>W: "{ Status: PENDING_VALIDATION, ResourceRecord }"
Note right of W: store arn and SSL record in DB
loop every 30s to 5m while PENDING
W->>A: describeCertificate(arn)
A-->>W: Status
end
W->>E: "addListenerCertificates(listener, arn)"
E-->>W: "ok (no-op if already attached)"
Note over W,E: Domain ACTIVE
Note over W: Later, domain deleted or HOLD fires cleanup
W->>E: "removeListenerCertificates(listener, arn)"
E-->>W: "ok (no-op if not attached)"
W->>A: deleteCertificate(arn)
A-->>W: "ok (errors if cert in use)"
flowchart TB
Req[Incoming request] --> Strip[proxy.js strips x-stacker-news-subname]
Strip --> Ref[referrerMiddleware]
Ref --> HostCheck{host equals NEXT_PUBLIC_URL.host?}
HostCheck -- yes --> Sec[applySecurityHeaders]
HostCheck -- no --> Map{getDomainMapping returns ACTIVE row?}
Map -- no --> Sec
Map -- yes --> Custom[customDomainMiddleware]
Custom --> Rewrite{path starts with slash-tilde?}
Rewrite -- yes --> Redirect[redirect to cleaned path]
Rewrite -- no --> SubParam{has ?sub and mismatches?}
SubParam -- yes --> RedirectFix[redirect with sub coerced]
SubParam -- no --> IsTerr{path is / or territory path?}
IsTerr -- yes --> RW[rewrite to /tilde-sub/path and set header]
IsTerr -- no --> Pass[NextResponse.next and set header]
Redirect --> Cookies[applyReferrerCookies]
RedirectFix --> Cookies
RW --> Cookies
Pass --> Cookies
Cookies --> Sec
|
…data, used in place of headers in async contexts; cleanup
|
Cleaned the code just a bit and implemented |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 9eab96c. Configure here.
|
|
||
| return domains.reduce((acc, domain) => { | ||
| acc[domain.domainName.toLowerCase()] = { subName: domain.subName } | ||
| return acc |
There was a problem hiding this comment.
Domain mapping cache omits domainName from stored values
High Severity
The domainsMappingsCache stores each entry as { subName: domain.subName } without including domainName. Both proxy.js and ssrApollo.js then access mapping.domainName, which is always undefined. In the proxy, this passes undefined as the domain argument to customDomainMiddleware, causing the x-stacker-news-domain header to be set to the string "undefined" and debug logging to break. In SSR, the domain prop propagated to DomainProvider will have domainName: undefined.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit 9eab96c. Configure here.
| } | ||
|
|
||
| export async function getDomainMappingFromRequest (req) { | ||
| const host = req.headers.get('host') |
There was a problem hiding this comment.
req.headers.get() crashes on plain object headers in SSR
High Severity
getDomainMappingFromRequest calls req.headers.get('host'), which uses the Web Headers API. This works in proxy.js where req is a NextRequest, but in ssrApollo.js the req comes from getServerSideProps and is a Node.js IncomingMessage with a plain object for headers. Calling .get() on a plain object throws TypeError, crashing every SSR page load.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 9eab96c. Configure here.


Description
Based on #2904, updated to Next.js 16
Part of #1942, revives #2145.
Introduces custom domains via Next.js proxy. This PR focuses on domain creation and verification.
The proxy detects custom domains and applies redirects and rewrites to generate a territory-centric SN version. Territory paths lose the
~subNameprefix, and attempts to go outside of the domain's territory are corrected via redirect.1Verification is a multi-step repeating process that involves DNS verification, AWS certificates creation/verification and attachment to the AWS load balancer listener.
AWS is simulated with Localstack on sndev.
More infos are available inside
docs/dev/custom-domains.md(todo: not finished)Screenshots
Domain verification starts right when a custom domain is saved
sndev: DNS is simulated with dnsmasq
sndev: AWS certificates (ACM) are simulated with Localstack and are automatically verified on the next call
custom domain is verified
custom domain overview
Concerns and known issues
SN_TERRITORY_PATHSis hardcoded and may drift@@index([domainName])domainName(line 918: @unique), which implicitly creates a B-tree index. The explicit@@index([domainName])is redundant.STOPPEDstatus can be handled via JS, but at the same time these triggers help us handle a custom domain lifecycle in the safest way possible.Checklist
Are your changes backwards compatible? Please answer below:
Yes, everything is an addition. An exception is the way we handle S3 via localstack, now the container is called AWS and also allows ACM simulations other than S3.
On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
7, QA OK, domain verification handles errors correctly and follows its step, middleware correctly detects a custom domain and obtains its connected sub.
For frontend changes: Tested on mobile, light and dark mode? Please answer below:
Yes, it uses standard SN colors and styles that fits with theme changes.
Did you introduce any new environment variables? If so, call them out explicitly here:
An endpoint for localstack that can be contacted by our worker container
LOCALSTACK_ENDPOINT=http://aws:4566local DNS server via dnsmasq, with fallback to 1.1.1.1 in both dnsmasq and code.
DOMAINS_DNS_SERVER=172.20.0.2:53tells Next.js what domains (sox.pizza) or wildcards (**.pizza) can be trusted for HMR in dev.
ALLOWED_DEV_ORIGINS=**.sndevcustom domains debug logs are gated by env var, by default they're disabled (0)
NEXT_PUBLIC_CUSTOM_DOMAINS_DEBUG=0Did you use AI?
review
Note
High Risk
Touches request routing middleware and introduces new background jobs + DB triggers that manage domain state and AWS certificate attachment, which can break navigation or incorrectly attach/detach certs if misconfigured. Also adds new persistence and lifecycle automation in production-critical infrastructure (routing, cert management).
Overview
Enables beta territory custom domains by adding a Next.js
proxymiddleware that detects mapped domains, injectsx-stacker-news-subname, and rewrites/redirects requests so territory routes work without the~subNameprefix while preserving existing referral/security header behavior.Adds a full domain lifecycle: new Prisma models/migration + triggers for domains/verification records/attempts/certificates, GraphQL
domain/setDomainAPIs (owner + beta-gated) that schedulepgbossverification jobs, and a worker-driven multi-step verifier that checks DNS CNAME, requests/polls ACM certs, and attaches/detaches certs to an ALB listener (with Localstack + mocked ELBv2 in dev) plus periodic cleanup of long-held domains.Updates UI/navigation and URL handling to be domain-aware (new
DomainProvider, prefix/nav-key helpers, external territory link handling), adds a custom domain management form under territory edit, and extends dev/docker env to run a Localstackawsservice and a newdomainscompose profile (plus new debug/HMR env vars).Reviewed by Cursor Bugbot for commit 9eab96c. Bugbot is set up for automated code reviews on this repo. Configure here.
Footnotes
I'm not sure if this should be the behavior, maybe it's better to redirect/open a new tab to stacker.news ↩