Replace static discovery files with dynamic Django views #1061
Conversation
Discovery endpoints (robots.txt, llms.txt, llms-full.txt, sitemap.xml, .well-known/mcp.json) are now served dynamically by Django instead of as static files from the frontend nginx. Key improvements:

- llms.txt and llms-full.txt auto-populate Available Collections from the database, showing public corpuses with titles, slugs, and doc counts
- Hostnames resolved from the request instead of the THIS_HOST placeholder
- robots.txt explicitly welcomes AI crawlers (GPTBot, ClaudeBot, etc.)
- New sitemap.xml lists public corpuses and documents via DocumentPath
- New .well-known/mcp.json provides MCP server discovery
- Traefik routes discovery paths to Django in prod and CI configs
- Links use proper inline Markdown format per the llmstxt.org spec
- All responses cached for 5 minutes to avoid repeated DB hits
- Comprehensive test suite (6 test classes, 30+ test methods)

https://claude.ai/code/session_017dHEzeJ81Afo5BjLMMGTAA
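As a rough illustration of the dynamic generation described above, here is a framework-free sketch of assembling a robots.txt body that welcomes AI crawlers. The crawler names come from the PR description; the exact directives, helper name, and URL are invented for illustration and are not the real view's output:

```python
# Illustrative only: the real view builds its response inside a Django view.
AI_CRAWLERS = ["GPTBot", "ClaudeBot"]  # crawlers named in the PR description


def render_robots(base_url: str) -> str:
    lines = []
    for bot in AI_CRAWLERS:
        # explicitly welcome each AI crawler
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    # default rule plus a host-aware sitemap link
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {base_url}/sitemap.xml"]
    return "\n".join(lines)
```

Called as `render_robots("https://example.com")`, the Sitemap line reflects the request host rather than a THIS_HOST placeholder.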
PR Review: Dynamic Discovery Endpoints

Good overall approach — migrating these files to Django views eliminates the N+1 Query in …
…app config, module-level imports, missing test, standardize rate limit display
PR Review: Dynamic Discovery Endpoints

This is a well-structured PR that solves a real problem (static files with `THIS_HOST` placeholders).

Positives
Issues

Bug: Cache key collision for multi-hostname deployments

```python
# views.py
@require_GET
@cache_page(CACHE_SECONDS)
def robots_txt(request: HttpRequest) -> HttpResponse:
    base_url = _get_base_url(request)  # varies by Host header
```

Django's …

Fix: add `from django.views.decorators.vary import vary_on_headers`:

```python
@require_GET
@vary_on_headers("Host")
@cache_page(CACHE_SECONDS)
def robots_txt(request: HttpRequest) -> HttpResponse: ...
```

Apply this to all five views. This is the same pattern Django's own …

Medium: Tight coupling to …
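The collision this review describes can be shown with a toy cache keyed on path alone versus path plus host (plain Python, no Django; all names here are invented for illustration):

```python
# Toy per-URL cache illustrating the multi-hostname collision concern.
cache: dict = {}


def cached_render(path: str, host: str, vary_on_host: bool) -> str:
    key = (path, host) if vary_on_host else path
    if key not in cache:
        cache[key] = f"content for {host}{path}"  # first host wins the slot
    return cache[key]


first = cached_render("/robots.txt", "a.example", vary_on_host=False)
second = cached_render("/robots.txt", "b.example", vary_on_host=False)
# keyed on path only, b.example is served a.example's cached body
```

Including the host in the key (`vary_on_host=True`) gives each hostname its own cache entry.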
Codecov Report

❌ Patch coverage is …
- Add @vary_on_headers("Host") before @cache_page on all 5 discovery
views to prevent cache key collisions in multi-hostname deployments
- Wrap RATE_LIMIT_REQUESTS import in try/except with fallback to 100
so discovery module does not hard-depend on mcp.config
- Remove vacuous if-guard around private corpus slug assertion in test
- Add _sanitize_markdown_title() helper that strips leading '#' chars
from corpus titles before interpolating into Markdown headings
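The guarded-import bullet above might look like this. This is a sketch: the module path and constant name are taken from the commit message, and the fallback value from its description:

```python
# Fall back to a safe default if the MCP config isn't importable, so the
# discovery module doesn't hard-depend on opencontractserver.mcp.config.
try:
    from opencontractserver.mcp.config import RATE_LIMIT_REQUESTS
except ImportError:
    RATE_LIMIT_REQUESTS = 100  # fallback described in the commit message
```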
PR Review: Dynamic Discovery Endpoints

Overall this is a well-structured PR with good test coverage. The architecture is clean and the feature makes practical sense. A few issues and suggestions below.

Issues

1. Decorator ordering: In Python, decorators apply bottom-up. The current order on all five views:

```python
@require_GET
@vary_on_headers("Host")
@cache_page(CACHE_SECONDS)
def robots_txt(request): ...
```

means the execution chain is require_GET → vary_on_headers → cache_page, with `cache_page` innermost: it stores the response before `vary_on_headers` has added the `Vary: Host` header, so the header never affects what gets cached. The correct order for Vary to influence cache key computation is `@cache_page` above `@vary_on_headers`.

2. DRY violation in `sitemap_xml`: it duplicates the `AnonymousUser` + `visible_to_user` corpus query instead of reusing a shared helper.
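The bottom-up application order the review invokes can be demonstrated with stand-in decorators (a self-contained sketch, not the project's code):

```python
# Record the order in which stacked wrappers run on the way into a view.
calls = []


def tag(name):
    def deco(fn):
        def wrapper(*args, **kwargs):
            calls.append(name)
            return fn(*args, **kwargs)
        return wrapper
    return deco


@tag("require_GET")       # applied last, so it runs first on the way in
@tag("vary_on_headers")
@tag("cache_page")        # applied first, so it wraps the view directly
def view():
    return "ok"


view()
# calls is now ["require_GET", "vary_on_headers", "cache_page"]
```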
3. No limit on corpus count in `_get_public_corpuses()`: The helper fetches all public corpuses with no upper bound. With many hundreds of public corpuses this materializes into memory on every cache miss across … The …

4. Corpus descriptions can inject Markdown structure; strip newlines before interpolating:

```python
desc = (c["description"] or "").replace("\n", " ").replace("\r", "")
```

5. Per …

Minor Observations

Traefik rule fragility — Both …

Test gap: sitemap document URLs — Tests verify that corpus URLs appear in the sitemap but don't exercise the …

What's Good
…ns, DRY sitemap, add corpus limit

- Remove @vary_on_headers(Host) from all 5 discovery views since @cache_page already varies on the full request URI including host
- Sanitize corpus descriptions in llms.txt and llms-full.txt by stripping newlines to prevent Markdown structure injection
- Extract _get_public_corpus_queryset() so sitemap_xml reuses the shared query instead of duplicating AnonymousUser + visible_to_user
- Cap public corpus queries at MAX_PUBLIC_CORPUSES (500) to prevent unbounded memory usage
- Move cache duration constant to opencontractserver/constants/discovery.py
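The description-sanitizing point can be illustrated standalone. The corpus text and URL below are invented; the replace calls mirror the ones described in the review thread:

```python
# A newline in a user-supplied description would otherwise start a new
# Markdown block (e.g. a fake heading) inside llms.txt.
raw = "A contracts corpus\n# Injected heading"
sanitized = (raw or "").replace("\n", " ").replace("\r", "")
entry = f"- [Demo Corpus](https://example.com/corpus/demo/): {sanitized}"
```

After sanitizing, the entry stays a single Markdown list item no matter what the description contains.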
PR Review: Dynamic Discovery Endpoints

This is a well-structured migration. The architecture is clean, and the permission model correctly uses `Corpus.visible_to_user(AnonymousUser())`.

Bug / Correctness
The description field gets newlines replaced (lines 1085, 1370), but corpus titles are interpolated into Markdown headings unsanitized. Suggested helper:

```python
def _sanitize_markdown_title(title: str) -> str:
    return title.lstrip("#").replace("\n", " ").replace("\r", "").strip()
```

CLAUDE.md Violation: Magic Number

```python
MAX_SITEMAP_DOCUMENTS = 1000
```

Minor: Inconsistency in queryset access pattern
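A named cap like `MAX_SITEMAP_DOCUMENTS` is applied by slicing. This is illustrative only: slicing a Django queryset becomes a SQL LIMIT, shown here on a plain list with invented data:

```python
MAX_SITEMAP_DOCUMENTS = 1000  # named constant instead of a bare literal


def capped(urls):
    # slicing mirrors queryset slicing, which translates to SQL LIMIT
    return urls[:MAX_SITEMAP_DOCUMENTS]


many = [f"/doc/{i}/" for i in range(5000)]
```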
Minor:
Summary
Migrated `robots.txt`, `llms.txt`, `llms-full.txt`, `sitemap.xml`, and `.well-known/mcp.json` from static frontend files to dynamic Django views that serve live data from the database. This enables these discovery endpoints to automatically reflect current corpus and document information without manual updates.

Key Changes
New discovery views module (`opencontractserver/discovery/views.py`):

- `robots_txt()`: Serves a dynamic robots.txt with AI crawler allowlists and links to discovery endpoints
- `llms_txt()`: Concise MCP server documentation following the llmstxt.org spec, with live corpus listings
- `llms_full_txt()`: Comprehensive MCP API reference with tool parameters, resources, examples, and architecture diagrams
- `sitemap_xml()`: XML sitemap including the homepage, public corpuses, documents, and discovery endpoints
- `well_known_mcp()`: JSON discovery endpoint listing global and corpus-scoped MCP servers
- `_get_public_corpuses()`: Queries the database for anonymous-user-visible corpuses
- `_get_base_url()`: Derives the canonical base URL from the request (handles both HTTP and HTTPS)

URL routing (`opencontractserver/discovery/urls.py`): registered in `config/urls.py`

Comprehensive test coverage (`opencontractserver/discovery/tests/test_discovery_views.py`)

Removed static files: `frontend/public/robots.txt`, `frontend/public/llms.txt`, `frontend/public/llms-full.txt`; updated `frontend/index.html` to reference the dynamic endpoints

Traefik routing updates: `traefik.yml` and `traefik-ci.yml` exclude discovery endpoints from the frontend router (routing them to the backend instead)

Implementation Details

- All responses cached for 5 minutes (`@cache_page(300)`) to minimize database hits
- Public visibility follows the `Corpus.visible_to_user(AnonymousUser())` permission model
- Hostnames derived from the request (no more `THIS_HOST` placeholders)
- Rate limit values come from `opencontractserver.mcp.config`

https://claude.ai/code/session_017dHEzeJ81Afo5BjLMMGTAA
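For reference, the sitemap assembly described above can be sketched with the standard library. The URLs are invented; the real view derives them from public corpuses and DocumentPath:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(locs):
    # one <url><loc>…</loc></url> entry per public URL
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in locs:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = loc
    return ET.tostring(root, encoding="unicode")


xml_body = build_sitemap(["https://example.com/", "https://example.com/corpus/demo/"])
```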