
Replace static discovery files with dynamic Django views#1061

Merged
JSv4 merged 4 commits into main from
claude/review-robots-txt-llm-WuEAz
Mar 6, 2026

Conversation

Collaborator

@JSv4 JSv4 commented Mar 4, 2026

Summary

Migrated robots.txt, llms.txt, llms-full.txt, sitemap.xml, and .well-known/mcp.json from static frontend files to dynamic Django views that serve live data from the database. This enables these discovery endpoints to automatically reflect current corpus and document information without manual updates.

Key Changes

  • New discovery views module (opencontractserver/discovery/views.py):

    • robots_txt(): Serves dynamic robots.txt with AI crawler allowlists and links to discovery endpoints
    • llms_txt(): Concise MCP server documentation following llmstxt.org spec, with live corpus listings
    • llms_full_txt(): Comprehensive MCP API reference with tool parameters, resources, examples, and architecture diagrams
    • sitemap_xml(): XML sitemap including homepage, public corpuses, documents, and discovery endpoints
    • well_known_mcp(): JSON discovery endpoint listing global and corpus-scoped MCP servers
    • Helper function _get_public_corpuses(): Queries database for anonymous-user-visible corpuses
    • Helper function _get_base_url(): Derives canonical base URL from request (handles both HTTP/HTTPS)
  • URL routing (opencontractserver/discovery/urls.py):

    • Registered all five discovery endpoints at root level
    • Integrated into main config/urls.py
  • Comprehensive test coverage (opencontractserver/discovery/tests/test_discovery_views.py):

    • 386 lines of tests covering all endpoints
    • Validates content type, format compliance, corpus visibility, hostname resolution
    • Tests for edge cases (no public corpuses, private corpus exclusion)
    • Verifies HTTP method restrictions (GET only)
  • Removed static files:

    • Deleted frontend/public/robots.txt, frontend/public/llms.txt, frontend/public/llms-full.txt
    • Updated frontend/index.html to reference dynamic endpoints
  • Traefik routing updates:

    • Modified traefik.yml and traefik-ci.yml to exclude discovery endpoints from frontend router (route to backend instead)

Implementation Details

  • All views use 5-minute caching (@cache_page(300)) to minimize database hits
  • Respects existing Corpus.visible_to_user(AnonymousUser()) permission model
  • Dynamically resolves request hostname to avoid hardcoded URLs (replaces THIS_HOST placeholders)
  • Sitemap limits document entries to 1000 most recently modified for performance
  • MCP documentation includes rate limit info from opencontractserver.mcp.config
  • All endpoints return appropriate content types (text/plain, application/xml, application/json)
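For illustration, the text-assembly step behind a view like robots_txt() can be sketched without any framework machinery (crawler names and endpoint paths are taken from this description; the exact output format is an assumption):

```python
def build_robots_txt(base_url: str) -> str:
    """Assemble a robots.txt that welcomes AI crawlers and points at
    the discovery endpoints. An illustrative sketch, not the PR's
    exact output."""
    lines = [
        "User-agent: *",
        "Allow: /",
        "",
        # Explicit allowlist entries for AI crawlers, as described above.
        "User-agent: GPTBot",
        "Allow: /",
        "",
        "User-agent: ClaudeBot",
        "Allow: /",
        "",
        # Dynamic base_url replaces the old THIS_HOST placeholder.
        f"Sitemap: {base_url}/sitemap.xml",
        f"# LLM documentation: {base_url}/llms.txt",
    ]
    return "\n".join(lines) + "\n"
```

The Django view would derive base_url from the request (via _get_base_url()) and wrap this in an HttpResponse with content_type="text/plain".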

https://claude.ai/code/session_017dHEzeJ81Afo5BjLMMGTAA

Discovery endpoints (robots.txt, llms.txt, llms-full.txt, sitemap.xml,
.well-known/mcp.json) are now served dynamically by Django instead of
as static files from frontend nginx.

Key improvements:
- llms.txt and llms-full.txt auto-populate Available Collections from
  the database, showing public corpuses with titles, slugs, doc counts
- Hostnames resolved from request instead of THIS_HOST placeholder
- robots.txt explicitly welcomes AI crawlers (GPTBot, ClaudeBot, etc.)
- New sitemap.xml lists public corpuses and documents via DocumentPath
- New .well-known/mcp.json provides MCP server discovery
- Traefik routes discovery paths to Django in prod and CI configs
- Links use proper inline Markdown format per llmstxt.org spec
- All responses cached for 5 minutes to avoid repeated DB hits
- Comprehensive test suite (6 test classes, 30+ test methods)


claude Bot commented Mar 4, 2026

PR Review: Dynamic Discovery Endpoints

Good overall approach — migrating these files to Django views eliminates the THIS_HOST placeholder problem and lets public corpus listings stay current automatically. The test coverage is thorough and the Traefik routing changes are correct. A few issues worth addressing before merge:


N+1 Query in _get_public_corpuses()

views.py calls corpus.document_count() inside a for loop. That method fires a separate DocumentPath DB query per corpus (opencontractserver/corpuses/models.py:917-931). With many public corpuses this becomes O(n) queries at request time — even with 5-minute caching, the first hit per interval is expensive.

Replace with a single annotated query:

from django.contrib.auth.models import AnonymousUser
from django.db.models import Count, Q

from opencontractserver.corpuses.models import Corpus

def _get_public_corpuses() -> list[dict]:
    anonymous = AnonymousUser()
    qs = (
        Corpus.objects.visible_to_user(anonymous)
        .annotate(
            doc_count=Count(
                "documentpath",
                filter=Q(documentpath__is_current=True, documentpath__is_deleted=False),
                distinct=True,
            )
        )
        .order_by("-created")
    )
    return [
        {"slug": c.slug, "title": c.title, "description": c.description or "", "document_count": c.doc_count}
        for c in qs
    ]

Also remove the hasattr(corpus, "document_count") guard — the method is always defined on the model, making that branch dead code.


Double Queryset Evaluation in sitemap_xml()

public_corpuses is a lazy queryset that gets evaluated twice: once during for corpus in public_corpuses: and again during public_corpuses.values_list("id", flat=True). Materialise it first:

public_corpuses = list(Corpus.objects.visible_to_user(anonymous).order_by("-created"))
public_corpus_ids = [c.id for c in public_corpuses]
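The double-hit is easy to demonstrate outside Django with a queryset stand-in that counts round-trips (a simplified model: real querysets cache a completed iteration, but .values_list() clones the queryset and issues a fresh query, which is the trap here):

```python
class Row:
    def __init__(self, id):
        self.id = id

class FakeQuerySet:
    """Stand-in that counts DB round-trips for illustration."""
    def __init__(self, rows):
        self._rows = rows
        self.db_hits = 0

    def __iter__(self):
        self.db_hits += 1          # a full iteration costs one query
        return iter(self._rows)

    def values_list(self, field, flat=True):
        self.db_hits += 1          # clone + re-query, like Django's
        return [getattr(r, field) for r in self._rows]

qs = FakeQuerySet([Row(1), Row(2)])
for _ in qs:                       # query #1
    pass
ids = qs.values_list("id", flat=True)   # query #2
assert qs.db_hits == 2

# Materialising once avoids the second round-trip:
qs2 = FakeQuerySet([Row(1), Row(2)])
rows = list(qs2)                   # single query
ids = [r.id for r in rows]         # no further DB work
assert qs2.db_hits == 1
```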

opencontractserver.discovery Not in INSTALLED_APPS

The app is not added to LOCAL_APPS in config/settings/base.py. Django's test runner uses INSTALLED_APPS for test discovery — running python manage.py test will not find opencontractserver.discovery.tests.* unless the app is registered. Add it to LOCAL_APPS and add a matching apps.py (all existing apps have one).
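A minimal apps.py following stock Django convention (the default_auto_field value is an assumption about the project's settings):

```python
# opencontractserver/discovery/apps.py
from django.apps import AppConfig


class DiscoveryConfig(AppConfig):
    default_auto_field = "django.db.models.BigAutoField"
    name = "opencontractserver.discovery"
```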


Inline Model Imports Inside View Functions

Both _get_public_corpuses() and sitemap_xml() import models inside the function body. Move Corpus and DocumentPath imports to the module level following standard Django convention.


Missing test_only_get_allowed in LlmsFullTxtTest

RobotsTxtTest, LlmsTxtTest, SitemapXmlTest, and WellKnownMcpTest all verify POST returns 405. LlmsFullTxtTest is missing this check.
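A sketch of the missing check, mirroring the pattern the other four test classes presumably use:

```python
# Inside LlmsFullTxtTest (hypothetical method body; the URL and
# status code follow the checks described above):
def test_only_get_allowed(self):
    """POST should be rejected with 405, as on the other endpoints."""
    response = self.client.post("/llms-full.txt")
    self.assertEqual(response.status_code, 405)
```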


Rate Limit Display Inconsistency

The config defines RATE_LIMIT_WINDOW = 60 (seconds), but the rendered strings differ:

  • llms_txt / llms_full_txt: "100 requests/60s per IP"
  • well_known_mcp: "100 requests per 60 seconds per IP" (double "per")

The deleted static file said "100 requests/minute per IP" which was clearest. Consider standardising across all endpoints.

Also, CACHE_SECONDS = 300 in views.py duplicates CACHE_TTL = 300 that already exists in opencontractserver/mcp/config.py. Worth reusing.


Minor Observations

  • Host header injection: _get_base_url() reads request.get_host(). Django raises DisallowedHost when ALLOWED_HOSTS is properly configured, so this is safe — just document that ALLOWED_HOSTS must not be ["*"] in production.
  • Sitemap URL patterns: Corpus URLs use /c/{slug} and documents /c/{slug}/d/{slug}. Confirm these match actual CentralRouteManager frontend paths so sitemap links resolve correctly for crawlers.
  • /.well-known/mcp.json absent from sitemap: The sitemap lists /llms.txt and /llms-full.txt but not /.well-known/mcp.json. Probably intentional (machine endpoint, not a crawlable page), but worth a comment.

What Works Well

  • Correct use of Corpus.objects.visible_to_user(AnonymousUser()) for privacy boundary enforcement
  • @require_GET + @cache_page(300) is the right pattern for these read-only endpoints
  • xml.etree.ElementTree.tostring() properly XML-escapes titles with special characters
  • json.dumps() handles JSON escaping correctly in the MCP discovery response
  • DocumentPath filter (is_current=True, is_deleted=False) mirrors document_count() logic exactly
  • Traefik rules correctly use !Path() (exact match) for file endpoints vs !PathPrefix() for directories
  • @override_settings(CACHES=...) in tests prevents cross-test cache contamination
  • DiscoveryNoCorpusesTest correctly exercises the empty-state code paths

…app config, module-level imports, missing test, standardize rate limit display

claude Bot commented Mar 4, 2026

PR Review: Dynamic Discovery Endpoints

This is a well-structured PR that solves a real problem (static files with THIS_HOST placeholders). The overall approach is sound and the test coverage is comprehensive. Here are my findings:

Positives

  • Clean separation of concerns with _get_base_url() and _get_public_corpuses() helpers
  • Consistent use of visible_to_user(AnonymousUser()) — the correct permission pattern per project conventions
  • @require_GET on all endpoints (correct HTTP semantics)
  • Good test coverage for all 5 endpoints including edge cases (no public corpuses, private corpus exclusion)
  • The sitemap now correctly uses Sitemap: /sitemap.xml instead of the old Sitemap: /llms.txt which was invalid (a sitemap must be XML)
  • Removing the THIS_HOST placeholder anti-pattern is a meaningful improvement

Issues

Bug: Cache key collision for multi-hostname deployments

# views.py
@require_GET
@cache_page(CACHE_SECONDS)
def robots_txt(request: HttpRequest) -> HttpResponse:
    base_url = _get_base_url(request)  # varies by Host header

Django's @cache_page builds its cache key from the URL path, not the full Host header. If the site is served under multiple hostnames (e.g. opencontracts.opensource.legal and www.opencontracts.opensource.legal, which are both in traefik.yml), the first request caches a response embedding one hostname, and all subsequent requests — regardless of Host — get that cached response with the wrong embedded URLs.

Fix: add @vary_on_headers("Host") so Django includes the host in the cache key:

from django.views.decorators.vary import vary_on_headers

@require_GET
@vary_on_headers("Host")
@cache_page(CACHE_SECONDS)
def robots_txt(request: HttpRequest) -> HttpResponse:

Apply this to all five views. This is the same pattern Django's own django.contrib.sitemaps uses.


Medium: Tight coupling to mcp.config at module load time

# views.py line 6
from opencontractserver.mcp.config import RATE_LIMIT_REQUESTS

This import happens at module load time. If RATE_LIMIT_REQUESTS is ever renamed or the MCP module unavailable in a configuration that doesn't enable it, all discovery views will fail to import. A defensive fallback or moving the constant to a shared constants file would decouple these modules:

try:
    from opencontractserver.mcp.config import RATE_LIMIT_REQUESTS
except ImportError:
    RATE_LIMIT_REQUESTS = 100  # fallback default

Or better: define the constant in opencontractserver/constants/ (per CLAUDE.md's "no magic numbers" rule) and import it in both mcp/config.py and discovery/views.py.


Low: Weakened test assertion with unnecessary conditional

# test_discovery_views.py line 713
def test_excludes_private_corpus(self):
    response = self.client.get("/sitemap.xml")
    content = response.content.decode()
    if self.private_corpus.slug:  # <-- this guard is always True
        self.assertNotIn(f"/c/{self.private_corpus.slug}", content)

Slugs are always auto-generated for Corpus objects, so the guard is vacuously true and the assertion always runs. But the conditional makes it look like there's a case where the test is skipped. Removing the if makes the test's intent clearer:

self.assertNotIn(f"/c/{self.private_corpus.slug}", content)

Low: Corpus title/description could break Markdown structure

Corpus title and description are interpolated directly into Markdown-formatted text responses without escaping:

lines.append(f"### {title}")  # title containing "##" could nest incorrectly
if desc:
    lines.append(f"- Description: {desc}")  # backticks or > could break blockquote

A corpus titled ## Injected Heading would add a spurious H2 section to llms-full.txt. Since this data comes from user-controlled corpus titles, a simple sanitization of Markdown-special chars (or just truncating titles) would prevent formatting corruption. This isn't an XSS risk (text/plain response) but could confuse LLM parsers.
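One possible sanitizer for inline interpolation (an illustrative sketch, not the PR's helper; the strip set and length cap are assumptions):

```python
import re

def sanitize_markdown_inline(text: str, max_len: int = 200) -> str:
    """Strip characters that could alter Markdown structure when a
    user-controlled string is interpolated into a heading or list item."""
    # Collapse newlines so a value cannot start a new block.
    text = text.replace("\r", " ").replace("\n", " ")
    # Drop leading heading/blockquote markers.
    text = re.sub(r"^[#>\s]+", "", text)
    return text[:max_len].strip()
```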


Minor Observations

  • The DiscoveryConfig in apps.py doesn't override ready(), which is correct since there are no signal handlers in this app.
  • The [:1000] document limit in sitemap_xml is a reasonable safeguard.
  • The sitemap builds the full XML tree in memory before returning — for large installations this is fine at 1000 docs, but a streaming approach would be more scalable if the cap were raised later.
  • Test classes correctly use @override_settings to disable caching, which is the right approach.
  • setUpTestData is used appropriately for shared test fixtures (avoids per-test DB setup overhead).

Summary

The critical issue to address before merging is the cache key collision for multi-hostname deployments — this would cause mismatched URLs in cached discovery responses. The MCP config coupling is worth fixing for long-term maintainability. All other findings are low-severity polish items.

Great work overall — this is a meaningful improvement over static placeholder files.


codecov Bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 96.38243% with 14 lines in your changes missing coverage. Please review.

Files with missing lines               | Patch % | Lines
opencontractserver/discovery/views.py  | 89.78%  | 14 Missing ⚠️


- Add @vary_on_headers("Host") before @cache_page on all 5 discovery
  views to prevent cache key collisions in multi-hostname deployments
- Wrap RATE_LIMIT_REQUESTS import in try/except with fallback to 100
  so discovery module does not hard-depend on mcp.config
- Remove vacuous if-guard around private corpus slug assertion in test
- Add _sanitize_markdown_title() helper that strips leading '#' chars
  from corpus titles before interpolating into Markdown headings

claude Bot commented Mar 4, 2026

PR Review: Dynamic Discovery Endpoints

Overall this is a well-structured PR with good test coverage. The architecture is clean and the feature makes practical sense. A few issues and suggestions below.


Issues

1. Decorator ordering: @cache_page + @vary_on_headers are in the wrong order

In Python, decorators apply bottom-up. The current order on all five views:

@require_GET
@vary_on_headers("Host")
@cache_page(CACHE_SECONDS)
def robots_txt(request): ...

means the execution chain is require_GET -> vary_on_headers -> cache_page -> view. Django's cache_page stores the response before vary_on_headers has a chance to add Vary: Host to it, so the cache key computation never sees the Vary header.

The correct order for Vary to influence cache key computation is @cache_page outer, @vary_on_headers inner. That said, Django's cache_page already builds its key from request.build_absolute_uri() which includes the scheme and host, so Vary-by-Host is implicitly handled already. The @vary_on_headers("Host") decorator is redundant and can be dropped entirely. Vary: Host is also not a standard HTTP response header for this purpose and may confuse CDNs.
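The bottom-up application order is easy to verify in isolation with a recording decorator (a framework-free illustration; names mirror the decorators above):

```python
def tag(label):
    """Decorator factory that records the order wrappers run in."""
    def decorator(fn):
        def wrapper(trace):
            trace.append(label)
            return fn(trace)
        return wrapper
    return decorator

@tag("require_GET")        # topmost decorator wraps last, runs first
@tag("vary_on_headers")
@tag("cache_page")         # bottom decorator wraps first, runs last
def view(trace):
    trace.append("view")
    return trace

# Call order matches the execution chain described above:
assert view([]) == ["require_GET", "vary_on_headers", "cache_page", "view"]
```

Because cache_page's wrapper is closest to the view here, it stores the response before any outer wrapper can post-process it on the way out.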

2. DRY violation in sitemap_xml — re-implements corpus visibility logic

sitemap_xml creates its own AnonymousUser() and queries Corpus.objects.visible_to_user(anonymous) directly, while _get_public_corpuses() already encapsulates exactly this logic. Consider extracting a lower-level _get_public_corpus_queryset() helper (or adding an "id" key to the _get_public_corpuses() dict) so the sitemap can reuse the shared filtering.

3. No limit on corpus count in _get_public_corpuses()

The helper fetches all public corpuses with no upper bound. With many hundreds of public corpuses this materializes into memory on every cache miss across llms_txt, llms_full_txt, and well_known_mcp. The sitemap already caps documents at 1000 — a similar guard on corpuses would be prudent (e.g., [:500] or a named constant).

The well_known_mcp view is especially affected: with N public corpuses it generates N+1 MCP server entries in a single JSON object.
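A possible guard, following the constants convention cited elsewhere in this review (the name, value, and file location are suggestions):

```python
# opencontractserver/constants/discovery.py  (suggested location)
MAX_PUBLIC_CORPUSES = 500

# opencontractserver/discovery/views.py
# Slicing the queryset adds LIMIT 500 to the SQL, so at most 500 rows
# are materialised on a cache miss:
corpuses = _get_public_corpus_queryset()[:MAX_PUBLIC_CORPUSES]
```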

4. Corpus description is rendered into Markdown without sanitization

_sanitize_markdown_title() protects against heading injection in titles, but the description field is interpolated raw in both llms_txt and llms_full_txt. A description containing a newline followed by ## Injected Section would insert a fake top-level Markdown section into the served document. Since these are public-facing machine-readable files consumed by LLMs and crawlers, descriptions should at minimum have newlines stripped when embedded inline:

desc = (c["description"] or "").replace("\n", " ").replace("\r", "")

5. CACHE_SECONDS = 300 should use the centralized constants file

Per CLAUDE.md: "No magic numbers — use constants files in opencontractserver/constants/." The local CACHE_SECONDS = 300 in views.py should be moved there and imported.


Minor Observations

Traefik rule fragility — Both traefik.yml and traefik-ci.yml now enumerate every discovery path as explicit exclusions in the frontend router rule. Adding a new endpoint requires updating both files. Worth noting as technical debt.

DiscoveryNoCorpusesTest isolation — The test assumes no public corpuses exist in the DB. This works because TestCase + setUpTestData uses savepoints for per-class isolation, and parallel testing (-n 4) gives each worker its own DB. Fine as-is.

Test gap: sitemap document URLs — Tests verify that corpus URLs appear in the sitemap but don't exercise the DocumentPath branch. A test creating a Document + DocumentPath in a public corpus and asserting the /c/{corpus_slug}/d/{doc_slug} URL is present would complete coverage of that code path.
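A sketch of such a test (model fields and fixture names are assumptions about the project's test setup):

```python
# Hypothetical addition to SitemapXmlTest; DocumentPath field names
# are inferred from the filter (is_current=True, is_deleted=False).
def test_includes_document_urls(self):
    doc = Document.objects.create(title="Public Doc", creator=self.user)
    DocumentPath.objects.create(
        corpus=self.public_corpus,
        document=doc,
        is_current=True,
        is_deleted=False,
    )
    response = self.client.get("/sitemap.xml")
    content = response.content.decode()
    self.assertIn(f"/c/{self.public_corpus.slug}/d/{doc.slug}", content)
```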


What's Good

  • New discovery app properly isolated with apps.py, urls.py, and tests/ package.
  • visible_to_user(AnonymousUser()) follows the established permission pattern throughout.
  • @require_GET correctly restricts to read-only access on all endpoints.
  • XML generation with manual UTF-8 declaration is correct — avoids the us-ascii encoding pitfall from xml_declaration=True with encoding="unicode".
  • try/except import of RATE_LIMIT_REQUESTS with a sensible fallback is good defensive coding.
  • Test coverage is thorough: content types, spec conformance, public/private corpus filtering, edge cases (no corpuses), method restrictions, and hostname resolution are all covered.
  • CHANGELOG entry is detailed and includes file paths.

…ns, DRY sitemap, add corpus limit

- Remove @vary_on_headers(Host) from all 5 discovery views since
  @cache_page already varies on the full request URI including host
- Sanitize corpus descriptions in llms.txt and llms-full.txt by
  stripping newlines to prevent markdown structure injection
- Extract _get_public_corpus_queryset() so sitemap_xml reuses the
  shared query instead of duplicating AnonymousUser + visible_to_user
- Cap public corpus queries at MAX_PUBLIC_CORPUSES (500) to prevent
  unbounded memory usage
- Move cache duration constant to opencontractserver/constants/discovery.py

claude Bot commented Mar 5, 2026

PR Review: Dynamic Discovery Endpoints

This is a well-structured migration. The architecture is clean, the permission model correctly uses visible_to_user(AnonymousUser()), caching is sensibly applied, and the test coverage is thorough. A few things worth addressing:


Bug / Correctness

_sanitize_markdown_title does not handle embedded newlines (opencontractserver/discovery/views.py ~line 921)

The description field gets newlines replaced (lines 1085, 1370), but _sanitize_markdown_title only strips leading #. A corpus title with an embedded newline followed by a Markdown header would produce a spurious section in llms.txt / llms-full.txt. Since corpus titles come from user input, the function should also strip newline characters:

def _sanitize_markdown_title(title: str) -> str:
    return title.lstrip("#").replace("\n", " ").replace("\r", "").strip()

CLAUDE.md Violation: Magic Number

opencontractserver/discovery/views.py line ~1450 has a bare [:1000] slice on the document query. CLAUDE.md requires all magic numbers to live in constants files, and opencontractserver/constants/discovery.py is the right home — MAX_PUBLIC_CORPUSES already lives there. The document cap should be extracted too:

MAX_SITEMAP_DOCUMENTS = 1000

Minor: Inconsistency in queryset access pattern

sitemap_xml calls _get_public_corpus_queryset() directly (needs .modified and .id), while the other three views go through _get_public_corpuses() (returns dicts). This creates two code paths that need to stay in sync when the model changes. Consider either adding a comment explaining the split, or adding modified to the dict so all views share the same helper.


Minor: sitemap.xml omits /.well-known/mcp.json

Lines 1464–1469 add /llms.txt and /llms-full.txt to the sitemap but not /.well-known/mcp.json. This is likely intentional, but a short comment would avoid future questions.


Nit: @require_GET / @cache_page decorator order

The ordering is correct (require_GET outermost so non-GET requests are rejected before hitting the cache), but it reads opposite to what a maintainer might expect on first glance. A brief inline comment would help future readers.


Tests look solid

@override_settings(CACHES=...) correctly disables caching, DiscoveryNoCorpusesTest covers empty-state edge cases, hostname resolution is explicitly verified, and setUpTestData is appropriate for read-only fixtures. Good coverage across all five endpoints.


Summary: Two items to fix before merge — the _sanitize_markdown_title newline issue (could produce malformed Markdown from user-controlled corpus titles) and the magic number 1000 in the sitemap document cap. Everything else is minor polish.

@JSv4 JSv4 merged commit 7c4b933 into main Mar 6, 2026
16 checks passed
@JSv4 JSv4 deleted the claude/review-robots-txt-llm-WuEAz branch March 6, 2026 05:43