fix(seo): reduce sitemap bloat by filtering versioned docs and low-value pages #2016
The sitemap contained ~5,200 URLs across EN and ZH, with ~70% being
redundant versioned documentation pages. This wastes crawl budget and
dilutes the indexing signal for the pages that matter.
Changes:
1. Update sitemap merge script (scripts/update-sitemap-loc.js):
- Filter out versioned doc URLs (e.g. /docs/apisix/3.14/) since
the unversioned paths (/docs/apisix/) already serve the latest
- Filter out /docs/.../next/ (unreleased dev docs)
- Filter out /search, /blog/tags/, /blog/page/ pages
- Estimated reduction: ~2,200 URLs removed from EN sitemap,
similar for ZH
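The exclusion rules above could be sketched as a small predicate like the following. This is a hypothetical illustration; the actual patterns live in `scripts/update-sitemap-loc.js` and may differ in detail:

```javascript
// Hypothetical sketch of the URL exclusion predicate used during the
// post-build sitemap merge. Patterns mirror the rules described above.
const EXCLUDE_PATTERNS = [
  /\/docs\/[^\/]+\/\d+\.\d+(?:\.\d+)?\//, // versioned docs, e.g. /docs/apisix/3.14/
  /\/docs\/[^\/]+\/next\//,               // unreleased dev docs
  /\/search\b/,                           // site search page
  /\/blog\/tags\//,                       // blog tag aggregator pages
  /\/blog\/page\//,                       // blog pagination pages
];

function shouldExclude(loc) {
  return EXCLUDE_PATTERNS.some((re) => re.test(loc));
}
```

Unversioned paths such as `/docs/apisix/getting-started/` contain no `<version>/` or `next/` segment, so they fall through every pattern and stay in the sitemap.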
2. Update robots.txt (website/static/robots.txt):
- Add Disallow rules for all versioned doc paths and /next/
- Add Disallow for /search, /blog/tags/, /blog/page/
- Ensures robots.txt and sitemap send consistent signals to crawlers
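For illustration, the added rules might look like the excerpt below. The specific project/version paths shown are assumptions; the real rules are in `website/static/robots.txt` and cover all versioned paths for both locales:

```
# Hypothetical excerpt of the added Disallow rules
User-agent: *
Disallow: /docs/apisix/3.14/
Disallow: /docs/apisix/next/
Disallow: /search
Disallow: /blog/tags/
Disallow: /blog/page/
Disallow: /zh/docs/apisix/3.14/
Disallow: /zh/docs/apisix/next/
```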
There was a problem hiding this comment.
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Reduces sitemap size and duplicate/low-value indexing by excluding versioned/dev docs and aggregator pages, and aligning crawler signals via robots.txt.
Changes:
- Add post-merge sitemap URL filtering to drop versioned docs, next docs, search, blog tags, and blog pagination.
- Update robots.txt to disallow the same low-value/versioned paths for both EN and ZH locales.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/update-sitemap-loc.js | Adds exclusion rules and filters merged sitemap entries to reduce bloat/duplicates. |
| website/static/robots.txt | Adds disallow rules to keep robots directives consistent with the filtered sitemap. |
```js
const before = urls.length;
sitemap.urlset.url = urls.filter((entry) => {
  const loc = entry.loc && entry.loc._text;
  return !loc || !shouldExclude(loc);
```
This currently keeps entries that don't have a valid `loc` (`return !loc || ...`). In a sitemap, entries without `loc` are invalid and should be removed to avoid generating a malformed sitemap. Consider changing the predicate to require `loc` and then apply the exclude filter (i.e., drop entries without `loc`).
```diff
- return !loc || !shouldExclude(loc);
+ return Boolean(loc) && !shouldExclude(loc);
```
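A minimal sketch of the behavioral difference, using a stand-in `shouldExclude` and the entry shape from the quoted snippet (all names here are illustrative, not the real script):

```javascript
// Illustrates the review point: with `!loc || ...` an entry missing <loc>
// survives the filter; with `Boolean(loc) && ...` it is dropped.
const shouldExclude = (loc) => loc.includes('/next/'); // stand-in predicate

const urls = [
  { loc: { _text: 'https://example.org/docs/apisix/next/' } }, // excluded path
  { loc: { _text: 'https://example.org/docs/apisix/' } },      // valid, kept
  {},                                                          // malformed: no <loc>
];

// Original predicate: keeps the malformed entry.
const keepOriginal = urls.filter((entry) => {
  const loc = entry.loc && entry.loc._text;
  return !loc || !shouldExclude(loc);
});

// Suggested predicate: drops entries without a loc before filtering.
const keepFixed = urls.filter((entry) => {
  const loc = entry.loc && entry.loc._text;
  return Boolean(loc) && !shouldExclude(loc);
});
```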
Summary
Reduce the sitemap from ~5,200 URLs to ~2,700 by filtering out redundant versioned documentation pages, development docs, and low-value pages. Update robots.txt to match.
Problem
The sitemap includes every versioned doc page across 7 projects x 6 versions (3.10-3.15), plus next. For example, `/docs/apisix/getting-started/` (latest) and `/docs/apisix/3.14/getting-started/` (old version) both appear. This wastes crawl budget and causes duplicate content confusion. Additionally, `/search`, `/blog/tags/`, and `/blog/page/` were being included in the sitemap despite being low-value pages.
Changes
1. Sitemap merge script (`scripts/update-sitemap-loc.js`)
Added URL filtering during the post-build sitemap merge. Excludes:
- `/docs/<project>/<version>/` - versioned doc pages
- `/docs/<project>/next/` - unreleased dev docs
- `/search`, `/blog/tags/`, `/blog/page/`
Unversioned latest doc paths (e.g. `/docs/apisix/getting-started/`) are kept.
2. robots.txt (`website/static/robots.txt`)
Added Disallow rules for all versioned doc paths, next docs, search, blog tags, and blog pagination across both locales. Ensures robots.txt and sitemap send consistent signals.
Expected result