fix(seo): reduce sitemap bloat by filtering versioned docs and low-value pages#2016

Merged
Yilialinn merged 1 commit into apache:master from moonming:fix/sitemap-cleanup on Apr 13, 2026

Conversation

@moonming (Member) commented Apr 9, 2026

Summary

Reduce the sitemap from ~5,200 URLs to ~2,700 by filtering out redundant versioned documentation pages, development docs, and low-value pages. Update robots.txt to match.

Problem

The sitemap includes every versioned doc page across 7 projects × 6 versions (3.10–3.15), plus next. For example, /docs/apisix/getting-started/ (latest) and /docs/apisix/3.14/getting-started/ (an old version) both appear. This wastes crawl budget and causes duplicate-content confusion for crawlers.

Additionally, /search, /blog/tags/, and /blog/page/ were being included in the sitemap despite being low-value pages.

Changes

1. Sitemap merge script (scripts/update-sitemap-loc.js)

Added URL filtering during post-build sitemap merge. Excludes:

  • /docs/<project>/<version>/ - versioned doc pages
  • /docs/<project>/next/ - unreleased dev docs
  • /search, /blog/tags/, /blog/page/

Unversioned latest doc paths (e.g. /docs/apisix/getting-started/) are kept.
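As a sketch, the exclusion logic described above could be written like this. The `shouldExclude` name, the regexes, and the pattern list are illustrative assumptions, not the exact code in this PR:

```javascript
// Hypothetical sketch of the filtering added to scripts/update-sitemap-loc.js.
// Patterns below are assumptions derived from the PR description.
const EXCLUDE_PATTERNS = [
  /\/docs\/[^/]+\/\d+\.\d+(\.\d+)?\//, // versioned docs, e.g. /docs/apisix/3.14/
  /\/docs\/[^/]+\/next\//,             // unreleased dev docs
  /\/search\b/,                        // site search page
  /\/blog\/tags\//,                    // blog tag aggregator pages
  /\/blog\/page\//,                    // blog pagination pages
];

function shouldExclude(loc) {
  // A URL is dropped from the merged sitemap if any pattern matches it.
  return EXCLUDE_PATTERNS.some((re) => re.test(loc));
}
```

Unversioned latest paths contain no version segment, so none of the patterns match them and they pass through unchanged.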

2. robots.txt (website/static/robots.txt)

Added Disallow rules for all versioned doc paths, next docs, search, blog tags, and blog pagination across both locales. Ensures robots.txt and sitemap send consistent signals.
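The added rules might look roughly like the following fragment. The specific project names, version list, and /zh/ locale prefix are assumptions based on the description above, not the literal file contents:

```
User-agent: *
# Versioned docs (repeated for each project and version)
Disallow: /docs/apisix/3.10/
Disallow: /docs/apisix/3.14/
Disallow: /docs/apisix/next/
# Low-value pages, both locales
Disallow: /search
Disallow: /blog/tags/
Disallow: /blog/page/
Disallow: /zh/search
Disallow: /zh/blog/tags/
Disallow: /zh/blog/page/
```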

Expected result

  • EN sitemap: ~2,638 -> ~1,360 URLs (~48% reduction)
  • ZH sitemap: ~2,620 -> ~1,340 URLs (~49% reduction)
  • Remaining URLs are high-value: latest docs, blog posts, main pages

fix(seo): reduce sitemap bloat by filtering versioned docs and low-value pages

The sitemap contained ~5,200 URLs across EN and ZH, with ~70% being
redundant versioned documentation pages. This wastes crawl budget and
dilutes the indexing signal for the pages that matter.

Changes:

1. Update sitemap merge script (scripts/update-sitemap-loc.js):
   - Filter out versioned doc URLs (e.g. /docs/apisix/3.14/) since
     the unversioned paths (/docs/apisix/) already serve the latest
   - Filter out /docs/.../next/ (unreleased dev docs)
   - Filter out /search, /blog/tags/, /blog/page/ pages
   - Estimated reduction: ~2,200 URLs removed from EN sitemap,
     similar for ZH

2. Update robots.txt (website/static/robots.txt):
   - Add Disallow rules for all versioned doc paths and /next/
   - Add Disallow for /search, /blog/tags/, /blog/page/
   - Ensures robots.txt and sitemap send consistent signals to crawlers
Copilot AI review requested due to automatic review settings April 9, 2026 16:45
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Reduces sitemap size and duplicate/low-value indexing by excluding versioned/dev docs and aggregator pages, and aligning crawler signals via robots.txt.

Changes:

  • Add post-merge sitemap URL filtering to drop versioned docs, next docs, search, blog tags, and blog pagination.
  • Update robots.txt to disallow the same low-value/versioned paths for both EN and ZH locales.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • scripts/update-sitemap-loc.js: adds exclusion rules and filters merged sitemap entries to reduce bloat and duplicates.
  • website/static/robots.txt: adds Disallow rules to keep robots directives consistent with the filtered sitemap.


```js
const before = urls.length;
sitemap.urlset.url = urls.filter((entry) => {
  const loc = entry.loc && entry.loc._text;
  return !loc || !shouldExclude(loc);
});
```

Copilot AI Apr 9, 2026


This currently keeps entries that don’t have a valid loc (return !loc || ...). In a sitemap, entries without loc are invalid and should be removed to avoid generating a malformed sitemap. Consider changing the predicate to require loc and then apply the exclude filter (i.e., drop entries without loc).

Suggested change:

```js
- return !loc || !shouldExclude(loc);
+ return Boolean(loc) && !shouldExclude(loc);
```
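A minimal sketch of the corrected filter, assuming the xml-js entry shape shown above (`entry.loc._text`); the sample URLs and the simplified `shouldExclude` are illustrative assumptions:

```javascript
// Stand-in for the real exclusion rules; only versioned docs here.
function shouldExclude(loc) {
  return /\/docs\/[^/]+\/\d+\.\d+\//.test(loc);
}

// Hypothetical entries mimicking the xml-js structure the script works with.
const urls = [
  { loc: { _text: "https://apisix.apache.org/docs/apisix/getting-started/" } },
  { loc: { _text: "https://apisix.apache.org/docs/apisix/3.14/getting-started/" } },
  { lastmod: { _text: "2026-04-09" } }, // malformed entry: no <loc> at all
];

// Corrected predicate: require a loc, then apply the exclusion filter,
// so malformed entries are dropped instead of emitted into the sitemap.
const kept = urls.filter((entry) => {
  const loc = entry.loc && entry.loc._text;
  return Boolean(loc) && !shouldExclude(loc);
});
```

With the original `!loc || ...` predicate, the third (loc-less) entry would have survived into the merged sitemap; with `Boolean(loc) && ...`, only the unversioned latest-docs URL remains.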

@Yilialinn Yilialinn merged commit c667ef6 into apache:master Apr 13, 2026
7 checks passed