Skip to content

Fix redirect target verification in AsyncUrlSeeder and enhance tests#1622

Closed
Ahmed-Tawfik94 wants to merge 1 commit intodevelopfrom
fix-async-url-seeder-redirect-verification
Closed

Fix redirect target verification in AsyncUrlSeeder and enhance tests#1622
Ahmed-Tawfik94 wants to merge 1 commit intodevelopfrom
fix-async-url-seeder-redirect-verification

Conversation

@Ahmed-Tawfik94
Copy link
Copy Markdown
Contributor

Summary

Fixed a critical bug in AsyncUrlSeeder where _resolve_head() was incorrectly returning redirect targets without verifying they were alive. This could cause dead URLs to be treated as valid during URL discovery.
#1603
Key Changes:

  • Added verify_redirect_targets parameter to AsyncUrlSeeder.__init__() (default: True)
  • Modified _resolve_head() to conditionally verify redirect targets based on the parameter
  • Enhanced documentation with parameter descriptions
  • Added comprehensive test suite for regression testing

Backward Compatibility: Existing code continues to work with improved behavior by default. Users can set verify_redirect_targets=False to restore the previous behavior if needed.

List of files changed and why

  • crawl4ai/async_url_seeder.py - Core bug fix and parameter addition
  • tests/test_async_url_seeder.py - Added unit tests for both verification modes
  • test_scripts/test_async_url_seeder_fixes.py - Comprehensive demo/test suite for all fixes
  • test_scripts/README.md - Documentation for test scripts

How Has This Been Tested?

Created comprehensive test suite covering:

  • ✅ Dead redirect targets are filtered out (default behavior)
  • ✅ Legacy behavior preserved when verify_redirect_targets=False
  • ✅ Valid redirects still work correctly
  • ✅ Direct URL validation unchanged
  • ✅ Context manager functionality preserved
  • ✅ All existing AsyncUrlSeeder functionality works
  • ✅ Integration with SeedingConfig.live_check parameter

Run the test suite with: python test_scripts/test_async_url_seeder_fixes.py

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

- Added `verify_redirect_targets` parameter to control redirect verification.
- Modified `_resolve_head()` to verify redirect targets based on the new parameter.
- Implemented tests for both verification modes, ensuring dead redirects are filtered out and legacy behavior is preserved.
@ntohidi
Copy link
Copy Markdown
Collaborator

ntohidi commented Nov 27, 2025

@Ahmed-Tawfik94 Why didn't you implement the recommended solution you provided in the root cause message here? #1603 (comment)

@Ahmed-Tawfik94
Copy link
Copy Markdown
Contributor Author

@Ahmed-Tawfik94 Why didn't you implement the recommended solution you provided in the root cause message here? #1603 (comment)

According to my root cause analysis, _resolve_head returns redirect targets without confirming that they resolve to 2xx. Setting follow_redirects=True and only returning the final URL if it's 2xx was my suggested solution. it might cause an issue for users who are using the functionality like this and might be a breaking change.
while by usingverify_redirect_targets parameter for single-level verification with an opt-out, making it more conservative and backward-compatible.

@Jozurf
Copy link
Copy Markdown

Jozurf commented Dec 19, 2025

Hi there! I was waiting for this issue to be merged. Is there a timeline when that would happen @Ahmed-Tawfik94 @ntohidi

unclecode added a commit that referenced this pull request Mar 7, 2026
…1734, #1290, #1668)

Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
@unclecode
Copy link
Copy Markdown
Owner

Thanks @Ahmed-Tawfik94 - this was a valid bug. Redirect targets were returned without verifying they were alive. We've incorporated a simplified version of this fix (inline verification, no new constructor parameter) into develop. It will be available in the next release. We'll acknowledge your contribution.

@unclecode unclecode closed this Mar 7, 2026
unclecode added a commit that referenced this pull request Mar 7, 2026
@ntohidi ntohidi mentioned this pull request Mar 16, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants