Fix redirect target verification in AsyncUrlSeeder and enhance tests#1622
Fix redirect target verification in AsyncUrlSeeder and enhance tests#1622Ahmed-Tawfik94 wants to merge 1 commit intodevelopfrom
Conversation
- Added `verify_redirect_targets` parameter to control redirect verification. - Modified `_resolve_head()` to verify redirect targets based on the new parameter. - Implemented tests for both verification modes, ensuring dead redirects are filtered out and legacy behavior is preserved.
|
@Ahmed-Tawfik94 Why didn't you implement the recommended solution you provided in the root cause message here? #1603 (comment) |
According to my root cause analysis, |
|
Hi there! I was waiting for this issue to be merged. Is there a timeline when that would happen @Ahmed-Tawfik94 @ntohidi |
…1734, #1290, #1668) Bug fixes: - Verify redirect targets are alive before returning from URL seeder (#1622) - Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786) - Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796) Security/Docker: - Require api_token for /token endpoint when configured (#1795) - Deep-crawl streaming now mirrors Python library behavior via arun() (#1798) CI: - Bump GitHub Actions to latest versions - checkout v6, setup-python v6, build-push-action v6, setup-buildx v4, login v4 (#1734) Features: - Support type-list pipeline in JsonCssExtractionStrategy for chained extraction like ["attribute", "regex"] (#1290) - Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting for Unicode preservation in JSON output (#1668)
|
Thanks @Ahmed-Tawfik94 - this was a valid bug. Redirect targets were returned without verifying they were alive. We've incorporated a simplified version of this fix (inline verification, no new constructor parameter) into develop. It will be available in the next release. We'll acknowledge your contribution. |
Summary
Fixed a critical bug in AsyncUrlSeeder where
_resolve_head()was incorrectly returning redirect targets without verifying they were alive. This could cause dead URLs to be treated as valid during URL discovery.#1603
Key Changes:
verify_redirect_targetsparameter toAsyncUrlSeeder.__init__()(default:True)_resolve_head()to conditionally verify redirect targets based on the parameterBackward Compatibility: Existing code continues to work with improved behavior by default. Users can set
verify_redirect_targets=Falseto restore the previous behavior if needed.List of files changed and why
crawl4ai/async_url_seeder.py- Core bug fix and parameter additiontests/test_async_url_seeder.py- Added unit tests for both verification modestest_scripts/test_async_url_seeder_fixes.py- Comprehensive demo/test suite for all fixestest_scripts/README.md- Documentation for test scriptsHow Has This Been Tested?
Created comprehensive test suite covering:
verify_redirect_targets=FalseRun the test suite with:
python test_scripts/test_async_url_seeder_fixes.pyChecklist: