fix: stabilize best-first batch ordering #2004

Open
nightcityblade wants to merge 1 commit into unclecode:main from nightcityblade:fix/issue-1998
Conversation

@nightcityblade
Contributor

Summary

Fixes #1998 by making best-first batch processing deterministic even when streamed crawl results complete out of order.

List of files changed and why

  • crawl4ai/deep_crawling/bff_strategy.py - collect concurrent batch results first, then process/yield them in priority-queue order before discovering links.
  • tests/deep_crawling/test_deep_crawl_resume.py - add a regression test for out-of-order streamed batch results.
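
The reordering described for bff_strategy.py can be illustrated with a minimal, standalone sketch. It is not the code from this PR and does not use crawl4ai's API: `stream_batch`, `ordered_batch_results`, and `fake_fetch` are hypothetical names, and the batch is assumed to be a list of `(score, url)` pairs already popped from the priority queue in priority order. The idea is to drain the concurrent stream completely, then replay the results in the batch's original order, so that any later step (such as link discovery) sees pages in the same sequence on every run.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Dict, List, Tuple

# Assumption: a batch is a list of (score, url) pairs, already in the order
# in which they were popped from the priority queue.
BatchItem = Tuple[float, str]
Fetch = Callable[[str], Awaitable[str]]


async def stream_batch(
    batch: List[BatchItem], fetch: Fetch
) -> AsyncIterator[Tuple[str, str]]:
    """Yield (url, result) pairs in completion order, like a streamed crawl."""

    async def fetch_one(url: str) -> Tuple[str, str]:
        return url, await fetch(url)

    tasks = [asyncio.ensure_future(fetch_one(url)) for _, url in batch]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def ordered_batch_results(
    batch: List[BatchItem], fetch: Fetch
) -> List[Tuple[str, str]]:
    """Collect the whole batch first, then return results in priority order."""
    # Step 1: drain the stream without acting on any result yet.
    results: Dict[str, str] = {}
    async for url, result in stream_batch(batch, fetch):
        results[url] = result

    # Step 2: replay in the batch's own (priority-queue) order. Link discovery
    # would run here, once per result, so its order no longer depends on timing.
    return [(url, results[url]) for _, url in batch if url in results]


async def fake_fetch(url: str) -> str:
    """Hypothetical fetcher: the highest-priority URL finishes last."""
    delays = {"a": 0.03, "b": 0.02, "c": 0.01}
    await asyncio.sleep(delays[url])
    return f"page:{url}"


if __name__ == "__main__":
    batch = [(0.1, "a"), (0.2, "b"), (0.3, "c")]
    for url, page in asyncio.run(ordered_batch_results(batch, fake_fetch)):
        print(url, page)
```

The trade-off in this sketch is latency: nothing from a batch is handed on until the slowest fetch in that batch has finished, in exchange for an ordering that does not depend on network timing.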

How Has This Been Tested?

  • pytest tests/deep_crawling/test_deep_crawl_resume.py::TestBestFirstResume::test_stream_results_follow_priority_order_with_out_of_order_batch -q
  • pytest tests/deep_crawling/test_deep_crawl_resume.py::TestBestFirstResume -q

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Development

Successfully merging this pull request may close these issues.

[Bug]: BestFirstCrawlingStrategy._arun_best_first interleaves link discovery with concurrent batch streaming, producing non-deterministic page sets

1 participant