Releases: GrainWare/WaybackWhen
1.3 GUI
What's Changed
- Update docs, add requirements.txt by @squat9042 in #29
- New theme by @squat9042 in #30
- V1.3 GUI by @Hog185 in #31
New Contributors
- @squat9042 made their first contribution in #29
Full Changelog: V1.2...1.3
V1.2
Full Changelog: V1.1...V1.2
V1.1
Wayback When
Wayback When is a tool that crawls a website and saves its pages to the Internet Archive’s Wayback Machine. It uses a headless browser to load pages the same way a real visitor would, so it can find links that only appear after scripts run. As it crawls, it keeps track of every internal link it discovers. Before archiving anything, it checks when the page was last saved. If the page was archived recently, it skips it. If it hasn’t been saved in a while, it sends it to the Wayback Machine. The goal is to make website preservation easier, faster, and less repetitive. Instead of manually checking pages or wasting time on duplicates, Wayback When handles the crawling, the decision‑making, and the archiving for you.
Scraper
Wayback When uses a Selenium-based scraper to explore a website and collect every link it can find. Instead of looking only at the raw HTML, it loads each page in a full browser environment, just like a real visitor. This lets it discover links that only appear after scripts run, while selenium-stealth masks the telltale signs of automation that commonly trigger anti-scraping protections.
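A minimal sketch of this kind of browser-driven link collection (illustrative only; the function name and structure below are not WaybackWhen's actual API):

```python
# Sketch of a Selenium-based link scraper; names here are illustrative,
# not WaybackWhen's actual API.
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_links(url: str) -> set[str]:
    """Load a page in a real headless browser and collect same-domain links."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # scripts run here, so JS-injected links become visible
        links = set()
        for anchor in driver.find_elements(By.TAG_NAME, "a"):
            href = anchor.get_attribute("href")
            if href and urlparse(href).netloc == urlparse(url).netloc:
                links.add(urljoin(url, href))
        return links
    finally:
        driver.quit()
```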
Archiver
The archiver decides which pages actually need to be saved. For every link the scraper finds, it checks the Wayback Machine to see when the page was last archived. If the snapshot is recent, it skips it. If it’s old or missing, it sends a new save request. It also handles rate limits and retries so the process can run for long periods without manual supervision.
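The same skip-or-save decision can be sketched against the Internet Archive's public availability and save endpoints. WaybackWhen wraps this logic in its own Archiver class, so treat the structure below as an assumption; only the 90-day cooldown comes from these notes:

```python
# Sketch of the skip-or-save decision using the Internet Archive's public
# availability API and Save Page Now endpoint; illustrative, not the
# project's actual Archiver code.
from datetime import datetime, timedelta

import requests

COOLDOWN = timedelta(days=90)  # matches the archiving_cooldown default below

def should_archive(url: str) -> bool:
    """Return True when the newest snapshot is missing or older than the cooldown."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=30)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest:
        return True  # never archived
    taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return datetime.utcnow() - taken > COOLDOWN

def archive(url: str) -> None:
    """Ask the Wayback Machine to capture the page."""
    requests.get(f"https://web.archive.org/save/{url}", timeout=120)
```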
V1.1 Release
New Additions and Enhancements in V1.1
New Imports and Reorganization
Reworked import grouping into clear sections: Selenium, visualization, Jupyter helpers.
Consolidated collections imports to include deque alongside OrderedDict.
Refactored Architecture
Introduced classes: WebDriverManager, Crawler, and Archiver to encapsulate driver lifecycle, crawling, and archiving responsibilities.
Replaced many procedural globals and helper wrappers with class methods for improved lifecycle management and testability.
New Exceptions
ConnectionRefusedForCrawlerError: Raised to abort crawling a branch when the browser reports a connection-refused error.
CaptchaDetectedError retained and clarified as a dedicated CAPTCHA signal.
Updated SETTINGS Dictionary
Changed defaults and added new keys:
archiving_cooldown increased to 90 days.
max_crawler_workers default set to 10 (0 still supported as unlimited).
retries default set to 3.
New keys: min_link_search_delay, max_link_search_delay, safety_switch, proxies, max_archiving_queue_size, allow_external_links, archive_timeout_seconds.
max_archiver_workers retained and clarified (0 = unlimited).
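Taken together, the dictionary might look roughly like this; only the keys and the defaults named above are confirmed, every other value is an assumption for illustration:

```python
# Approximate shape of SETTINGS after V1.1. Keys come from the notes above;
# values not stated there are marked as assumed.
SETTINGS = {
    "archiving_cooldown": 90,          # days; raised from the old default
    "max_crawler_workers": 10,         # 0 = unlimited (still supported)
    "max_archiver_workers": 0,         # 0 = unlimited
    "retries": 3,
    "min_link_search_delay": 1,        # assumed value
    "max_link_search_delay": 5,        # assumed value
    "safety_switch": True,             # assumed value
    "proxies": [],                     # assumed value
    "max_archiving_queue_size": 100,   # assumed value
    "allow_external_links": False,     # assumed value
    "archive_timeout_seconds": 60,     # assumed value
    "debug_mode": True,                # debug enabled by default (see Behavioral Changes)
}
```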
Requests-first Fast Path
Added _try_requests_first() to attempt a lightweight requests + BeautifulSoup crawl before falling back to Selenium, improving speed and reducing resource usage for simple pages.
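A sketch of what such a fast path can look like, assuming requests and BeautifulSoup; the function name follows these notes, but the body is illustrative:

```python
# Sketch of a requests-first fast path: try a plain HTTP fetch and static
# parse, and return None so the caller can fall back to Selenium.
import requests
from bs4 import BeautifulSoup

def _try_requests_first(url: str) -> list[str] | None:
    """Return links from a static fetch, or None to trigger the Selenium path."""
    try:
        resp = requests.get(url, timeout=15)
        resp.raise_for_status()
    except requests.RequestException:
        return None  # unreachable or blocked: let Selenium try
    if "text/html" not in resp.headers.get("Content-Type", ""):
        return None  # not a page worth parsing statically
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```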
Improved WebDriver Management
WebDriverManager.create_driver() centralizes driver creation, adds proxy support, experimental prefs, implicitly_wait(10), and consistent stealth application.
WebDriverManager.destroy_driver() ensures safe driver.quit() cleanup.
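In sketch form, assuming Chrome and selenium-stealth (the class and method names follow these notes; the specific option values are assumptions):

```python
# Sketch of centralized driver lifecycle management; option values are
# illustrative, not the project's exact configuration.
from selenium import webdriver
from selenium_stealth import stealth

class WebDriverManager:
    def create_driver(self, proxy: str | None = None) -> webdriver.Chrome:
        options = webdriver.ChromeOptions()
        if proxy:
            options.add_argument(f"--proxy-server={proxy}")
        driver = webdriver.Chrome(options=options)
        driver.implicitly_wait(10)  # matches the implicitly_wait(10) note
        stealth(driver, platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine")
        return driver

    def destroy_driver(self, driver: webdriver.Chrome) -> None:
        try:
            driver.quit()  # never let a dead driver leak a browser process
        except Exception:
            pass
```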
Enhanced Crawling Logic
_get_links_from_page_content() replaces the older get_internal_links() with:
Better CAPTCHA detection (more indicators).
Connection-refused detection that raises ConnectionRefusedForCrawlerError.
Respect for SETTINGS["allow_external_links"] and is_irrelevant_link() filtering.
Optional visual relationship collection when enable_visual_tree_generation is enabled.
crawl_single_page() now tries the fast requests path first, then Selenium if needed.
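The CAPTCHA and connection-refused checks described above might look roughly like this; the exception names come from these notes, while the indicator strings are assumptions:

```python
# Sketch of page-content checks like those described above; indicator
# strings are assumed, exception names follow the release notes.
class CaptchaDetectedError(Exception):
    """Raised when a page appears to be a CAPTCHA interstitial."""

class ConnectionRefusedForCrawlerError(Exception):
    """Raised to abort crawling a branch the browser cannot reach."""

CAPTCHA_INDICATORS = ("captcha", "are you a robot", "unusual traffic")

def check_page_content(page_source: str) -> None:
    text = page_source.lower()
    if "err_connection_refused" in text:
        raise ConnectionRefusedForCrawlerError
    if any(marker in text for marker in CAPTCHA_INDICATORS):
        raise CaptchaDetectedError
```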
New Utility is_irrelevant_link
Centralized logic to filter out assets and irrelevant paths using an expanded IRRELEVANT_EXTENSIONS and IRRELEVANT_PATH_SEGMENTS list.
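A sketch of the centralized filter; the actual extension and path-segment lists in WaybackWhen are longer than shown here:

```python
# Sketch of asset/noise filtering; the real IRRELEVANT_* lists are expanded
# well beyond these examples.
from urllib.parse import urlparse

IRRELEVANT_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".svg", ".pdf", ".zip")
IRRELEVANT_PATH_SEGMENTS = ("/wp-content/", "/static/", "/assets/", "/cdn-cgi/")

def is_irrelevant_link(url: str) -> bool:
    path = urlparse(url).path.lower()
    return (path.endswith(IRRELEVANT_EXTENSIONS)
            or any(seg in path for seg in IRRELEVANT_PATH_SEGMENTS))
```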
HTTP Session Factory
get_requests_session() returns a configured requests.Session with retry strategy and optional proxy selection.
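A minimal sketch of such a factory, using urllib3's Retry with requests; the retry parameters and proxy handling below are assumptions:

```python
# Sketch of a session factory with a retry strategy and optional proxy
# selection; parameter values are illustrative.
import random

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_requests_session(proxies: list[str] | None = None) -> requests.Session:
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=1,
                  status_forcelist=(429, 500, 502, 503, 504))
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    if proxies:
        chosen = random.choice(proxies)  # one randomly chosen proxy per session
        session.proxies = {"http": chosen, "https": chosen}
    return session
```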
Archiving Improvements
Archiver.should_archive() and Archiver.process_link_for_archiving() replace the old procedural archiving functions.
Archiving now runs wb_obj.save() inside a dedicated thread and enforces archive_timeout_seconds to avoid indefinite blocking.
Reactive global cooldown: rate_limit_active_until_time is set when Wayback Machine rate limits are detected, coordinating pauses across threads.
Improved rate-limit handling and clearer failure messages ([FAILED - TIMEOUT], reactive sleeps).
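The timeout mechanism can be sketched with a worker thread that is abandoned after archive_timeout_seconds; wb_obj stands in for whatever object exposes .save(), as in the notes:

```python
# Sketch of running a blocking save in a worker thread so the caller can
# give up after a timeout instead of blocking indefinitely.
import threading

def save_with_timeout(wb_obj, timeout_seconds: float) -> bool:
    result = {}

    def _worker():
        try:
            wb_obj.save()
            result["ok"] = True
        except Exception as exc:
            result["error"] = exc

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    if thread.is_alive():
        return False  # still blocked: report [FAILED - TIMEOUT]
    return result.get("ok", False)
```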
Concurrency and Rate-limiting
Cleaner use of ThreadPoolExecutor with explicit worker limits.
Crawling now uses depth-first search (DFS) instead of breadth-first search (BFS).
Global archive_lock, last_archive_time, and rate_limit_active_until_time coordinate per-thread and global rate limiting.
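A sketch of how a global lock plus timestamps can coordinate archiving pace across threads; the variable names follow these notes, and the pacing interval is an assumption:

```python
# Sketch of lock-based rate-limit coordination across archiver threads.
import threading
import time

archive_lock = threading.Lock()
last_archive_time = 0.0
rate_limit_active_until_time = 0.0
MIN_ARCHIVE_INTERVAL = 5.0  # seconds between saves; assumed value

def wait_for_archive_slot() -> None:
    """Block until this thread may issue its next save request."""
    global last_archive_time
    with archive_lock:
        now = time.time()
        if now < rate_limit_active_until_time:   # global reactive cooldown
            time.sleep(rate_limit_active_until_time - now)
        elapsed = time.time() - last_archive_time
        if elapsed < MIN_ARCHIVE_INTERVAL:       # per-save pacing
            time.sleep(MIN_ARCHIVE_INTERVAL - elapsed)
        last_archive_time = time.time()
```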
Logging and Typing
log_message(level, message, debug_only=False) retained and used consistently across modules.
Several functions now include type hints for clarity and maintainability.
Visualization Integration
networkx and matplotlib.pyplot remain available for visual tree generation; relationships are now collected in a structured way by the crawler class for later plotting.
Notable Changes and Fixes
URL Normalization and Filtering
normalize_url() rewritten to normalize paths, remove duplicate slashes, strip index pages, lowercase paths, and produce a sorted query string.
is_irrelevant_link() now aggressively filters many asset types and common CMS/static path segments to reduce noise.
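The normalization steps can be sketched with the standard library; the real implementation may differ in detail:

```python
# Sketch of URL normalization along the lines described above.
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path).lower()   # collapse duplicate slashes
    path = re.sub(r"/index\.(html?|php)$", "/", path)  # strip index pages
    query = urlencode(sorted(parse_qsl(parts.query)))  # sorted query string
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))
```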
Behavioral Changes
Default behavior is more conservative (longer archiving cooldown, debug enabled, limited crawler workers). Update SETTINGS to restore previous aggressive defaults if desired.
The crawler no longer requires discovered links to be strict sub-paths of the base URL; allow_external_links controls whether external domains are permitted.
Better Link Handling and Robustness Fixes
Fixed potential indefinite blocking on wb_obj.save() by adding a timeout and threaded execution.
Improved handling of WebDriver connection errors to avoid endless retries on unreachable branches.
Added proxy support for both requests sessions and Selenium driver options.
Cleaner Terminal Output
Removed and Deprecated
Removed or Replaced
Procedural orchestration functions such as crawl_website and wrapper_get_internal_links were replaced by class-based equivalents and crawl_single_page.
The previous pattern of long-lived, thread-local global drivers has largely given way to explicit create/destroy calls per crawl where appropriate.
Deprecated
Relying on 0 to mean "unlimited" is still supported but discouraged; explicit numeric limits are recommended for production runs.
V1
V1 Release
New Additions and Enhancements in V1
- New Imports:
  - random: For generating random user-agents and sleep times.
  - selenium-stealth: To evade bot detection while using Selenium.
  - networkx and matplotlib.pyplot: For generating and visualizing the link tree.
- Updated SETTINGS Dictionary:
  - Renamed 'archiving_retries' to a more general 'retries'.
  - Added 'debug_mode': To toggle detailed logging.
  - Added 'max_archiver_workers': To control concurrency for archiving tasks.
  - Added 'enable_visual_tree_generation': To enable/disable the visual link tree output.
- New Helper Functions:
  - log_message(level, message, debug_only=False): Standardized logging function.
  - get_root_domain(netloc): To extract the root domain from a URL's network location.
  - generate_random_user_agent(): To create varied user-agents for requests.
- Enhanced get_driver() Function:
  - Now applies selenium-stealth with randomized platforms, webgl vendors, and renderers for better bot evasion.
- Improved get_internal_links() Function:
  - Uses generate_random_user_agent() for each request.
  - Includes a crucial check, clean_url.startswith(normalized_base_url_for_comparison), to ensure discovered links are sub-paths of the base URL, preventing navigation to parent directories.
  - Utilizes the get_root_domain() function for more accurate domain comparison.
  - Incorporates log_message() for structured logging.
  - Removed older explicit filtering for IRRELEVANT_EXTENSIONS and IRRELEVANT_PATH_SEGMENTS, as the sub-path check makes it less necessary.
  - CAPTCHA detection is more robust with captcha_prompt_lock and a custom CaptchaDetectedError (though currently handled by automated wait and retry).
  - Retry delays are now randomized (time.sleep(random.uniform(5, 15))).
- Modified should_archive() Function:
  - Uses generate_random_user_agent() for Wayback Machine requests.
  - Uses log_message() for clear output.
  - Refers to the general 'retries' setting.
- Modified process_link_for_archiving() Function:
  - Uses log_message() for consistent output.
  - Refers to the general 'retries' setting.
  - Rate limit warnings use log_message().
- Refactored crawl_website() Function:
  - Now accepts archiver_executor, archiving_futures, global_archive_action, and link_relationships to support parallel archiving and visual graph data collection.
  - Uses wrapper_get_internal_links() to manage thread-local WebDriver instances.
  - Submits links for archiving immediately upon discovery if an archiver_executor is provided.
  - Utilizes log_message() for all output.
- Major Overhaul of main() Function:
  - Introduced all_link_relationships, archiving_futures, and crawling_futures to manage concurrent operations and graph data.
  - Implemented concurrent.futures.ThreadPoolExecutor for both crawling and archiving, allowing parallel execution.
  - Revised archiving action logic to print consolidated messages based on the global choice.
  - New Visual Tree Generation Logic (see the sketch after this list):
    - Generates a directed graph (nx.DiGraph()) using networkx.
    - Adds nodes and edges based on all_link_relationships.
    - Positions the mother_url (first initial URL) at the center and makes it visually distinct (larger node, red color).
    - Dynamic and Truncated Node Labels: Labels are now derived from the domain and path, and are smartly truncated to a max_label_length (e.g., 25 characters) to prevent overlap, prioritizing the domain and a portion of the path.
    - The font size of labels is dynamically adjusted (font_size = max(2, min(6, 400 // (len(G.nodes()) + len(initial_urls))))) to scale based on the number of nodes and initial URLs, minimizing overlap.
    - Increases the figure size (figsize=(16, 12)) for better readability of the graph.
    - Saves the plot as 'website_link_tree.png'.
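A sketch of the visual tree generation described above; the figure size, font-size formula, and truncation limit come from these notes, while the layout and styling details are assumptions:

```python
# Sketch of the link-tree plot, assuming link_relationships is a list of
# (parent, child) URL pairs and initial_urls is the list of starting URLs.
import matplotlib.pyplot as plt
import networkx as nx

def plot_link_tree(link_relationships, initial_urls):
    mother_url = initial_urls[0]  # first initial URL, per the notes
    G = nx.DiGraph()
    G.add_edges_from(link_relationships)
    pos = nx.spring_layout(G, seed=42)
    pos[mother_url] = (0.0, 0.0)  # pin the mother URL at the center
    sizes = [600 if n == mother_url else 100 for n in G.nodes()]
    colors = ["red" if n == mother_url else "skyblue" for n in G.nodes()]
    labels = {n: (n if len(n) <= 25 else n[:25] + "…") for n in G.nodes()}
    # Font size scales down as the graph grows, as in the notes.
    font_size = max(2, min(6, 400 // (len(G.nodes()) + len(initial_urls))))
    plt.figure(figsize=(16, 12))
    nx.draw(G, pos, node_size=sizes, node_color=colors, arrows=True,
            labels=labels, font_size=font_size)
    plt.savefig("website_link_tree.png")
```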