Releases: GrainWare/WaybackWhen
1.3 GUI
What's Changed
- Update docs, add requirements.txt by @squat9042 in #29
- New theme by @squat9042 in #30
- V1.3 GUI by @Hog185 in #31
New Contributors
- @squat9042 made their first contribution in #29
Full Changelog: V1.2...1.3
V1.2
Full Changelog: V1.1...V1.2
V1.1
Wayback When
Wayback When is a tool that crawls a website and saves its pages to the Internet Archive’s Wayback Machine. It uses a headless browser to load pages the same way a real visitor would, so it can find links that only appear after scripts run. As it crawls, it keeps track of every internal link it discovers. Before archiving anything, it checks when the page was last saved. If the page was archived recently, it skips it. If it hasn’t been saved in a while, it sends it to the Wayback Machine. The goal is to make website preservation easier, faster, and less repetitive. Instead of manually checking pages or wasting time on duplicates, Wayback When handles the crawling, the decision‑making, and the archiving for you.
Scraper
Wayback When uses a Selenium-based scraper to explore a website and collect every link it can find. Instead of looking only at the raw HTML, it loads each page in a full browser environment, just like a real visitor. This lets it discover links that only appear after scripts run, while selenium-stealth masks the telltale signs of automation that commonly trigger anti-scraping protections.
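A minimal sketch of this kind of browser-driven link collection (illustrative only; the function name and structure below are not WaybackWhen's actual API):

```python
# Sketch of a Selenium-based link scraper; names here are illustrative,
# not WaybackWhen's actual API.
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_links(url: str) -> set[str]:
    """Load a page in a real headless browser and collect same-domain links."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # scripts run here, so JS-injected links become visible
        links = set()
        for anchor in driver.find_elements(By.TAG_NAME, "a"):
            href = anchor.get_attribute("href")
            if href and urlparse(href).netloc == urlparse(url).netloc:
                links.add(urljoin(url, href))
        return links
    finally:
        driver.quit()
```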
Archiver
The archiver decides which pages actually need to be saved. For every link the scraper finds, it checks the Wayback Machine to see when the page was last archived. If the snapshot is recent, it skips it. If it’s old or missing, it sends a new save request. It also handles rate limits and retries so the process can run for long periods without manual supervision.
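The same skip-or-save decision can be sketched against the Internet Archive's public availability and save endpoints. WaybackWhen wraps this logic in its own Archiver class, so treat the structure below as an assumption; only the 90-day cooldown comes from these notes:

```python
# Sketch of the skip-or-save decision using the Internet Archive's public
# availability API and Save Page Now endpoint; illustrative, not the
# project's actual Archiver code.
from datetime import datetime, timedelta

import requests

COOLDOWN = timedelta(days=90)  # matches the archiving_cooldown default below

def should_archive(url: str) -> bool:
    """Return True when the newest snapshot is missing or older than the cooldown."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=30)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest:
        return True  # never archived
    taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return datetime.utcnow() - taken > COOLDOWN

def archive(url: str) -> None:
    """Ask the Wayback Machine to capture the page."""
    requests.get(f"https://web.archive.org/save/{url}", timeout=120)
```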
V1.1 Release
New Additions and Enhancements in V1.1
New Imports and Reorganization
Reworked import grouping into clear sections: Selenium, visualization, Jupyter helpers.
Consolidated collections imports to include deque alongside OrderedDict.
Refactored Architecture
Introduced classes: WebDriverManager, Crawler, and Archiver to encapsulate driver lifecycle, crawling, and archiving responsibilities.
Replaced many procedural globals and helper wrappers with class methods for improved lifecycle management and testability.
New Exceptions
ConnectionRefusedForCrawlerError: Raised to abort crawling a branch when the browser reports a connection-refused error.
CaptchaDetectedError retained and clarified as a dedicated CAPTCHA signal.
Updated SETTINGS Dictionary
Changed defaults and added new keys:
archiving_cooldown increased to 90 days.
max_crawler_workers default set to 10 (0 still supported as unlimited).
retries default set to 3.
New keys: min_link_search_delay, max_link_search_delay, safety_switch, proxies, max_archiving_queue_size, allow_external_links, archive_timeout_seconds.
max_archiver_workers retained and clarified (0 = unlimited).
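Taken together, the dictionary might look roughly like this; only the keys and the defaults named above are confirmed, every other value is an assumption for illustration:

```python
# Approximate shape of SETTINGS after V1.1. Keys come from the notes above;
# values not stated there are marked as assumed.
SETTINGS = {
    "archiving_cooldown": 90,          # days; raised from the old default
    "max_crawler_workers": 10,         # 0 = unlimited (still supported)
    "max_archiver_workers": 0,         # 0 = unlimited
    "retries": 3,
    "min_link_search_delay": 1,        # assumed value
    "max_link_search_delay": 5,        # assumed value
    "safety_switch": True,             # assumed value
    "proxies": [],                     # assumed value
    "max_archiving_queue_size": 100,   # assumed value
    "allow_external_links": False,     # assumed value
    "archive_timeout_seconds": 60,     # assumed value
    "debug_mode": True,                # debug enabled by default (see Behavioral Changes)
}
```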
Requests-first Fast Path
Added _try_requests_first() to attempt a lightweight requests + BeautifulSoup crawl before falling back to Selenium, improving speed and reducing resource usage for simple pages.
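A sketch of what such a fast path can look like, assuming requests and BeautifulSoup; the function name follows these notes, but the body is illustrative:

```python
# Sketch of a requests-first fast path: try a plain HTTP fetch and static
# parse, and return None so the caller can fall back to Selenium.
import requests
from bs4 import BeautifulSoup

def _try_requests_first(url: str) -> list[str] | None:
    """Return links from a static fetch, or None to trigger the Selenium path."""
    try:
        resp = requests.get(url, timeout=15)
        resp.raise_for_status()
    except requests.RequestException:
        return None  # unreachable or blocked: let Selenium try
    if "text/html" not in resp.headers.get("Content-Type", ""):
        return None  # not a page worth parsing statically
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```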
Improved WebDriver Management
WebDriverManager.create_driver() centralizes driver creation, adds proxy support, experimental prefs, implicitly_wait(10), and consistent stealth application.
WebDriverManager.destroy_driver() ensures safe driver.quit() cleanup.
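In sketch form, assuming Chrome and selenium-stealth (the class and method names follow these notes; the specific option values are assumptions):

```python
# Sketch of centralized driver lifecycle management; option values are
# illustrative, not the project's exact configuration.
from selenium import webdriver
from selenium_stealth import stealth

class WebDriverManager:
    def create_driver(self, proxy: str | None = None) -> webdriver.Chrome:
        options = webdriver.ChromeOptions()
        if proxy:
            options.add_argument(f"--proxy-server={proxy}")
        driver = webdriver.Chrome(options=options)
        driver.implicitly_wait(10)  # matches the implicitly_wait(10) note
        stealth(driver, platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine")
        return driver

    def destroy_driver(self, driver: webdriver.Chrome) -> None:
        try:
            driver.quit()  # never let a dead driver leak a browser process
        except Exception:
            pass
```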
Enhanced Crawling Logic
_get_links_from_page_content() replaces the older get_internal_links() with:
Better CAPTCHA detection (more indicators).
Connection-refused detection that raises ConnectionRefusedForCrawlerError.
Respect for SETTINGS["allow_external_links"] and is_irrelevant_link() filtering.
Optional visual relationship collection when enable_visual_tree_generation is enabled.
crawl_single_page() now tries the fast requests path first, then Selenium if needed.
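The CAPTCHA and connection-refused checks described above might look roughly like this; the exception names come from these notes, while the indicator strings are assumptions:

```python
# Sketch of page-content checks like those described above; indicator
# strings are assumed, exception names follow the release notes.
class CaptchaDetectedError(Exception):
    """Raised when a page appears to be a CAPTCHA interstitial."""

class ConnectionRefusedForCrawlerError(Exception):
    """Raised to abort crawling a branch the browser cannot reach."""

CAPTCHA_INDICATORS = ("captcha", "are you a robot", "unusual traffic")

def check_page_content(page_source: str) -> None:
    text = page_source.lower()
    if "err_connection_refused" in text:
        raise ConnectionRefusedForCrawlerError
    if any(marker in text for marker in CAPTCHA_INDICATORS):
        raise CaptchaDetectedError
```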
New Utility is_irrelevant_link
Centralized logic to filter out assets and irrelevant paths using an expanded IRRELEVANT_EXTENSIONS and IRRELEVANT_PATH_SEGMENTS list.
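A sketch of the centralized filter; the actual extension and path-segment lists in WaybackWhen are longer than shown here:

```python
# Sketch of asset/noise filtering; the real IRRELEVANT_* lists are expanded
# well beyond these examples.
from urllib.parse import urlparse

IRRELEVANT_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".svg", ".pdf", ".zip")
IRRELEVANT_PATH_SEGMENTS = ("/wp-content/", "/static/", "/assets/", "/cdn-cgi/")

def is_irrelevant_link(url: str) -> bool:
    path = urlparse(url).path.lower()
    return (path.endswith(IRRELEVANT_EXTENSIONS)
            or any(seg in path for seg in IRRELEVANT_PATH_SEGMENTS))
```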
HTTP Session Factory
get_requests_session() returns a configured requests.Session with retry strategy and optional proxy selection.
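A minimal sketch of such a factory, using urllib3's Retry with requests; the retry parameters and proxy handling below are assumptions:

```python
# Sketch of a session factory with a retry strategy and optional proxy
# selection; parameter values are illustrative.
import random

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_requests_session(proxies: list[str] | None = None) -> requests.Session:
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=1,
                  status_forcelist=(429, 500, 502, 503, 504))
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    if proxies:
        chosen = random.choice(proxies)  # one randomly chosen proxy per session
        session.proxies = {"http": chosen, "https": chosen}
    return session
```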
Archiving Improvements
Archiver.should_archive() and Archiver.process_link_for_archiving() replace the old procedural archiving functions.
Archiving now runs wb_obj.save() inside a dedicated thread and enforces archive_timeout_seconds to avoid indefinite blocking.
Reactive global cooldown: rate_limit_active_until_time is set when Wayback Machine rate limits are detected, coordinating pauses across threads.
Improved rate-limit handling and clearer failure messages ([FAILED - TIMEOUT], reactive sleeps).
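The timeout mechanism can be sketched with a worker thread that is abandoned after archive_timeout_seconds; wb_obj stands in for whatever object exposes .save(), as in the notes:

```python
# Sketch of running a blocking save in a worker thread so the caller can
# give up after a timeout instead of blocking indefinitely.
import threading

def save_with_timeout(wb_obj, timeout_seconds: float) -> bool:
    result = {}

    def _worker():
        try:
            wb_obj.save()
            result["ok"] = True
        except Exception as exc:
            result["error"] = exc

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    if thread.is_alive():
        return False  # still blocked: report [FAILED - TIMEOUT]
    return result.get("ok", False)
```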
Concurrency and Rate-limiting
Cleaner use of ThreadPoolExecutor with explicit worker limits.
Crawling now uses depth-first search (DFS) instead of breadth-first search (BFS).
Global archive_lock, last_archive_time, and rate_limit_active_until_time coordinate per-thread and global rate limiting.
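A sketch of how a global lock plus timestamps can coordinate archiving pace across threads; the variable names follow these notes, and the pacing interval is an assumption:

```python
# Sketch of lock-based rate-limit coordination across archiver threads.
import threading
import time

archive_lock = threading.Lock()
last_archive_time = 0.0
rate_limit_active_until_time = 0.0
MIN_ARCHIVE_INTERVAL = 5.0  # seconds between saves; assumed value

def wait_for_archive_slot() -> None:
    """Block until this thread may issue its next save request."""
    global last_archive_time
    with archive_lock:
        now = time.time()
        if now < rate_limit_active_until_time:   # global reactive cooldown
            time.sleep(rate_limit_active_until_time - now)
        elapsed = time.time() - last_archive_time
        if elapsed < MIN_ARCHIVE_INTERVAL:       # per-save pacing
            time.sleep(MIN_ARCHIVE_INTERVAL - elapsed)
        last_archive_time = time.time()
```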
Logging and Typing
log_message(level, message, debug_only=False) retained and used consistently across modules.
Several functions now include type hints for clarity and maintainability.
Visualization Integration
networkx and matplotlib.pyplot remain available for visual tree generation; relationships are now collected in a structured way by the crawler class for later plotting.
Notable Changes and Fixes
URL Normalization and Filtering
normalize_url() rewritten to normalize paths, remove duplicate slashes, strip index pages, lowercase paths, and produce a sorted query string.
is_irrelevant_link() now aggressively filters many asset types and common CMS/static path segments to reduce noise.
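The normalization steps can be sketched with the standard library; the real implementation may differ in detail:

```python
# Sketch of URL normalization along the lines described above.
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path).lower()   # collapse duplicate slashes
    path = re.sub(r"/index\.(html?|php)$", "/", path)  # strip index pages
    query = urlencode(sorted(parse_qsl(parts.query)))  # sorted query string
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))
```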
Behavioral Changes
Default behavior is more conservative (longer archiving cooldown, debug enabled, limited crawler workers). Update SETTINGS to restore previous aggressive defaults if desired.
The crawler no longer requires discovered links to be strict sub-paths of the base URL; allow_external_links controls whether external domains are permitted.
Better Link Handling and Robustness Fixes
Fixed potential indefinite blocking on wb_obj.save() by adding a timeout and threaded execution.
Improved handling of WebDriver connection errors to avoid endless retries on unreachable branches.
Added proxy support for both requests sessions and Selenium driver options.
Cleaner Terminal Output
Removed and Deprecated
Removed or Replaced
Procedural orchestration functions such as crawl_website and wrapper_get_internal_links were replaced by class-based equivalents and crawl_single_page.
The previous pattern of long-lived, thread-local global drivers has largely given way to explicit create/destroy calls per crawl where appropriate.
Deprecated
Relying on 0 to mean "unlimited" is still supported but discouraged; explicit numeric limits are recommended for production runs.
V1
V1 Release
New Additions and Enhancements in V1
- New Imports:
  - random: For generating random user-agents and sleep times.
  - selenium-stealth: To evade bot detection while using Selenium.
  - networkx and matplotlib.pyplot: For generating and visualizing the link tree.
- Updated SETTINGS Dictionary:
  - Renamed 'archiving_retries' to a more general 'retries'.
  - Added 'debug_mode': To toggle detailed logging.
  - Added 'max_archiver_workers': To control concurrency for archiving tasks.
  - Added 'enable_visual_tree_generation': To enable/disable the visual link tree output.
- New Helper Functions:
  - log_message(level, message, debug_only=False): Standardized logging function.
  - get_root_domain(netloc): To extract the root domain from a URL's network location.
  - generate_random_user_agent(): To create varied user-agents for requests.
- Enhanced get_driver() Function:
  - Now applies selenium-stealth with randomized platforms, webgl vendors, and renderers for better bot evasion.
- Improved get_internal_links() Function:
  - Uses generate_random_user_agent() for each request.
  - Includes a crucial check, clean_url.startswith(normalized_base_url_for_comparison), to ensure discovered links are sub-paths of the base URL, preventing navigation to parent directories.
  - Utilizes the get_root_domain() function for more accurate domain comparison.
  - Incorporates log_message() for structured logging.
  - Removed older explicit filtering for IRRELEVANT_EXTENSIONS and IRRELEVANT_PATH_SEGMENTS, as the sub-path check makes it less necessary.
  - CAPTCHA detection is more robust with captcha_prompt_lock and a custom CaptchaDetectedError (though currently handled by automated wait and retry).
  - Retry delays are now randomized (time.sleep(random.uniform(5, 15))).
- Modified should_archive() Function:
  - Uses generate_random_user_agent() for Wayback Machine requests.
  - Uses log_message() for clear output.
  - Refers to the general 'retries' setting.
- Modified process_link_for_archiving() Function:
  - Uses log_message() for consistent output.
  - Refers to the general 'retries' setting.
  - Rate limit warnings use log_message().
- Refactored crawl_website() Function:
  - Now accepts archiver_executor, archiving_futures, global_archive_action, and link_relationships to support parallel archiving and visual graph data collection.
  - Uses wrapper_get_internal_links() to manage thread-local WebDriver instances.
  - Submits links for archiving immediately upon discovery if an archiver_executor is provided.
  - Utilizes log_message() for all output.
- Major Overhaul of main() Function:
  - Introduced all_link_relationships, archiving_futures, and crawling_futures to manage concurrent operations and graph data.
  - Implemented concurrent.futures.ThreadPoolExecutor for both crawling and archiving, allowing parallel execution.
  - Revised archiving action logic to print consolidated messages based on the global choice.
  - New Visual Tree Generation Logic (see the sketch after this list):
    - Generates a directed graph (nx.DiGraph()) using networkx.
    - Adds nodes and edges based on all_link_relationships.
    - Positions the mother_url (first initial URL) at the center and makes it visually distinct (larger node, red color).
    - Dynamic and Truncated Node Labels: Labels are now derived from the domain and path, and are smartly truncated to a max_label_length (e.g., 25 characters) to prevent overlap, prioritizing the domain and a portion of the path.
    - The font size of labels is dynamically adjusted (font_size = max(2, min(6, 400 // (len(G.nodes()) + len(initial_urls))))) to scale based on the number of nodes and initial URLs, minimizing overlap.
    - Increases the figure size (figsize=(16, 12)) for better readability of the graph.
    - Saves the plot as 'website_link_tree.png'.
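A sketch of the visual tree generation described above; the figure size, font-size formula, and truncation limit come from these notes, while the layout and styling details are assumptions:

```python
# Sketch of the link-tree plot, assuming link_relationships is a list of
# (parent, child) URL pairs and initial_urls is the list of starting URLs.
import matplotlib.pyplot as plt
import networkx as nx

def plot_link_tree(link_relationships, initial_urls):
    mother_url = initial_urls[0]  # first initial URL, per the notes
    G = nx.DiGraph()
    G.add_edges_from(link_relationships)
    pos = nx.spring_layout(G, seed=42)
    pos[mother_url] = (0.0, 0.0)  # pin the mother URL at the center
    sizes = [600 if n == mother_url else 100 for n in G.nodes()]
    colors = ["red" if n == mother_url else "skyblue" for n in G.nodes()]
    labels = {n: (n if len(n) <= 25 else n[:25] + "…") for n in G.nodes()}
    # Font size scales down as the graph grows, as in the notes.
    font_size = max(2, min(6, 400 // (len(G.nodes()) + len(initial_urls))))
    plt.figure(figsize=(16, 12))
    nx.draw(G, pos, node_size=sizes, node_color=colors, arrows=True,
            labels=labels, font_size=font_size)
    plt.savefig("website_link_tree.png")
```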