Slow logdir discovery on cloud filesystems due to level-by-level globbing #7088

@bzantium

Summary

When --logdir points at a cloud filesystem path (e.g., gs://bucket/experiments/), TensorBoard's GetLogdirSubdirectories calls ListRecursivelyViaGlobbing, which globs level by level (*, */*, */*/*, ...) and lists every file at every directory depth. Discovery becomes very slow (~100 seconds in the case below) when the directory tree contains many non-event files such as model checkpoints.

Steps to reproduce

  1. Have a GCS directory structure like:
    gs://bucket/experiments/
      experiment-a/
        checkpoints/     # contains thousands of files
        tensorboard/     # contains a few tfevents files
      experiment-b/
        checkpoints/
        tensorboard/
    
  2. Run: tensorboard --logdir gs://bucket/experiments/ --load_fast=false
  3. Observe that the initial data loading takes ~100 seconds.

Expected behavior

TensorBoard should discover the event file directories in a few seconds.

Root cause

ListRecursivelyViaGlobbing iterates level by level, listing all files at each depth:

  • Level 0: experiments/* → 2 entries
  • Level 1: experiments/*/* → 4 entries
  • Level 2: experiments/*/*/* → 27 entries
  • ...
  • Level 5: experiments/*/*/*/*/* → 10,963 entries (mostly checkpoint shards)

Most of these files are irrelevant checkpoint data. GCS does not support server-side pattern filtering, so each glob level requires listing all objects under the prefix.
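To make the cost pattern concrete, here is a minimal sketch of level-by-level globbing. It is not TensorBoard's actual code: count_entries_per_level is a hypothetical helper, and the standard-library glob.glob on a local directory stands in for the cloud filesystem's glob. It only shows that every level returns all entries at that depth, relevant or not.

```python
import glob
import os


def count_entries_per_level(top, glob_fn=glob.glob):
    """Glob top/*, top/*/*, ... until a level is empty.

    Returns the number of entries matched at each depth. Hypothetical
    helper for illustration; glob_fn is an assumed stand-in for the
    filesystem-specific glob used on cloud paths.
    """
    # Escape glob metacharacters in the root so only the "*" parts are patterns.
    pattern = os.path.join(glob.escape(top), "*")
    counts = []
    while True:
        matches = glob_fn(pattern)
        if not matches:
            return counts
        # Every entry at this depth is returned, including checkpoint files.
        counts.append(len(matches))
        # Descend one level: top/* becomes top/*/*, and so on.
        pattern = os.path.join(pattern, "*")
```

On a tree shaped like the one in the steps to reproduce, the per-level counts grow with the number of checkpoint files even though only the tensorboard/ directories matter for discovery.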

Proposed fix

For cloud paths, replace the level-by-level globbing with a single targeted recursive glob, **/*tfevents*. This still lists all objects under the prefix (GCS limitation), but does so in one recursive listing instead of one listing per depth level, which avoids re-listing the same prefix at each level.
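A rough sketch of the proposed approach, under the same assumptions as before: find_event_dirs is a hypothetical name, and glob.glob with recursive=True on a local directory stands in for the cloud filesystem's glob (whether a given cloud backend accepts ** in the same way is an assumption to verify). A single recursive pattern is issued, and the matches are grouped by parent directory.

```python
import collections
import glob
import os


def find_event_dirs(top, glob_fn=None):
    """Return {directory: [event file paths]} using one recursive glob.

    Hypothetical helper for illustration; glob_fn is an assumed stand-in
    for the filesystem-specific glob used on cloud paths.
    """
    if glob_fn is None:
        # Local stand-in: "**" matches zero or more directory levels.
        glob_fn = lambda pattern: glob.glob(pattern, recursive=True)
    pattern = os.path.join(glob.escape(top), "**", "*tfevents*")
    by_dir = collections.defaultdict(list)
    for path in glob_fn(pattern):
        # Group each matching file under the directory that contains it.
        by_dir[os.path.dirname(path)].append(path)
    return {d: sorted(paths) for d, paths in by_dir.items()}
```

The keys of the returned mapping are the candidate run directories. Checkpoint files never reach the caller because they do not match the pattern.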

Benchmark on a real GCS directory with ~26,000 checkpoint files and 8 event files:

  • Before: ~101 seconds
  • After: ~13 seconds

Local filesystem paths are unaffected (they use ListRecursivelyViaWalking).

Environment

  • TensorBoard 2.20.0
  • macOS (Apple Silicon)
  • Python 3.12
  • Using gcsfs for GCS filesystem support (no TensorFlow installed)
