Skip to content

Commit 8db83bc

Browse files
More reachability logging documentation (#768)
1 parent da316cf commit 8db83bc

1 file changed

Lines changed: 42 additions & 1 deletion

File tree

mdbook/src/chapter_5/chapter_5_4.md

Lines changed: 42 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -224,7 +224,48 @@ The `<T>` in the stream names is the Rust type name of the dataflow's timestamp,
224224

225225
**`TimelyProgressEvent<T>`** captures the exchange of progress information between operators. Each event records whether it is a send or receive (`is_send`), the `source` worker, the `channel` and `seq_no`, the `identifier` of the operator, and two lists of updates: `messages` (updates to message counts at targets) and `internal` (updates to capabilities at sources). Each update is a tuple `(node, port, timestamp, delta)`. These are primarily useful for debugging the progress tracking protocol.
226226

227-
**`TrackerEvent<T>`** records updates to the reachability tracker, which maintains the set of timestamps that could still arrive at each operator input. The variants are `SourceUpdate` and `TargetUpdate`, each carrying the node, port, timestamp, and delta of the update.
227+
**`TrackerEvent<T>`** records updates to the reachability tracker, which maintains the set of timestamps that could still arrive at each operator port. Each scope (subgraph) has its own tracker, identified by `tracker_id` — this is the worker-unique `id` of the scope operator (the same `id` from `OperatesEvent`).
228+
229+
The tracker monitors two kinds of locations:
230+
231+
- **Targets** (operator input ports): timestamps of messages that may still arrive.
232+
- **Sources** (operator output ports): timestamps of capabilities that operators still hold.
233+
234+
The `TrackerEvent` enum has two variants:
235+
236+
| Variant | Fields | Description |
237+
|---------|--------|-------------|
238+
| `SourceUpdate` | `tracker_id`, `updates` | Changes to capability counts at operator output ports. |
239+
| `TargetUpdate` | `tracker_id`, `updates` | Changes to message counts at operator input ports. |
240+
241+
Each entry in `updates` is a tuple `(node, port, timestamp, delta)`:
242+
243+
| Field | Type | Description |
244+
|-------|------|-------------|
245+
| `node` | `usize` | Scope-local operator index (same convention as `ChannelsEvent` source/target indices, including 0 for the scope boundary). |
246+
| `port` | `usize` | Port index on that operator. |
247+
| `timestamp` | `T` | The timestamp being updated. |
248+
| `delta` | `i64` | The change in count: positive means a new capability or pending message; negative means one was retired. |
249+
250+
A `SourceUpdate` with positive `delta` means an operator has acquired (or retained) a capability to produce data at that timestamp on that output port. A negative `delta` means it has released one. Similarly, a `TargetUpdate` with positive `delta` means messages at that timestamp may still arrive at that input port; negative means some have been accounted for.
251+
252+
The frontier at any location is the set of timestamps with positive accumulated count. When all counts at a target reach zero for a given timestamp, the operator knows no more messages at that timestamp will arrive — this is the mechanism by which operators learn they can "close" a timestamp and make progress.
253+
254+
### Using Reachability Logging for Debugging
255+
256+
The `TrackerEvent` stream is particularly useful for diagnosing progress-tracking issues — for example, understanding why a dataflow appears stuck or why a particular timestamp hasn't completed.
257+
258+
**Reconstructing capability state.** Since each event carries a `delta`, you can reconstruct the current capability state at any point by accumulating deltas. For each `(tracker_id, node, port, timestamp)`, sum the deltas from all `SourceUpdate` events. A positive sum means the operator currently holds a capability at that timestamp on that port. When the sum reaches zero, the capability has been fully released.
259+
260+
The same applies to `TargetUpdate` events for message counts: a positive accumulated count at a target means messages at that timestamp may still be in flight.
261+
262+
**Identifying a stuck dataflow.** When a dataflow hangs, the accumulated state tells you exactly which operators hold capabilities and at which timestamps. Cross-reference the `tracker_id` with `OperatesEvent` to identify the scope, and the `node` with operator addresses within that scope (recall that `node` is a scope-local index, and the operator's full address is the scope's `addr` with `node` appended).
263+
264+
For example, if accumulated `SourceUpdate` deltas show that node 5 in tracker 7 holds a capability at timestamp `(42, 3)`, and `OperatesEvent` tells you tracker 7 is the scope at address `[0, 2]`, then the operator at address `[0, 2, 5]` holds the capability. Look up its name in the `OperatesEvent` log to identify it.
265+
266+
**Understanding why a frontier hasn't advanced.** The frontier at an operator's input can only advance when all upstream capabilities that could produce data at the current frontier timestamps have been released. The `SourceUpdate` events let you identify which operators still hold such capabilities. Trace the graph (using `ChannelsEvent` and operator summaries from `OperatesSummaryEvent`) from those capabilities forward to the stuck operator's input to understand the dependency chain.
267+
268+
**Matching scopes to log streams.** Each scope has its own tracker and its own log stream. A dataflow using `usize` timestamps with a nested iterative scope would produce two reachability streams: `"timely/reachability/usize"` for the root scope, and something like `"timely/reachability/Product<usize, u64>"` for the iterative scope (where the `u64` is the iteration counter). Register loggers for each stream you want to observe.
228269

229270
## Registering a Logger
230271

0 commit comments

Comments
 (0)