Fast-path cell drawing + position-based tokenizer #39

Open
billdenney wants to merge 3 commits into main from claude/perf-round2

Fast-path cell drawing + position-based tokenizer#39
billdenney wants to merge 3 commits into
mainfrom
claude/perf-round2

Conversation

@billdenney
Member

Round-3 profile (after D-43/D-44) showed .draw_cell_text() at the call site accumulating 73% of core_paginate's total time. The dominant costs inside it were viewport push/pop overhead (~20%) and per-cell text-width measurement and grid.text drawing.

Four changes layered on top of the round-2 baseline:

  1. Add a fast path in .draw_cell_text(): when text fits the column (needed <= col_width_in), skip the clip viewport entirely and draw directly in the parent viewport. Most numeric/short-categorical cells take this path. The clip viewport is still pushed when text might bleed.

  2. Drop the two convertUnit calls in the slow path that re-measured the just-pushed clip viewport's npc=1 dimensions. Those values are the literal clip_w and row_h we constructed it with.

  3. Pre-format each column's cell strings once via .fmt_cell_vec(data[[cs$col]][rows], na_str) before the row loop, replacing the per-cell .fmt_cell() call.

  4. Rework .tokenize_for_wrap() to track (cur_start, cur_end) positions and emit text via substr(s, cur_start, cur_end), instead of accumulating characters into cur_buf and pasting on flush. One C call per token replaces n element accumulations + paste.

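The position tracking in change 4 can be sketched in plain R. This is a hypothetical simplification (whitespace-only separators); the real .tokenize_for_wrap() presumably distinguishes more token classes. The point it illustrates: each token is emitted with one substr() call over the original string, with no per-character buffer.

```r
# Hypothetical sketch of position-based tokenization (change 4).
# tokenize_positions is an illustrative name, not the package function.
tokenize_positions <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1L]]
  tokens <- character(0)
  cur_start <- 0L                       # 0 means no token currently open
  flush <- function(end) {
    if (cur_start > 0L)                 # one substr() per token, no paste()
      tokens[[length(tokens) + 1L]] <<- substr(s, cur_start, end)
    cur_start <<- 0L
  }
  for (i in seq_along(chars)) {
    if (chars[[i]] == " ") {
      flush(i - 1L)
      tokens[[length(tokens) + 1L]] <- " "   # separators are their own tokens
    } else if (cur_start == 0L) {
      cur_start <- i                    # just remember where the token starts
    }
  }
  flush(length(chars))
  tokens
}
```

The buffer-based version this replaces would append each non-space character to a cur_buf vector and paste() it on flush; tracking only (cur_start, cur_end) makes the loop body a couple of integer assignments.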
Measured speedup (medians; baseline = round-2 HEAD at 7c9484c):
  core_paginate    397 ms -> 270 ms   (~32% faster)
  core_small       146 ms -> 111 ms   (~24% faster)
  wrap_demos       2.97 s -> 2.80 s   (~6% faster, 5-run avg)
  core_wrap / figure_multi: unchanged (their bottlenecks lie elsewhere)

Documented as D-45 in design/DECISIONS.md. Full devtools::test() green.

billdenney and others added 3 commits May 12, 2026 08:58
Profile-driven follow-up to D-44.  The per-page clip-width cache
introduced there eliminated intra-page duplication but every page of
a multi-page table built a fresh env, so the same cell text (numeric
formats, visit labels, repeating categoricals) got re-measured on
each page.

Build one list of cache envs (one per column in resolved_cols) at the
top of tfl_table_to_pagelist() and thread it into every
build_table_grob() call.  All page-grobs hold a reference to the same
list; drawDetails reads x$clip_width_caches[x$col_group_idx] when the
slot is present and falls back to per-page envs otherwise.
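The sharing pattern can be sketched with plain environments. Names here (make_clip_caches, cached_width, measure) are illustrative stand-ins, not the package's API, and nchar() stands in for grid text measurement:

```r
# Hypothetical sketch of D-46: one cache env per column, built once and
# shared by every page's grob, so repeated cell text is measured once.
make_clip_caches <- function(n_cols) {
  lapply(seq_len(n_cols), function(i) new.env(parent = emptyenv()))
}

cached_width <- function(cache, text, measure) {
  if (!exists(text, envir = cache, inherits = FALSE))
    assign(text, measure(text), envir = cache)
  get(text, envir = cache, inherits = FALSE)
}

caches <- make_clip_caches(2L)
calls  <- 0L
measure <- function(s) { calls <<- calls + 1L; nchar(s) }  # stand-in metric

# Pages 1 and 2 both contain "Visit 1" in column 1: measured only once.
w1 <- cached_width(caches[[1L]], "Visit 1", measure)
w2 <- cached_width(caches[[1L]], "Visit 1", measure)
```

Because every page-grob holds a reference to the same list of environments, the memoization survives across pages without any explicit invalidation step.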

Measured (medians, n=30):
  iris (150 rows, 5 pages)              264 ms -> 258 ms  (~2%)
  500-row synth (~17 pages, repeating
    visit/flag categoricals)            1.56 s -> 1.33 s  (~15% faster)

Short single-page tables: no change.  The gain scales with page count
and inter-page duplication, which is exactly the realistic
multi-page-clinical-listing case.

Documented as D-46 in design/DECISIONS.md.  New focused benchmark at
examples/bench_focused.R.  Full devtools::test() green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profile-driven follow-up after D-46.  Pagination was constructing two
textGrobs per unique cell string -- one in .measure_max_string_width()
(Pass 1, widths) and another in measure_row_heights_tbl()'s inline
.memo_str_height() (Pass 2, heights).  Each construction re-ran grid's
validGP / set.gpar chain, the dominant per-call cost.

Add a single (gp_key, string) -> list(w, h) cache that lives for one
.tfl_table_to_pagelist_default() call.  Pass 1, Pass 2, and the
header-height pass all share it.  When a new entry is built, the
textGrob is consulted once and BOTH dimensions are cached, so whichever
pass needs the other dimension later gets it for free.

The gp_key namespace matches the one measure_row_heights_tbl() was
already using internally (e.g. "data_row_lh1.2", "header_row_lh1.2"),
so width-then-height on the same (category, string) hits the cache.
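The both-dimensions-at-once idea can be sketched as follows. measure_both stands in for building a textGrob and reading its width and height; the names (wh_cache, cached_dim) and the "\r" key separator are illustrative assumptions, not the package's internals:

```r
# Hypothetical sketch of the D-47 cache: one entry per (gp_key, string)
# stores BOTH width and height, so whichever pass asks second pays nothing.
wh_cache    <- new.env(parent = emptyenv())
grob_builds <- 0L
measure_both <- function(s) {
  grob_builds <<- grob_builds + 1L        # one "textGrob construction"
  list(w = nchar(s), h = 1L)              # stand-in for grobWidth/grobHeight
}

cached_dim <- function(gp_key, s, dim) {
  key <- paste0(gp_key, "\r", s)          # "\r" assumed absent from cell text
  if (!exists(key, envir = wh_cache, inherits = FALSE))
    assign(key, measure_both(s), envir = wh_cache)
  get(key, envir = wh_cache, inherits = FALSE)[[dim]]
}

w <- cached_dim("data_row_lh1.2", "Placebo", "w")  # Pass 1 builds the entry
h <- cached_dim("data_row_lh1.2", "Placebo", "h")  # Pass 2 hits the cache
```

Keying on gp_key keeps entries from different gpar setups (data rows vs. headers, differing line heights) from colliding, which is what makes the width-then-height reuse safe.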

Multiple PDF scratch devices opened during pagination all share the
same (pg_width, pg_height) and font setup, so device-deterministic
metrics make sharing the cache across them safe.  The cache does NOT
cross into the render-device drawing phase -- that boundary still goes
through .draw_cell_text()'s separate re-measurement (D-44/D-46
territory).

Measured speedup (n=30 iterations, min-of-mins across 3 independent
runs):
  iris5p   (150 rows, 5 pages)              329 ms -> 255 ms  (~22% ↓)
  big_df   (500 rows, ~17 pages)            1.45 s -> 1.36 s  (~6%  ↓)

iris5p is the targeted scenario for pagination-heavy work; big_df is
dominated by drawing and gains less.

Documented as D-47 in design/DECISIONS.md.  Full devtools::test() green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
