Fast-path cell drawing + position-based tokenizer #39

Open
billdenney wants to merge 3 commits into main from claude/perf-round2

Fast-path cell drawing + position-based tokenizer#39
billdenney wants to merge 3 commits into
mainfrom
claude/perf-round2

Conversation

@billdenney
Member

Round-3 profile (after D-43/D-44) showed .draw_cell_text() at the call site accumulating 73% of core_paginate's total time. The dominant costs inside it were viewport push/pop overhead (~20%) and per-cell text-width measurement and grid.text drawing.

Four changes layered on top of the round-2 baseline:

  1. Add a fast path in .draw_cell_text(): when text fits the column (needed <= col_width_in), skip the clip viewport entirely and draw directly in the parent viewport. Most numeric/short-categorical cells take this path. The clip viewport is still pushed when text might bleed.

  2. Drop the two convertUnit calls in the slow path that re-measured the just-pushed clip viewport's npc=1 dimensions. Those values are the literal clip_w and row_h we constructed it with.

  3. Pre-format each column's cell strings once via .fmt_cell_vec(data[[cs$col]][rows], na_str) before the row loop, replacing the per-cell .fmt_cell() call.

  4. Rework .tokenize_for_wrap() to track (cur_start, cur_end) positions and emit text via substr(s, cur_start, cur_end), instead of accumulating characters into cur_buf and pasting on flush. One C call per token replaces n element accumulations + paste.

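The position tracking in change 4 can be sketched in plain R. This is a hypothetical simplification (whitespace-only separators); the real .tokenize_for_wrap() presumably distinguishes more token classes. The point it illustrates: each token is emitted with one substr() call over the original string, with no per-character buffer.

```r
# Hypothetical sketch of position-based tokenization (change 4).
# tokenize_positions is an illustrative name, not the package function.
tokenize_positions <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1L]]
  tokens <- character(0)
  cur_start <- 0L                       # 0 means no token currently open
  flush <- function(end) {
    if (cur_start > 0L)                 # one substr() per token, no paste()
      tokens[[length(tokens) + 1L]] <<- substr(s, cur_start, end)
    cur_start <<- 0L
  }
  for (i in seq_along(chars)) {
    if (chars[[i]] == " ") {
      flush(i - 1L)
      tokens[[length(tokens) + 1L]] <- " "   # separators are their own tokens
    } else if (cur_start == 0L) {
      cur_start <- i                    # just remember where the token starts
    }
  }
  flush(length(chars))
  tokens
}
```

The buffer-based version this replaces would append each non-space character to a cur_buf vector and paste() it on flush; tracking only (cur_start, cur_end) makes the loop body a couple of integer assignments.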
Measured speedup (medians; baseline = round-2 HEAD at 7c9484c):
  core_paginate    397 ms -> 270 ms   (~32% faster)
  core_small       146 ms -> 111 ms   (~24% faster)
  wrap_demos       2.97 s -> 2.80 s   (~6% faster, 5-run avg)
  core_wrap / figure_multi: unchanged (their bottlenecks lie elsewhere)

Documented as D-45 in design/DECISIONS.md. Full devtools::test() green.

billdenney and others added 3 commits May 12, 2026 08:58
Profile-driven follow-up to D-44.  The per-page clip-width cache
introduced there eliminated intra-page duplication but every page of
a multi-page table built a fresh env, so the same cell text (numeric
formats, visit labels, repeating categoricals) got re-measured on
each page.

Build one list of cache envs (one per column in resolved_cols) at the
top of tfl_table_to_pagelist() and thread it into every
build_table_grob() call.  All page-grobs hold a reference to the same
list; drawDetails reads x$clip_width_caches[x$col_group_idx] when the
slot is present and falls back to per-page envs otherwise.
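The sharing pattern can be sketched with plain environments. Names here (make_clip_caches, cached_width, measure) are illustrative stand-ins, not the package's API, and nchar() stands in for grid text measurement:

```r
# Hypothetical sketch of D-46: one cache env per column, built once and
# shared by every page's grob, so repeated cell text is measured once.
make_clip_caches <- function(n_cols) {
  lapply(seq_len(n_cols), function(i) new.env(parent = emptyenv()))
}

cached_width <- function(cache, text, measure) {
  if (!exists(text, envir = cache, inherits = FALSE))
    assign(text, measure(text), envir = cache)
  get(text, envir = cache, inherits = FALSE)
}

caches <- make_clip_caches(2L)
calls  <- 0L
measure <- function(s) { calls <<- calls + 1L; nchar(s) }  # stand-in metric

# Pages 1 and 2 both contain "Visit 1" in column 1: measured only once.
w1 <- cached_width(caches[[1L]], "Visit 1", measure)
w2 <- cached_width(caches[[1L]], "Visit 1", measure)
```

Because every page-grob holds a reference to the same list of environments, the memoization survives across pages without any explicit invalidation step.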

Measured (medians, n=30):
  iris (150 rows, 5 pages)              264 ms -> 258 ms  (~2%)
  500-row synth (~17 pages, repeating
    visit/flag categoricals)            1.56 s -> 1.33 s  (~15% faster)

Short single-page tables: no change.  The gain scales with page count
and inter-page duplication, which is exactly the realistic
multi-page-clinical-listing case.

Documented as D-46 in design/DECISIONS.md.  New focused benchmark at
examples/bench_focused.R.  Full devtools::test() green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profile-driven follow-up after D-46.  Pagination was constructing two
textGrobs per unique cell string -- one in .measure_max_string_width()
(Pass 1, widths) and another in measure_row_heights_tbl()'s inline
.memo_str_height() (Pass 2, heights).  Each construction re-ran grid's
validGP / set.gpar chain, the dominant per-call cost.

Add a single (gp_key, string) -> list(w, h) cache that lives for one
.tfl_table_to_pagelist_default() call.  Pass 1, Pass 2, and the
header-height pass all share it.  When a new entry is built, the
textGrob is consulted once and BOTH dimensions are cached, so whichever
pass needs the other dimension later gets it for free.

The gp_key namespace matches the one measure_row_heights_tbl() was
already using internally (e.g. "data_row_lh1.2", "header_row_lh1.2"),
so width-then-height on the same (category, string) hits the cache.
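The both-dimensions-at-once idea can be sketched as follows. measure_both stands in for building a textGrob and reading its width and height; the names (wh_cache, cached_dim) and the "\r" key separator are illustrative assumptions, not the package's internals:

```r
# Hypothetical sketch of the D-47 cache: one entry per (gp_key, string)
# stores BOTH width and height, so whichever pass asks second pays nothing.
wh_cache    <- new.env(parent = emptyenv())
grob_builds <- 0L
measure_both <- function(s) {
  grob_builds <<- grob_builds + 1L        # one "textGrob construction"
  list(w = nchar(s), h = 1L)              # stand-in for grobWidth/grobHeight
}

cached_dim <- function(gp_key, s, dim) {
  key <- paste0(gp_key, "\r", s)          # "\r" assumed absent from cell text
  if (!exists(key, envir = wh_cache, inherits = FALSE))
    assign(key, measure_both(s), envir = wh_cache)
  get(key, envir = wh_cache, inherits = FALSE)[[dim]]
}

w <- cached_dim("data_row_lh1.2", "Placebo", "w")  # Pass 1 builds the entry
h <- cached_dim("data_row_lh1.2", "Placebo", "h")  # Pass 2 hits the cache
```

Keying on gp_key keeps entries from different gpar setups (data rows vs. headers, differing line heights) from colliding, which is what makes the width-then-height reuse safe.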

Multiple PDF scratch devices opened during pagination all share the
same (pg_width, pg_height) and font setup, so device-deterministic
metrics make sharing the cache across them safe.  The cache does NOT
cross into the render-device drawing phase -- that boundary still goes
through .draw_cell_text()'s separate re-measurement (D-44/D-46
territory).

Measured speedup (n=30 iterations, min-of-mins across 3 independent
runs):
  iris5p   (150 rows, 5 pages)              329 ms -> 255 ms  (~22% ↓)
  big_df   (500 rows, ~17 pages)            1.45 s -> 1.36 s  (~6%  ↓)

iris5p is the targeted scenario for pagination-heavy work; big_df is
dominated by drawing and gains less.

Documented as D-47 in design/DECISIONS.md.  Full devtools::test() green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
