Fast-path cell drawing + position-based tokenizer #39
Open
billdenney wants to merge 3 commits into
Round-3 profile (after D-43/D-44) showed .draw_cell_text() at the call
site accumulating 73% of core_paginate's total time. The dominant costs
inside it were viewport push/pop overhead (~20%) and per-cell text-width
measurement and grid.text drawing.

Four changes layered on top of the round-2 baseline:

1. Add a fast path in .draw_cell_text(): when text fits the column
   (needed <= col_width_in), skip the clip viewport entirely and draw
   directly in the parent viewport. Most numeric/short-categorical cells
   take this path. The clip viewport is still pushed when text might
   bleed.
2. Drop the two convertUnit calls in the slow path that re-measured the
   just-pushed clip viewport's npc=1 dimensions. Those values are the
   literal clip_w and row_h we constructed it with.
3. Pre-format each column's cell strings once via
   .fmt_cell_vec(data[[cs$col]][rows], na_str) before the row loop,
   replacing the per-cell .fmt_cell() call.
4. Rework .tokenize_for_wrap() to track (cur_start, cur_end) positions
   and emit text via substr(s, cur_start, cur_end), instead of
   accumulating characters into cur_buf and pasting on flush. One C call
   per token replaces n element accumulations plus a paste().

Measured speedup (medians; baseline = round-2 HEAD at 7c9484c):

    core_paginate  397 ms -> 270 ms  (~32% faster)
    core_small     146 ms -> 111 ms  (~24% faster)
    wrap_demos     2.97 s -> 2.80 s  (~6% faster, 5-run avg)
    core_wrap / figure_multi: unchanged (their bottlenecks lie elsewhere)

Documented as D-45 in design/DECISIONS.md. Full devtools::test() green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
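The fast-path branch in change 1 can be sketched as follows. This is a minimal illustration, not the package's actual implementation: `draw_cell_text_sketch` is a hypothetical name, and the argument list is reduced to the quantities the commit message mentions (`needed`, `col_width_in`, `row_h`).

```r
library(grid)

# Sketch: measure once, then only push a clipping viewport when the text
# might bleed past the column. Returns which path was taken, invisibly.
draw_cell_text_sketch <- function(label, x, y, col_width_in, row_h, gp) {
  needed <- convertWidth(grobWidth(textGrob(label, gp = gp)),
                         "in", valueOnly = TRUE)
  if (needed <= col_width_in) {
    # Fast path: the text fits, so skip the clip viewport entirely and
    # draw directly in the parent viewport.
    grid.text(label, x = x, y = y, just = "left", gp = gp)
    invisible("fast")
  } else {
    # Slow path: text might bleed, so push a clipping viewport as before.
    pushViewport(viewport(x = x, y = y, just = c("left", "centre"),
                          width = unit(col_width_in, "in"),
                          height = row_h, clip = "on"))
    grid.text(label, x = 0, just = "left", gp = gp)
    popViewport()
    invisible("clip")
  }
}
```

The point of the branch is that most numeric and short-categorical cells never pay the push/pop cost at all.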
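Change 4, the position-based tokenizer, can be sketched like this. It is an assumption-laden toy (the real .tokenize_for_wrap() presumably handles more separator classes); the sketch only shows the buffer-free pattern of tracking positions and cutting each token out with a single substr() call.

```r
# Sketch: split on spaces by tracking a start position instead of
# accumulating characters into a buffer and paste()-ing on flush.
tokenize_positions <- function(s) {
  n <- nchar(s)
  tokens <- character(0)
  cur_start <- 1L
  for (i in seq_len(n)) {
    if (substr(s, i, i) == " ") {
      if (i > cur_start) {
        # One C call emits the whole token.
        tokens <- c(tokens, substr(s, cur_start, i - 1L))
      }
      cur_start <- i + 1L
    }
  }
  if (cur_start <= n) tokens <- c(tokens, substr(s, cur_start, n))
  tokens
}

tokenize_positions("Visit 1  Day 8")
```

Per token, this replaces n single-character accumulations plus a paste() with one substr() extraction.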
Profile-driven follow-up to D-44. The per-page clip-width cache
introduced there eliminated intra-page duplication but every page of
a multi-page table built a fresh env, so the same cell text (numeric
formats, visit labels, repeating categoricals) got re-measured on
each page.
Build one list of cache envs (one per column in resolved_cols) at the
top of tfl_table_to_pagelist() and thread it into every
build_table_grob() call. All page-grobs hold a reference to the same
list; drawDetails reads x$clip_width_caches[x$col_group_idx] when the
slot is present and falls back to per-page envs otherwise.
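The shared-cache wiring can be sketched as below. Names are hypothetical stand-ins (`make_clip_width_caches`, `cached_width`, `measure_fn` do not appear in the source); the sketch only shows the shape: one environment per column, built once, with every page's grob closing over the same list, so a string measured on page 1 is a cache hit on page 17.

```r
# One cache environment per resolved column, built once per table.
make_clip_width_caches <- function(n_cols) {
  lapply(seq_len(n_cols), function(i) new.env(parent = emptyenv()))
}

# Look up a string's measured width in a column's cache, measuring only
# on a miss. measure_fn is whatever expensive measurement the caller uses.
cached_width <- function(cache, s, measure_fn) {
  if (!exists(s, envir = cache, inherits = FALSE)) {
    assign(s, measure_fn(s), envir = cache)
  }
  get(s, envir = cache, inherits = FALSE)
}
```

Because environments are reference objects in R, all page-grobs holding the same list see each other's cached entries without copying.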
Measured (medians, n=30):
    iris (150 rows, 5 pages)              264 ms -> 258 ms  (~2% faster)
    500-row synth (~17 pages, repeating
      visit/flag categoricals)            1.56 s -> 1.33 s  (~15% faster)
Short single-page tables: no change. The gain scales with page count
and inter-page duplication, which is exactly the realistic
multi-page-clinical-listing case.
Documented as D-46 in design/DECISIONS.md. New focused benchmark at
examples/bench_focused.R. Full devtools::test() green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profile-driven follow-up after D-46. Pagination was constructing two
textGrobs per unique cell string -- one in .measure_max_string_width()
(Pass 1, widths) and another in measure_row_heights_tbl()'s inline
.memo_str_height() (Pass 2, heights). Each construction re-ran grid's
validGP / set.gpar chain, the dominant per-call cost.

Add a single (gp_key, string) -> list(w, h) cache that lives for one
.tfl_table_to_pagelist_default() call. Pass 1, Pass 2, and the
header-height pass all share it. When a new entry is built, the textGrob
is consulted once and BOTH dimensions are cached, so whichever pass
needs the other dimension later gets it for free.

The gp_key namespace matches the one measure_row_heights_tbl() was
already using internally (e.g. "data_row_lh1.2", "header_row_lh1.2"), so
width-then-height on the same (category, string) hits the cache.

Multiple PDF scratch devices opened during pagination all share the same
(pg_width, pg_height) and font setup, so device-deterministic metrics
make sharing the cache across them safe. The cache does NOT cross into
the render-device drawing phase -- that boundary still goes through
.draw_cell_text()'s separate re-measurement (D-44/D-46 territory).

Measured speedup (n=30 iterations, min-of-mins across 3 independent
runs):

    iris5p (150 rows, 5 pages)    329 ms -> 255 ms  (~22% ↓)
    big_df (500 rows, ~17 pages)  1.45 s -> 1.36 s  (~6% ↓)

iris5p is the targeted scenario for pagination-heavy work; big_df is
dominated by drawing and gains less.

Documented as D-47 in design/DECISIONS.md. Full devtools::test() green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
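The measure-both-dimensions-on-miss pattern can be sketched as below. Helper names (`make_wh_cache`, `cached_wh`) are invented for illustration; the real code hangs this state off a single .tfl_table_to_pagelist_default() call and uses its own key scheme.

```r
library(grid)

make_wh_cache <- function() new.env(parent = emptyenv())

# Look up (gp_key, string); on a miss, build the textGrob ONCE (the
# expensive validGP / set.gpar step) and cache both width and height,
# so whichever pass needs the other dimension later gets it for free.
cached_wh <- function(cache, gp_key, s, gp) {
  key <- paste0(gp_key, "\r", s)  # "\r" cannot occur in either part
  hit <- cache[[key]]
  if (is.null(hit)) {
    g <- textGrob(s, gp = gp)
    hit <- list(
      w = convertWidth(grobWidth(g), "in", valueOnly = TRUE),
      h = convertHeight(grobHeight(g), "in", valueOnly = TRUE)
    )
    cache[[key]] <- hit
  }
  hit
}
```

A width lookup under "data_row_lh1.2" followed by a height lookup for the same string then costs one textGrob construction total, not two.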