expire: skip per-article SMretrieve via cancel tombstone log #340
kev009 wants to merge 2 commits into
|
A great idea! Thanks for it, Kevin. |
|
This one is definitely more invasive than the expireover change (which is basically a free lunch), and it is less critical (on my spool, expire previously took about 27h). But the two are designed to work together, and once effective (after the first run) this will bring expire down to a runtime of seconds to minutes, even for a very busy site. With that kind of speedup, it also becomes feasible to run these more frequently and to disable nnrpdcheckart. |
|
Isn't there any need to sometimes do a full legacy run? I am wondering whether we might end up with orphans that are never expired, even after a few runs, with either the bloom filter or this new feature. |
|
@Julien-Elie yes, the gap is a manual 'rm' of something or filesystem corruption, in which case a full expire run would be necessary. We need to make sure the documentation is clear about that: these features are intended for healthy spools where INN tools are used to manage the spool. If that assumption is invalidated, the admin needs to do a patrol expire. |
9ce35a5 to a18db3d
InterNetNews/inn#340: Gated by EXPERIMENTAL option. Move all patches to local to handle merge conflicts.
|
Actual runtime on my system. It takes a little longer than I hypothesized because I forgot the actual data copy cost of writing out a new history file. But plenty fast: |
Adds an opt-in (expiretombstone in inn.conf, default false)
side-channel that records every cancellation alongside the existing
storage and overview deletions, so a subsequent expire run can drop
matching history entries without doing per-article
SMretrieve(RETR_STAT) syscalls. At billion-article scale on
tradspool/timehash this turns a billion access(2) calls into a few
thousand hash lookups.
Two log files cooperate:
${pathdb}/expireover.tombstone -- written by expireover after each
successful inline SMcancel; in delayrm mode written up front and
promoted by expirerm after fastrm succeeds. Atomic .NEW -> final
rename under an exclusive non-blocking fcntl POSIX lock. A
crashed-or-failed previous run's leftover .NEW is verified
per-token via SMretrieve and merged into the next run, so partial
rmfile failures cannot orphan articles on disk.
${pathdb}/cancels.tombstone -- appended by innd's ARTcancel and by
sm -r through a new public SMcanceltombstone() helper. Shared
fcntl POSIX lock for appenders, exclusive for expire's snapshot
rename to .processing; one retry on EAGAIN closes the appender
vs consumer race window. POSIX O_APPEND atomicity for sub-
PIPE_BUF writes keeps single-line tokens from interleaving.
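The appender side described for cancels.tombstone can be sketched as follows. This is a minimal illustration, not the patch's SMcanceltombstone(): the function name `tombstone_append`, the 128-byte line buffer, and the file mode are assumptions. It takes a shared, non-blocking fcntl lock, retries once on EAGAIN (or EACCES, which POSIX also permits for a lock conflict), and emits the token with a single write() on an O_APPEND descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one "token\n" line to a tombstone log.
 * Returns 0 on success, -1 on any failure. */
int
tombstone_append(const char *path, const char *token)
{
    char line[128]; /* assumed bound, far below PIPE_BUF */
    struct flock fl;
    int fd, len, tries, rc = -1;

    len = snprintf(line, sizeof(line), "%s\n", token);
    if (len < 0 || (size_t) len >= sizeof(line))
        return -1;

    /* O_RDWR because a shared (F_RDLCK) fcntl lock needs a readable fd. */
    fd = open(path, O_RDWR | O_APPEND | O_CREAT, 0664);
    if (fd < 0)
        return -1;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_RDLCK;    /* shared among appenders */
    fl.l_whence = SEEK_SET; /* l_start = 0 and l_len = 0: whole file */

    for (tries = 0; tries < 2; tries++) {
        if (fcntl(fd, F_SETLK, &fl) == 0) {
            /* One write() call per line so the line goes out in one piece. */
            if (write(fd, line, (size_t) len) == (ssize_t) len)
                rc = 0;
            break;
        }
        if (errno != EAGAIN && errno != EACCES)
            break; /* not a lock conflict: give up */
    }
    close(fd); /* closing the descriptor also drops the lock */
    return rc;
}
```

A failed append only returns -1 here; how the real helper reports or logs the failure is not covered by this sketch.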
expire reads both files into one hashset, treats every article in
either log as already gone, and falls back to per-article SMretrieve
only for storage methods where SMprobe(SELFEXPIRE) is true (CNFS
wrap-around). An empty-but-present tombstone correctly says "no
cancels this cycle" and skips SMretrieve for everything; truly
absent files fall back to the slow path.
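One way to read the decision described above, as a sketch only: the enum, the function name `tombstone_verdict`, and the three boolean inputs are hypothetical stand-ins for the real checks (tombstone file presence, hashset membership, and SMprobe(SELFEXPIRE)), and the ordering of the branches is an interpretation of the text rather than a copy of EXPdoline.

```c
#include <stdbool.h>

/* Hypothetical verdicts for one history entry. */
enum expire_verdict {
    VERDICT_KEEP,          /* trusted to still exist, no SMretrieve */
    VERDICT_DROP,          /* recorded in a tombstone: treat as gone */
    VERDICT_CHECK_STORAGE  /* fall back to per-article SMretrieve */
};

enum expire_verdict
tombstone_verdict(bool tombstone_present, bool in_tombstone,
                  bool self_expiring)
{
    if (!tombstone_present)
        return VERDICT_CHECK_STORAGE; /* absent files: slow path */
    if (in_tombstone)
        return VERDICT_DROP;          /* listed in either log */
    if (self_expiring)
        return VERDICT_CHECK_STORAGE; /* e.g. CNFS wrap-around */
    return VERDICT_KEEP;              /* covers the empty-but-present case */
}
```

With an empty-but-present tombstone, `in_tombstone` is false for every article, so only self-expiring storage methods still reach the storage check.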
The format is line-per-token; readers ignore blank and #-prefixed
lines, and expireover writes "# inn-tombstone v1" as a forward-
compatibility header. Disk and memory footprint scale linearly:
~38 bytes per entry on disk, ~50 bytes in expire's hash table.
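A reader for this line-per-token format could look roughly like the sketch below. The function name `tombstone_parse_line` is hypothetical, and the sketch assumes that "blank" means an empty line; it does not validate the token itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: strip the line terminator in place and return the
 * token, or NULL when the line is blank or a #-prefixed comment such as
 * the "# inn-tombstone v1" header. */
char *
tombstone_parse_line(char *line)
{
    size_t len = strcspn(line, "\r\n");

    line[len] = '\0';
    if (len == 0 || line[0] == '#')
        return NULL;
    return line;
}
```

At the quoted ~38 bytes per entry on disk and ~50 bytes per entry in memory, a log of one million entries would come to roughly 38 MB on disk and about 50 MB in the hash table.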
Tests cover the public library (tombstone_hash_create, _read,
_rename_for_processing, _present), SMcanceltombstone branches
including expiretombstone=false bypass and unwritable-pathdb
non-crash, two-file merge with dedup, comment-line skip,
HISexpire integration with a callback that mirrors EXPdoline's
full decision tree (kept / tombstoned / selfexpire-gone /
selfexpire-alive), and an end-to-end shell test that exercises
sm + cancels.tombstone + expiretombstone toggling.
Default is false; sites should opt in after validating the
option's behavior against their workload.
When both innconf->expiretombstone and PERMaccessconf->nnrpdcheckart
are true, ARTinstorebytoken now consults ${pathdb}/cancels.tombstone
before calling SMretrieve(RETR_STAT). A token recorded in the log is
reported gone without a syscall; a token absent from the log is
trusted to still exist on storage methods that do not self-expire,
again without a syscall. Self-expiring backends (CNFS) still go
through SMretrieve because cyclic-buffer wrap-around bypasses the
tombstone.
The same trade-off as the expire-side fast path applies: the
tombstone catches in-band cancellations (innd ARTcancel, sm -r) but
not out-of-band events (manual filesystem deletes, corruption).
Admins running nnrpdcheckart=true to catch the brief overview-vs-
storage race during innd cancel processing, or to catch sm -r
artifacts before the next expireover cleans overview, still get those
paths. Loss is bounded to events the tombstone subsystem cannot
see, which is consistent with the contract documented for
expiretombstone.
Loading is lazy: the hashset is built on first call into
ARTinstorebytoken and the file is stat()ed on every subsequent call
to detect change, with reload only when mtime or size changes.
Statting a fixed path stays in the dentry cache and is trivially
cheap compared to the per-article SMretrieve syscalls the fast path
elides, so cancellations recorded by other processes become visible
to long-lived connections immediately. Freshness is keyed on both
mtime and size: mtime alone has 1-second granularity, so a rename-
and-recreate within the same second can land a fresh inode with
the cached mtime; size is monotonic-append between rotations and
catches every modification. Partial reads (mid-file ferror) leave
the cached freshness key untouched so the next call retries.
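The freshness check can be sketched like this, assuming a hypothetical `struct freshness` and function `tombstone_check`; neither name comes from the patch. The caller is expected to copy the newly seen key into its cache only after a complete, successful reload, which is how a partial read leaves the cached key untouched.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical freshness key: the pair (mtime, size) of the log file. */
struct freshness {
    bool valid;   /* false until the first successful load */
    time_t mtime;
    off_t size;
};

/* Returns 1 if a reload is needed, 0 if the cached key still matches,
 * and -1 if the file cannot be stat()ed (e.g. it is missing).
 * On 0 or 1, *seen holds the key observed by this call. */
int
tombstone_check(const char *path, const struct freshness *cached,
                struct freshness *seen)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    seen->valid = true;
    seen->mtime = st.st_mtime;
    seen->size = st.st_size;

    if (cached->valid && cached->mtime == seen->mtime
        && cached->size == seen->size)
        return 0;
    return 1;
}
```

A caller would reload when this returns 1 and assign `cached = seen` only once the reload has finished without a read error.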
The SMprobe(SELFEXPIRE) probe is checked before the tombstone
refresh so pure-CNFS sites pay only that probe (a static per-method
attribute, no I/O) and skip both the hash lookup and the cache load
entirely.
Memory cost is dominated by the hashset (~50 bytes per cancel;
typical sites have hundreds of entries). When the file is missing
entirely, expiretombstone is off, or nnrpdcheckart is off, the
original SMretrieve path is unchanged.
inn.conf.pod's nnrpdcheckart entry now describes the fast-path
gating, the mtime/size-driven reload model, the manual-rm caveat
on non-self-expiring backends, and the OVERartcheck counter
becoming an undercount when the fast path is active.
|
A very clean and thoughtful patch. Thanks again Kevin for your work! I'm wondering whether the 2 expiretombstone-without-groupbaseexpiry checks wouldn't be better in the As for the use of |
|
Like for the improvements to Without expiretombstone: With expiretombstone: |
Adds an opt-in (expiretombstone in inn.conf, default false) side-channel that records every cancellation alongside the existing storage and overview deletions, so a subsequent expire run can drop matching history entries without doing per-article SMretrieve(RETR_STAT) syscalls. At million-article scale on tradspool/timehash this turns a million access(2) calls into a few thousand hash lookups.
Two log files cooperate:
${pathdb}/expireover.tombstone: written by expireover after each
successful inline SMcancel; in delayrm mode written up front and
promoted by expirerm after fastrm succeeds. Atomic .NEW -> final
rename under an exclusive non-blocking fcntl POSIX lock. A
crashed-or-failed previous run's leftover .NEW is verified
per-token via SMretrieve and merged into the next run, so partial
rmfile failures cannot orphan articles on disk.
${pathdb}/cancels.tombstone: appended by innd's ARTcancel and by
sm -r through a new public SMcanceltombstone() helper. Shared
fcntl POSIX lock for appenders, exclusive for expire's snapshot
rename to .processing; one retry on EAGAIN closes the appender
vs consumer race window. POSIX O_APPEND atomicity for sub-
PIPE_BUF writes keeps single-line tokens from interleaving.
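The ".NEW then rename" publication of expireover.tombstone can be illustrated with the sketch below. The function name `tombstone_publish`, the path buffer size, and the file mode are assumptions. Unlike the behaviour described above, this simplified version truncates a leftover .NEW instead of verifying and merging it.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write tokens to "<final_path>.NEW" under an
 * exclusive non-blocking fcntl lock, then rename it into place.
 * Returns 0 on success, -1 on failure. */
int
tombstone_publish(const char *final_path, const char *const *tokens,
                  size_t n)
{
    char tmp[1024];
    struct flock fl;
    FILE *f;
    size_t i;
    int fd, len;

    len = snprintf(tmp, sizeof(tmp), "%s.NEW", final_path);
    if (len < 0 || (size_t) len >= sizeof(tmp))
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT, 0664);
    if (fd < 0)
        return -1;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;    /* exclusive */
    fl.l_whence = SEEK_SET; /* whole file */
    /* Truncate only after the lock is held (simplification: no merge). */
    if (fcntl(fd, F_SETLK, &fl) != 0 || ftruncate(fd, 0) != 0) {
        close(fd);
        return -1;
    }

    f = fdopen(fd, "w");
    if (f == NULL) {
        close(fd);
        return -1;
    }

    fputs("# inn-tombstone v1\n", f);
    for (i = 0; i < n; i++)
        fprintf(f, "%s\n", tokens[i]);

    /* Flush before renaming so readers never see a partial final file. */
    if (fflush(f) != 0 || ferror(f) || rename(tmp, final_path) != 0) {
        fclose(f);
        return -1;
    }
    fclose(f); /* releases the lock */
    return 0;
}
```

Because rename() replaces the destination atomically, a reader opening the final path sees either the previous complete file or the new complete file.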
expire reads both files into one hashset, treats every article in either log as already gone, and falls back to per-article SMretrieve only for storage methods where SMprobe(SELFEXPIRE) is true (CNFS wrap-around). An empty-but-present tombstone correctly says "no cancels this cycle" and skips SMretrieve for everything; truly absent files fall back to the slow path.
The format is line-per-token; readers ignore blank and #-prefixed lines, and expireover writes "# inn-tombstone v1" as a forward-compatibility header. Disk and memory footprint scale linearly: ~38 bytes per entry on disk, ~50 bytes in expire's hash table.
Tests cover the public library (tombstone_hash_create, _read, _rename_for_processing, _present), SMcanceltombstone branches including expiretombstone=false bypass and unwritable-pathdb non-crash, two-file merge with dedup, comment-line skip, HISexpire integration with a callback that mirrors EXPdoline's full decision tree (kept / tombstoned / selfexpire-gone / selfexpire-alive), and an end-to-end shell test that exercises sm + cancels.tombstone + expiretombstone toggling.
Default is false; sites should opt in after validating the option's behavior against their workload. Revisit default after wider testing.